jjhursey at open-mpi.org
Wed Feb 16 15:16:50 CST 2011
It is a challenging guarantee to provide, but possible. Databases need to make decisions like this all the time with transactions (commit=success, or abort=failure). Though database transaction protocols are a good place to start, we can likely loosen some of the restrictions since we are applying them to a slightly different environment.
Look at a two-phase commit protocol that includes a termination protocol (Gray), or a three-phase commit protocol (Skeen). The trick is that you really want what the literature calls a 'nonblocking' commit protocol, meaning that it will not block in an undecided state waiting for the recovery of a peer process that might be able to decide from a recovery log. There are a few other, more scalable approaches out there, but they are challenging to implement correctly.
Gray: Notes on Data Base Operating Systems (note this describes a protocol without the termination protocol, but a databases text should be able to fill in that part) - 1979
Skeen: Nonblocking commit protocols - 1981
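To make the 'nonblocking' distinction concrete, here is a minimal sketch of the decision rule that surviving processes might apply after the coordinator fails. It is a toy simulation, not MPI code and not the proposal's algorithm. It assumes fail-stop failures and no network partitions, and the names State, terminate_2pc, and terminate_3pc are hypothetical. The rules follow the commonly described textbook termination protocols for two-phase and three-phase commit.

```python
from enum import Enum


class State(Enum):
    INITIAL = "initial"      # has not voted yet
    UNCERTAIN = "uncertain"  # voted yes, has not seen a decision
    PRECOMMIT = "precommit"  # three-phase only: knows that everyone voted yes
    COMMITTED = "committed"
    ABORTED = "aborted"


def terminate_2pc(survivors):
    """Cooperative termination for two-phase commit.

    survivors maps a process id to its local State (hypothetical layout).
    Returns "commit", "abort", or "blocked".
    """
    states = set(survivors.values())
    if State.COMMITTED in states:
        return "commit"
    if State.ABORTED in states or State.INITIAL in states:
        # A process that never voted cannot have been counted in a commit.
        return "abort"
    # Everyone is uncertain: the failed coordinator may have decided either way,
    # so the survivors must wait for it to recover.
    return "blocked"


def terminate_3pc(survivors):
    """Termination for three-phase commit; always reaches a decision."""
    states = set(survivors.values())
    if State.ABORTED in states:
        return "abort"
    if State.COMMITTED in states or State.PRECOMMIT in states:
        # Some survivor knows that all votes were yes, so commit is safe.
        return "commit"
    # No survivor reached precommit, so no process can have committed yet.
    return "abort"


# The scenario from the question: the coordinator dies just before sending
# its final message, so every survivor is still uncertain.
survivors = {1: State.UNCERTAIN, 2: State.UNCERTAIN, 3: State.UNCERTAIN}
print("2PC:", terminate_2pc(survivors))
print("3PC:", terminate_3pc(survivors))
```

The extra precommit round is what lets the survivors rule out a commit they have not heard about, at the cost of one more message exchange per decision.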
On Feb 16, 2011, at 3:49 PM, Darius Buntinas wrote:
> MPI_Comm_validate_all, according to the proposal linked below, must "either complete successfully everywhere or return some error everywhere." Is this possible to guarantee? What about process failures during the call? Consider the last message sent in the protocol. If the process sending that message dies just before sending it, the receiver will not know whether to return success or failure.
> I think that the best we can do is say that the outcount and list of collectively-detected dead processes will be the same at all processes where the call completed successfully.
> Or is there a trick I'm missing?
>  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization#CollectiveValidationOperations
Postdoctoral Research Associate
Oak Ridge National Laboratory