[Mpi3-ft] MPI_Comm_validate_all

Bronevetsky, Greg bronevetsky1 at llnl.gov
Wed Feb 16 15:24:38 CST 2011

Actually, I think Darius has a point. The exact guarantee is impossible in the general case because it's reducible to the consensus problem. Unfortunately, the spec has to assume the general case, while databases don't need to and can assume synchronous communication or bounds on message delivery times. I think it'll be safer to use Darius' suggestion: the call is guaranteed to return the same result on every process where it returns anything at all.
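
To make the weaker guarantee concrete, here is a toy single-process simulation, not MPI code and not anything from the proposal. It assumes a hypothetical coordinator (rank 0) that decides the failed set and sends it to the other live ranks one at a time; the names simulate_validate, crash_after_sends and agree_where_returned are made up for illustration. If the coordinator dies partway through, some ranks never get a result, but every rank that does return holds the same set.

```python
from typing import FrozenSet, List, Optional


def simulate_validate(num_procs: int,
                      failed: FrozenSet[int],
                      crash_after_sends: Optional[int] = None
                      ) -> List[Optional[FrozenSet[int]]]:
    """Toy model of one validation round (hypothetical, not the MPI proposal).

    Rank 0 decides the failed set and sends it to each live rank in order.
    If crash_after_sends is k, rank 0 dies just before its (k+1)-th send,
    so the remaining ranks never learn the decision and return None.
    """
    decision = frozenset(failed)
    results: List[Optional[FrozenSet[int]]] = [None] * num_procs
    sends_done = 0
    for rank in range(1, num_procs):
        if rank in failed:
            continue  # already-dead ranks receive nothing and return nothing
        if crash_after_sends is not None and sends_done >= crash_after_sends:
            break  # coordinator crashed; later ranks stay undecided
        results[rank] = decision
        sends_done += 1
    else:
        # The loop finished without a crash, so the coordinator returns too.
        results[0] = decision
    return results


def agree_where_returned(results: List[Optional[FrozenSet[int]]]) -> bool:
    """True if every rank that returned a result returned the same one."""
    returned = [r for r in results if r is not None]
    return all(r == returned[0] for r in returned)
```

For example, simulate_validate(4, frozenset({3}), crash_after_sends=1) leaves ranks 0 and 2 without a result, yet agree_where_returned still holds. What this sketch cannot give, and what Darius' last-message argument rules out in general, is every live rank returning.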

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov

> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Joshua Hursey
> Sent: Wednesday, February 16, 2011 1:17 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
> It is a challenging guarantee to provide, but possible. Databases need to
> make decisions like this all the time with transactions (commit=success, or
> abort=failure). Though database transaction protocols are a good place to
> start, we can likely loosen some of the restrictions since we are applying
> them to a slightly different environment.
> Look at a two-phase commit protocol that includes a termination protocol
> (Gray), or a three-phase commit protocol (Skeen). The trick is that you
> really want what the literature calls a 'nonblocking' commit protocol,
> meaning that it will not block in an undecided state waiting for the
> recovery of a peer process that might be able to decide from a recovery
> log. There are a few other, more scalable approaches out there, but they
> are challenging to implement correctly.
> -- Josh
> Gray: Notes on Data Base Operating Systems (note this describes a protocol
> without the termination protocol, but a database textbook should be able
> to fill in that part) - 1979
> Skeen: Nonblocking commit protocols - 1981
> On Feb 16, 2011, at 3:49 PM, Darius Buntinas wrote:
> >
> > MPI_Comm_validate_all, according to the proposal at [1], must "either
> complete successfully everywhere or return some error everywhere."  Is this
> possible to guarantee?  What about process failures during the call?
> Consider the last message sent in the protocol.  If the process sending
> that message dies just before sending it, the receiver will not know
> whether to return success or failure.
> >
> > I think that the best we can do is say that the outcount and list of
> collectively-detected dead processes will be the same at all processes
> where the call completed successfully.
> >
> > Or is there a trick I'm missing?
> >
> > Thanks,
> > -d
> >
> > [1] https://svn.mpi-forum.org/trac/mpi-forum-
> web/wiki/ft/run_through_stabilization#CollectiveValidationOperations
> > _______________________________________________
> > mpi3-ft mailing list
> > mpi3-ft at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
