[Mpi3-ft] MPI_Comm_validate_all

Thu Feb 17 19:34:45 CST 2011

Josh, this is great and I think the argument is quite convincing. Can we add it as advice to implementers?

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Adam T. Moody
> Sent: Thursday, February 17, 2011 4:50 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
>
> Hi Josh,
> This is such a key piece of the proposal, and this is a very good
> explanation of the semantics (assuming others don't find a problem).
> I'd copy this text to a wiki page (and add links to the referenced
> papers) so we can point others to this.  I'd place this on its own page
> or at the bottom of the existing proposal text.
> -Adam
>
> Joshua Hursey wrote:
>
> >So I think the point of the thread is getting kind of muddled for me. Let
> me try to restate the problem a bit, and add to the discussion. This is a
> bit long, but it is an important discussion which we should to take care to
> explain sufficiently so that we are all on the same page.
> >
> >
> >It was brought into question that the semantics of MPI_COMM_VALIDATE_ALL()
> may be impossible to guarantee for the MPI standard text, in particular the
> semantic that the collective call must "either complete successfully
> everywhere or return some error everywhere."
> >
> >It was also highlighted that without a consistent, uniform return code
> (either success or error, never a mix) this function does not have much
> use. Further, without a fault tolerant consensus protocol it is difficult
> to create a fault tolerant application, for example such a function is
> useful/necessary in building a termination detection protocol (not
> termination faults, but normal program termination at the end of an
> algorithm) - see Barborak's paper for a general discussion of the role of
> consensus in fault-tolerant computing.
> >
> >The purpose of the MPI_Comm_validate_all() was to provide users with a
> fault tolerant consensus protocol, and a uniform way to re-enable
> collectives on a communicator. The MPI implementation has the option to
> reorganize collectives at this point if it wishes, but may not need to - if
> say it provides a full suite of fault tolerant collectives, not just fault
> aware collectives. Since fault tolerant consensus protocols are difficult
> and delicate to implement correctly, it is important that the MPI
> implementation help the application by providing such a feature - and the
> MPI implementation can then further optimize its implementation for a
> particular environment.
> >
> >Note that the MPI_Comm_validate_all() function is not the only function
> that requires uniform return codes. All of the communicator/window/file
> handle/topology creation functions also have this requirement so that the
> handle being created is not created some places and not others. All of
> these functions boil down to a fault tolerant consensus protocol that
> decides on the return code of the function - deciding the value of the
> 'outcount' or 'handle' is an extension of this base problem.
> >
> >
> >So back to uniform return codes, it is true that in some environments
> fault tolerant consensus cannot be provided - in particular a completely
> asynchronous environment, per FLP. The FLP paper by making no assumptions
> about the relative speeds of processors notes that timeout values are
> inappropriate for fault detection, therefore a process that is slow is
> indistinguishable from a process that is dead. However, we are already
> making the assumption that we can detect process failure, so we are already
> putting additional constraints on our conceptual system.
> >
> >So the Dwork paper would say that we are assuming partial synchrony since
> timeout bounds can exist, but may not be fixed. This paper puts fault
> sensitivity bounds on various protocols for detecting a set of four classes
> of faults (though we are only concerned with the fail-stop class - which,
> in this paper, includes what Barborak calls a Crash Fault). One key to the
> algorithm is that after a process makes a decision, it continues to
> participate in the protocol. This allows for a peer process to ask other
> peers if they decided in a particular round.
> >
> >Alternatively, if look to the fault detection literature we find Chandra's
> paper on unreliable failure detectors in which they demonstrate how
> unreliable failure detectors can be used to create a perfect failure
> detector even in an asynchronous system. It is from this paper that we get
> our definitions of completeness and accuracy for our fault detector
> assurances that we provide the application. As alternative to this those is
> Chandra's paper on the impossibly of group membership for asynchronous
> systems in which the authors caution the reader to not assume that since
> failure detection is possible in asynchronous systems, group membership may
> not be. But for this I fall back to my argument that we are actually
> operating in a partially synchronous environment, not an asynchronous
> environment so we are making further assumptions about the system than
> these papers allow.
> >
> >
> >Let us pull up a little bit to a concrete example of the uniform return
> code issue, in particular the one suggested by Darius. Assume a set of 4
> processes {0,1,2,3} that wish to reach agreement on a return code. Assume
> that they reach agreement on the value Y, and 0 is broadcasting the final
> result to all processes in an ordered linear manner. Process 1 and 2
> receive the value Y and continue executing (though still participating in
> the protocol). Process 0 fails before sending to 3. 3 detects the failure
> of 0, and asks 1 and 2 if they know what was decided.
> >
> >If there are not other failures, 3 will decide with 1 and 2 that Y is the
> correct return code. If however both 1 and 2 fail after they returned Y,
> but before responding to 3's query, what is 3 to do? Since it is the only
> alive process left in the system it is free to decide either Y or some
> other value, say X. This is where we get to the crux of the problem -  if
> we look at the original question of the semantic that the call must "either
> complete successfully everywhere or return some error everywhere." 1 and 2
> returned Y, and 3 does not know what to return - it does know that it is
> 'undecided' and unable to 'decide'. This is where the blocking vs. non-
> blocking 2/3-phase commit protocol discussion comes in with databases.
> >
> >According to the semantic, as written, if it is to return from the
> function, it must return Y since 1 and 2 returned Y before failing.
> Unfortunately, 3 does not have a mechanism to figure this out. Since 3
> cannot decide, is it acceptable for it to self terminate since it cannot
> provide the semantic? - noticing that the protocol did fail after a
> majority of the participants failed. This seems like a harsh way to
> maintain the semantic, but might be all we are left with. And in an assumed
> faulty environment may be an acceptable loss to provide the necessary
> consistency guarentees that the application is desiring from this function.
> >
> >If we were to loosen the semantics, how do we propose that we do so while
> still providing the application with a usable set of semantics for
> consistent execution? It may help to think about consistently returning
> from MPI_Comm_create() instead of MPI_Comm_validate_all().
> >
> >
> >There is an escape clause for MPI implementations operating in an
> environment in which fault tolerant consensus is impossible - the MPI
> implementation does not have to provide the MPI_Comm_validate_all() (and
> other similar) functionality, and applications must deal without it. This
> is similar to how certain environments do not provide comm_spwan and
> friends for environments in which they are not supported/appropriate. So in
> essence, if you cannot provide fault detection capabilities, and are unable
> to provide fault tolerant consensus then do not provide MPI fault tolerance
> semantics. However, most environments will be able to do so with various
> degrees of effort.
> >
> >
> >Of course, as may often be the case, I can be incorrect in my assessment
> of the situation. This is now I have been thinking about the literature as
> it applies to the proposal, and recognize that I am not an expert in the
> field of fault tolerant agreement or consensus protocols so there may be
> gaps in my understanding or assessment of the situation. But I should note
> that the particular point of this thread is critical to the success or
> failure of the acceptance of this proposal to the MPI standardization body.
> So making sure we are all on the same page, with the same understanding is
> important.
> >
> >-- Josh
> >
> >P.S. If you are interested in the Paxos algorithm, I would suggest reading
> Chandra's 2007 paper describing their work with it for Google. It is an
> interesting read for folks that are trying to actually implement correct
> consensus algorithms at scale.
> >
> >Referenced Papers (in order of appearance):
> >-----------------
> >Barborak: "The consensus problem in fault-tolerant computing" - 1993.
> >FLP: "Impossibility of distributed consensus with one faulty process" -
> 1985.
> >Dwork: "Consensus in the Presence of Partial Synchrony" - 1988.
> >Chandra: "Unreliable failure detectors for reliable distributed systems" -
> 1996.
> >Chandra: "On the impossibility of group membership" - 1996.
> >Chandra: "Paxos made live: and engineering perspective" - 2007.
> >
> >
> >On Feb 16, 2011, at 4:43 PM, Solt, David George wrote:
> >
> >
> >
> >>In our implementation, we can't guarantee consensus (since it is not
> possible), but the successful ranks are aware of which ranks may have
> reached an incompatible conclusion and so they proactively "break" the
> virtual connections to those processes so that when they known failed ranks
> attempt to use their wrong communicator, they will get failures and not
> hang.   I couldn't come up with a way however to ensure that all ranks have
> perfect consensus in the presence of arbitrary failures.
> >>
> >>For example, it is possible that 0,1,2,3 call regroup.  Due to a late
> failure during the algorithm, 0 thinks the group is {0,1}, 1 thinks the
> group is {0,1} and 2 thinks the group is {1,2} and 3 thinks it is not part
> of the new group.   In this case, rank 0-1 will close the virtual
> connections to ranks 2 and 3, so rank 2 will not hang when it tries to use
> its invalid group.
> >>
> >>Our assumption is that once a rank is excluded from the group, it cannot
> be part of the comm ever again.  (It could use connect/accept/comm_merge to
> join the other processes using a new communicator, but it cannot attempt to
> regroup the original communicator again).
> >>
> >>I agree that a return code of "the regroup had problems, please try
> again" makes no sense and cannot be useful.
> >>
> >>Dave
> >>
> >>-----Original Message-----
> >>From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Thomas Herault
> >>Sent: Wednesday, February 16, 2011 3:35 PM
> >>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> >>Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
> >>
> >>If we allow the call to return successfully at some nodes, and an error
> at others, we defeat the reason of existence of this call.
> >>
> >>If some of them detect the failure, and others don't, some will enter the
> call (let say A detected the failure, and entered validate again, to
> acknowledge it), others (B) will enter other communications, e.g.
> mpi_recv(A), which will never return an error, because communication with A
> is legitimate, but A is not doing the send, it's trying to revalidate the
> communicator, which it cannot, because B does not enter the call. The MPI
> application is erroneous, but could not have been correct: consensus
> semantics on at least one collective operation is required to allow for a
> collective repair.
> >>
> >>Thomas
> >>
> >>Le 16 févr. 2011 à 16:24, Bronevetsky, Greg a écrit :
> >>
> >>
> >>
> >>>Actually, I think Darius has a point. The exact guarantee in impossible
> in the general case because its reducible to the consensus problem.
> Unfortunately, the spec has to assume the general case, while databases
> don't need to and can assume synchronous communication or bounds on message
> delivery times. I think it'll be safer to use Darius' suggestion:
> guaranteed to return the same thing on processes where it does return
> something.
> >>>
> >>>Greg Bronevetsky
> >>>Lawrence Livermore National Lab
> >>>(925) 424-5756
> >>>bronevetsky at llnl.gov
> >>>http://greg.bronevetsky.com
> >>>
> >>>
> >>>
> >>>
> >>>>-----Original Message-----
> >>>>From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> >>>>bounces at lists.mpi-forum.org] On Behalf Of Joshua Hursey
> >>>>Sent: Wednesday, February 16, 2011 1:17 PM
> >>>>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> >>>>Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
> >>>>
> >>>>It is a challenging guarantee to provide, but possible. Databases need
> to
> >>>>make decisions like this all time with transactions (commit=success, or
> >>>>abort=failure). Though database transaction protocols are a good place
> to
> >>>>start, we can likely loosen some of the restrictions since we are
> applying
> >>>>them to a slightly different environment.
> >>>>
> >>>>Look at a two-phase commit protocol that includes a termination
> protocol
> >>>>(Grey), or a three-phase commit protocol (Skeen). The trick is that you
> >>>>really want what the literature calls a 'nonblocking' commit protocol,
> >>>>meaning that it will not block in an undecided state waiting for the
> >>>>recovery of a peer process that might be able to decide from a recovery
> >>>>log. There are a few other more scalable approaches out there, but are
> >>>>challenging to implement correctly.
> >>>>
> >>>>-- Josh
> >>>>
> >>>>Gray: Notes on Data Base Operating Systems (note this describes a
> protocol
> >>>>without the termination protocol, but a databases text should be able
> to
> >>>>fill in that part) - 1979
> >>>>
> >>>>Skeen: Nonblocking commit protocols - 1981
> >>>>
> >>>>On Feb 16, 2011, at 3:49 PM, Darius Buntinas wrote:
> >>>>
> >>>>
> >>>>
> >>>>>MPI_Comm_validate_all, according to the proposal at [1], must "either
> >>>>>
> >>>>>
> >>>>complete successfully everywhere or return some error everywhere."  Is
> this
> >>>>possible to guarantee?  What about process failures during the call?
> >>>>Consider the last message sent in the protocol.  If the process sending
> >>>>that message dies just before sending it, the receiver will not know
> >>>>whether to return success or failure.
> >>>>
> >>>>
> >>>>>I think that the best we can do is say that the outcount and list of
> >>>>>
> >>>>>
> >>>>collectively-detected dead processes will be the same at all processes
> >>>>where the call completed successfully.
> >>>>
> >>>>
> >>>>>Or is there a trick I'm missing?
> >>>>>
> >>>>>Thanks,
> >>>>>-d
> >>>>>
> >>>>>[1] https://svn.mpi-forum.org/trac/mpi-forum-
> >>>>>
> >>>>>
> >>>>web/wiki/ft/run_through_stabilization#CollectiveValidationOperations
> >>>>
> >>>>
> >>>>>_______________________________________________
> >>>>>mpi3-ft mailing list
> >>>>>mpi3-ft at lists.mpi-forum.org
> >>>>>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>>>>
> >>>>>
> >>>>>
> >>>>------------------------------------
> >>>>Joshua Hursey
> >>>>Postdoctoral Research Associate
> >>>>Oak Ridge National Laboratory
> >>>>http://users.nccs.gov/~jjhursey
> >>>>
> >>>>
> >>>>_______________________________________________
> >>>>mpi3-ft mailing list
> >>>>mpi3-ft at lists.mpi-forum.org
> >>>>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>>>
> >>>>
> >>>_______________________________________________
> >>>mpi3-ft mailing list
> >>>mpi3-ft at lists.mpi-forum.org
> >>>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>>
> >>>
> >>_______________________________________________
> >>mpi3-ft mailing list
> >>mpi3-ft at lists.mpi-forum.org
> >>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>
> >>_______________________________________________
> >>mpi3-ft mailing list
> >>mpi3-ft at lists.mpi-forum.org
> >>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>
> >>
> >>
> >
> >------------------------------------
> >Joshua Hursey
> >Postdoctoral Research Associate
> >Oak Ridge National Laboratory
> >http://users.nccs.gov/~jjhursey
> >
> >
> >_______________________________________________
> >mpi3-ft mailing list
> >mpi3-ft at lists.mpi-forum.org
> >http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >
> >
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft