[Mpi3-ft] MPI_Comm_validate_all

Joshua Hursey jjhursey at open-mpi.org
Fri Feb 18 07:08:48 CST 2011


Yeah. I will add it to the wiki page (hopefully later today). That way we can capture the current thinking, and hopefully encourage others to verify/refine my assessment of the system.

This is also something we should be sure to explain in our (as a working group) eventual paper on the philosophy and interfaces that we are defining here. More on that later though.

Thanks,
Josh

On Feb 17, 2011, at 8:34 PM, Bronevetsky, Greg wrote:

> Josh, this is great and I think the argument is quite convincing. Can we add it as advice to implementers?
> 
> Greg Bronevetsky
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky at llnl.gov
> http://greg.bronevetsky.com
> 
> 
>> -----Original Message-----
>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>> bounces at lists.mpi-forum.org] On Behalf Of Adam T. Moody
>> Sent: Thursday, February 17, 2011 4:50 PM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
>> 
>> Hi Josh,
>> This is such a key piece of the proposal, and this is a very good
>> explanation of the semantics (assuming others don't find a problem).
>> I'd copy this text to a wiki page (and add links to the referenced
>> papers) so we can point others to this.  I'd place this on its own page
>> or at the bottom of the existing proposal text.
>> -Adam
>> 
>> Joshua Hursey wrote:
>> 
>>> So I think the point of the thread is getting kind of muddled for me. Let
>> me try to restate the problem a bit, and add to the discussion. This is a
>> bit long, but it is an important discussion which we should take care to
>> explain sufficiently so that we are all on the same page.
>>> 
>>> 
>>> It was brought into question that the semantics of MPI_COMM_VALIDATE_ALL()
>> may be impossible to guarantee for the MPI standard text, in particular the
>> semantic that the collective call must "either complete successfully
>> everywhere or return some error everywhere."
>>> 
>>> It was also highlighted that without a consistent, uniform return code
>> (either success or error, never a mix) this function does not have much
>> use. Further, without a fault tolerant consensus protocol it is difficult
>> to create a fault tolerant application; for example, such a function is
>> useful/necessary in building a termination detection protocol (not
>> detecting termination faults, but detecting normal program termination at
>> the end of an algorithm) - see Barborak's paper for a general discussion
>> of the role of consensus in fault-tolerant computing.
>>> 
>>> The purpose of MPI_Comm_validate_all() was to provide users with a
>> fault tolerant consensus protocol, and a uniform way to re-enable
>> collectives on a communicator. The MPI implementation has the option to
>> reorganize collectives at this point if it wishes, but it may not need to
>> - if, say, it provides a full suite of fault tolerant collectives, not
>> just fault aware collectives. Since fault tolerant consensus protocols
>> are difficult and delicate to implement correctly, it is important that
>> the MPI implementation help the application by providing such a feature -
>> and the MPI implementation can then further optimize its implementation
>> for a particular environment.
>>> 
>>> Note that the MPI_Comm_validate_all() function is not the only function
>> that requires uniform return codes. All of the communicator/window/file
>> handle/topology creation functions also have this requirement so that the
>> handle being created is not created some places and not others. All of
>> these functions boil down to a fault tolerant consensus protocol that
>> decides on the return code of the function - deciding the value of the
>> 'outcount' or 'handle' is an extension of this base problem.
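To make that reduction concrete, here is a minimal sketch (invented for illustration, not part of the proposal) of the decision rule itself: each process contributes a local outcome, and the group must deliver one agreed value to every survivor.

```python
# Minimal sketch (invented for illustration): deciding a uniform return
# code is an agreement problem over each process's local outcome. In a
# failure-free run this is just a reduction; the hard part, discussed
# below, is making the agreed decision reach every survivor under faults.

MPI_SUCCESS = 0

def decide_return_code(local_codes):
    """Uniform rule: success only if every participant succeeded;
    otherwise propagate the worst local error code to everyone."""
    if all(c == MPI_SUCCESS for c in local_codes):
        return MPI_SUCCESS
    return max(local_codes)

# Deciding 'outcount' or a new communicator handle is an extension of
# the same base problem: agree on one value, deliver it uniformly.
codes = [MPI_SUCCESS, MPI_SUCCESS, 7, MPI_SUCCESS]   # rank 2 failed locally
print(decide_return_code(codes))   # every rank must return 7
```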
>>> 
>>> 
>>> So back to uniform return codes: it is true that in some environments
>> fault tolerant consensus cannot be provided - in particular, a completely
>> asynchronous environment, per FLP. The FLP paper, by making no
>> assumptions about the relative speeds of processes, notes that timeout
>> values are inappropriate for fault detection; therefore a process that is
>> slow is indistinguishable from a process that is dead. However, we are
>> already making the assumption that we can detect process failure, so we
>> are already putting additional constraints on our conceptual system.
>>> 
>>> So the Dwork paper would say that we are assuming partial synchrony since
>> timeout bounds can exist, but may not be fixed. This paper puts fault
>> sensitivity bounds on various protocols for detecting a set of four classes
>> of faults (though we are only concerned with the fail-stop class - which,
>> in this paper, includes what Barborak calls a Crash Fault). One key to the
>> algorithm is that after a process makes a decision, it continues to
>> participate in the protocol. This allows for a peer process to ask other
>> peers if they decided in a particular round.
>>> 
>>> Alternatively, if we look to the fault detection literature, we find
>> Chandra's paper on unreliable failure detectors, which demonstrates how
>> unreliable failure detectors can be used to create a perfect failure
>> detector even in an asynchronous system. It is from this paper that we
>> get our definitions of completeness and accuracy for the fault detector
>> assurances that we provide to the application. As a counterpoint, there
>> is Chandra's paper on the impossibility of group membership in
>> asynchronous systems, in which the authors caution the reader not to
>> assume that, because failure detection is possible in asynchronous
>> systems, group membership must be possible as well. But here I fall back
>> on my argument that we are actually operating in a partially synchronous
>> environment, not an asynchronous one, so we are making stronger
>> assumptions about the system than these papers allow.
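As a rough illustration of the completeness/accuracy properties under partial synchrony, here is a minimal sketch (all names invented, not from any of the cited papers) of an eventually-perfect, heartbeat-based detector: a false suspicion is retracted and the timeout doubled, so a slow-but-alive peer is eventually trusted, while a crashed peer stays suspected.

```python
# Hypothetical sketch: an eventually-perfect failure detector in a
# partially synchronous model. On a false suspicion the timeout grows,
# so accuracy eventually holds; a crashed process stops sending
# heartbeats, so completeness holds (it stays in the suspected set).

class EventuallyPerfectDetector:
    def __init__(self, peers, initial_timeout=1):
        self.timeout = {p: initial_timeout for p in peers}
        self.last_heartbeat = {p: 0 for p in peers}
        self.suspected = set()

    def heartbeat(self, peer, now):
        self.last_heartbeat[peer] = now
        if peer in self.suspected:
            # False suspicion: the peer was only slow. Retract it and
            # back off the timeout so accuracy eventually holds.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def check(self, now):
        for p, last in self.last_heartbeat.items():
            if now - last > self.timeout[p]:
                self.suspected.add(p)   # completeness: dead peers stay here
        return set(self.suspected)

d = EventuallyPerfectDetector(["p1", "p2"])
d.heartbeat("p1", 0); d.heartbeat("p2", 0)
d.check(3)             # both silent past the timeout -> both suspected
d.heartbeat("p1", 4)   # p1 was merely slow: unsuspect it, double timeout
print(d.check(4))      # only p2 remains suspected
```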
>>> 
>>> 
>>> Let us pull up a little bit to a concrete example of the uniform return
>> code issue, in particular the one suggested by Darius. Assume a set of 4
>> processes {0,1,2,3} that wish to reach agreement on a return code. Assume
>> that they reach agreement on the value Y, and 0 is broadcasting the final
>> result to all processes in an ordered linear manner. Process 1 and 2
>> receive the value Y and continue executing (though still participating in
>> the protocol). Process 0 fails before sending to 3. 3 detects the failure
>> of 0, and asks 1 and 2 if they know what was decided.
>>> 
>>> If there are no other failures, 3 will decide with 1 and 2 that Y is the
>> correct return code. If however both 1 and 2 fail after they returned Y,
>> but before responding to 3's query, what is 3 to do? Since it is the only
>> alive process left in the system it is free to decide either Y or some
>> other value, say X. This is where we get to the crux of the problem, if
>> we look at the original question of the semantic that the call must
>> "either complete successfully everywhere or return some error
>> everywhere." 1 and 2 returned Y, and 3 does not know what to return - it
>> only knows that it is 'undecided' and unable to 'decide'. This is where
>> the blocking vs. nonblocking 2/3-phase commit protocol discussion from
>> the database literature comes in.
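The scenario above can be sketched as a toy model (invented for illustration) in which the coordinator's decision reaches some ranks, the coordinator crashes, and an undecided rank queries the surviving deciders, who keep participating in the protocol after returning (per the Dwork-style algorithm):

```python
# Toy model of the 4-process example: rank 0 decides "Y" and relays it
# linearly; it crashes before reaching rank 3, and 3 asks the survivors.

def broadcast_then_query(decided, delivered_to, crashed):
    """decided: the agreed value; delivered_to: ranks that received it
    before the coordinator crashed; crashed: the set of dead ranks."""
    views = {p: decided for p in delivered_to if p not in crashed}
    undecided = [p for p in (1, 2, 3)
                 if p not in delivered_to and p not in crashed]
    for p in undecided:
        # p asks surviving peers that already decided (they continue to
        # participate in the protocol after returning).
        witnesses = [q for q in views if q != p]
        if witnesses:
            views[p] = views[witnesses[0]]   # learns Y from a peer
        else:
            views[p] = None                  # stuck undecided: the crux case
    return views

# 0 crashes after delivering to 1 and 2: rank 3 recovers Y from a peer.
print(broadcast_then_query("Y", {1, 2}, {0}))
# 1 and 2 also crash before answering 3's query: rank 3 cannot decide.
print(broadcast_then_query("Y", {1, 2}, {0, 1, 2}))
```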
>>> 
>>> According to the semantic, as written, if it is to return from the
>> function, it must return Y since 1 and 2 returned Y before failing.
>> Unfortunately, 3 does not have a mechanism to figure this out. Since 3
>> cannot decide, is it acceptable for it to self-terminate, given that it
>> cannot uphold the semantic? - noting that the protocol only failed after
>> a majority of the participants had failed. This seems like a harsh way
>> to maintain the semantic, but it might be all we are left with. And in
>> an assumed-faulty environment it may be an acceptable loss to provide
>> the necessary consistency guarantees that the application desires from
>> this function.
>>> 
>>> If we were to loosen the semantics, how do we propose that we do so while
>> still providing the application with a usable set of semantics for
>> consistent execution? It may help to think about consistently returning
>> from MPI_Comm_create() instead of MPI_Comm_validate_all().
>>> 
>>> 
>>> There is an escape clause for MPI implementations operating in an
>> environment in which fault tolerant consensus is impossible - the MPI
>> implementation does not have to provide the MPI_Comm_validate_all() (and
>> other similar) functionality, and applications must deal without it. This
>> is similar to how certain environments do not provide MPI_Comm_spawn and
>> friends when they are not supported/appropriate. So in
>> essence, if you cannot provide fault detection capabilities, and are unable
>> to provide fault tolerant consensus then do not provide MPI fault tolerance
>> semantics. However, most environments will be able to do so with various
>> degrees of effort.
>>> 
>>> 
>>> Of course, as may often be the case, I could be incorrect in my
>> assessment of the situation. This is how I have been thinking about the
>> literature as it applies to the proposal, and I recognize that I am not
>> an expert in the field of fault tolerant agreement or consensus
>> protocols, so there may be gaps in my understanding. But I should note
>> that the particular point of this thread is critical to the acceptance
>> of this proposal by the MPI standardization body, so making sure we are
>> all on the same page, with the same understanding, is important.
>>> 
>>> -- Josh
>>> 
>>> P.S. If you are interested in the Paxos algorithm, I would suggest reading
>> Chandra's 2007 paper describing their work with it for Google. It is an
>> interesting read for folks that are trying to actually implement correct
>> consensus algorithms at scale.
>>> 
>>> Referenced Papers (in order of appearance):
>>> -----------------
>>> Barborak: "The consensus problem in fault-tolerant computing" - 1993.
>>> FLP: "Impossibility of distributed consensus with one faulty process" - 1985.
>>> Dwork: "Consensus in the Presence of Partial Synchrony" - 1988.
>>> Chandra: "Unreliable failure detectors for reliable distributed systems" - 1996.
>>> Chandra: "On the impossibility of group membership" - 1996.
>>> Chandra: "Paxos made live: an engineering perspective" - 2007.
>>> 
>>> 
>>> On Feb 16, 2011, at 4:43 PM, Solt, David George wrote:
>>> 
>>> 
>>> 
>>>> In our implementation, we can't guarantee consensus (since it is not
>> possible), but the successful ranks are aware of which ranks may have
>> reached an incompatible conclusion, so they proactively "break" the
>> virtual connections to those processes; when the known-failed ranks
>> attempt to use their wrong communicator, they will get failures rather
>> than hang. I couldn't come up with a way, however, to ensure that all
>> ranks have perfect consensus in the presence of arbitrary failures.
>>>> 
>>>> For example, it is possible that 0,1,2,3 call regroup.  Due to a late
>> failure during the algorithm, 0 thinks the group is {0,1}, 1 thinks the
>> group is {0,1} and 2 thinks the group is {1,2} and 3 thinks it is not part
>> of the new group. In this case, ranks 0 and 1 will close the virtual
>> connections to ranks 2 and 3, so rank 2 will not hang when it tries to
>> use its invalid group.
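A minimal sketch of the connection-breaking strategy just described (all names invented): ranks that completed the regroup sever links to every rank outside their agreed group, so a rank holding a stale view fails fast instead of hanging.

```python
# Hypothetical model of the fail-fast strategy: after an inconsistent
# regroup, the ranks with the 'real' outcome break their virtual
# connections to excluded ranks; traffic over a broken link then
# errors out instead of blocking forever.

def sever_links(views, agreed_ranks):
    """views: rank -> the group that rank believes in after regroup.
    agreed_ranks: ranks that reached the real outcome."""
    group = views[agreed_ranks[0]]
    broken = set()
    for r in agreed_ranks:
        for peer in views:
            if peer not in group:
                broken.add((r, peer))   # future traffic on this link fails
    return broken

# Ranks 0 and 1 agree on {0,1}; rank 2 is left with the stale view {1,2}.
views = {0: {0, 1}, 1: {0, 1}, 2: {1, 2}}
links = sever_links(views, [0, 1])
# Any message rank 2 sends over a broken link raises instead of hanging.
print(sorted(links))
```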
>>>> 
>>>> Our assumption is that once a rank is excluded from the group, it cannot
>> be part of the comm ever again.  (It could use connect/accept/comm_merge to
>> join the other processes using a new communicator, but it cannot attempt to
>> regroup the original communicator again).
>>>> 
>>>> I agree that a return code of "the regroup had problems, please try
>> again" makes no sense and cannot be useful.
>>>> 
>>>> Dave
>>>> 
>>>> -----Original Message-----
>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>> bounces at lists.mpi-forum.org] On Behalf Of Thomas Herault
>>>> Sent: Wednesday, February 16, 2011 3:35 PM
>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>> Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
>>>> 
>>>> If we allow the call to return successfully at some nodes and an error
>> at others, we defeat the reason for this call's existence.
>>>> 
>>>> If some processes detect the failure and others don't, some will enter
>> the call (say A detected the failure and entered validate again, to
>> acknowledge it), while others (B) will enter other communications, e.g.
>> MPI_Recv(A). That receive will never return an error, because
>> communication with A is legitimate; but A never does the send, since it
>> is trying to revalidate the communicator - which it cannot, because B
>> does not enter the call. The MPI application is erroneous, but it could
>> not have been written correctly: consensus semantics on at least one
>> collective operation is required to allow for a collective repair.
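The mismatch can be modeled as a toy matching check (invented for illustration): progress requires both sides to be in compatible calls, and the validate/recv pairing never matches, so neither side can complete.

```python
# Toy model of the deadlock above: A is stuck in the collective
# validate call while B, unaware of the failure, is stuck in a receive
# from A. Neither call's matching condition can ever be satisfied.

def can_progress(a_state, b_state):
    """True iff the two processes are in calls that can match."""
    if a_state == "validate" and b_state == "validate":
        return True   # collective matches on both sides: repair completes
    if a_state == "send" and b_state == "recv":
        return True   # point-to-point send/recv pair matches
    return False      # mismatched calls: both block forever

print(can_progress("validate", "validate"))  # consistent entry: progress
print(can_progress("validate", "recv"))      # the deadlock in the text
```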
>>>> 
>>>> Thomas
>>>> 
>>>> Le 16 févr. 2011 à 16:24, Bronevetsky, Greg a écrit :
>>>> 
>>>> 
>>>> 
>>>>> Actually, I think Darius has a point. The exact guarantee is
>> impossible in the general case because it is reducible to the consensus
>> problem. Unfortunately, the spec has to assume the general case, while
>> databases don't need to and can assume synchronous communication or
>> bounds on message delivery times. I think it will be safer to use
>> Darius' suggestion: guaranteed to return the same thing on processes
>> where it does return something.
>>>>> 
>>>>> Greg Bronevetsky
>>>>> Lawrence Livermore National Lab
>>>>> (925) 424-5756
>>>>> bronevetsky at llnl.gov
>>>>> http://greg.bronevetsky.com
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>>>>>> bounces at lists.mpi-forum.org] On Behalf Of Joshua Hursey
>>>>>> Sent: Wednesday, February 16, 2011 1:17 PM
>>>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>>> Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
>>>>>> 
>>>>>> It is a challenging guarantee to provide, but possible. Databases
>>>>>> need to make decisions like this all the time with transactions
>>>>>> (commit = success, or abort = failure). Though database transaction
>>>>>> protocols are a good place to start, we can likely loosen some of
>>>>>> the restrictions since we are applying them to a slightly different
>>>>>> environment.
>>>>>> 
>>>>>> Look at a two-phase commit protocol that includes a termination
>>>>>> protocol (Gray), or a three-phase commit protocol (Skeen). The trick
>>>>>> is that you really want what the literature calls a 'nonblocking'
>>>>>> commit protocol, meaning that it will not block in an undecided
>>>>>> state waiting for the recovery of a peer process that might be able
>>>>>> to decide from a recovery log. There are a few other more scalable
>>>>>> approaches out there, but they are challenging to implement
>>>>>> correctly.
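A minimal sketch (names invented; a model, not an implementation) of a two-phase-commit decision plus the cooperative termination step described above: a participant left undecided by a coordinator crash polls its peers for the decision rather than blocking until the coordinator recovers.

```python
# Toy 2PC round with a cooperative termination protocol. The
# coordinator commits only if every vote is 'yes'; a participant that
# never receives the decision (coordinator crashed first) queries the
# other participants instead of blocking on coordinator recovery.

def two_phase_commit(votes, crash_before_decide_msg_to=()):
    """votes: participant -> 'yes'/'no'. Returns each participant's view
    of the outcome after the termination protocol runs."""
    decision = "commit" if all(v == "yes" for v in votes.values()) else "abort"
    views = {}
    for p in votes:
        if p in crash_before_decide_msg_to:
            views[p] = None          # coordinator died before telling p
        else:
            views[p] = decision
    # Termination protocol: undecided participants poll the others.
    for p, v in list(views.items()):
        if v is None:
            known = [w for w in views.values() if w is not None]
            views[p] = known[0] if known else None  # blocked if none know
    return views

# Coordinator crashes before informing "c"; "c" learns the decision
# from its peers, so all survivors return the same outcome.
print(two_phase_commit({"a": "yes", "b": "yes", "c": "yes"},
                       crash_before_decide_msg_to={"c"}))
```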
>>>>>> 
>>>>>> -- Josh
>>>>>> 
>>>>>> Gray: Notes on Data Base Operating Systems (note this describes a
>>>>>> protocol without the termination protocol, but a database text
>>>>>> should be able to fill in that part) - 1979
>>>>>> 
>>>>>> Skeen: Nonblocking commit protocols - 1981
>>>>>> 
>>>>>> On Feb 16, 2011, at 3:49 PM, Darius Buntinas wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> MPI_Comm_validate_all, according to the proposal at [1], must
>>>>>>> "either complete successfully everywhere or return some error
>>>>>>> everywhere."  Is this possible to guarantee?  What about process
>>>>>>> failures during the call?  Consider the last message sent in the
>>>>>>> protocol.  If the process sending that message dies just before
>>>>>>> sending it, the receiver will not know whether to return success
>>>>>>> or failure.
>>>>>> 
>>>>>> 
>>>>>>> I think that the best we can do is say that the outcount and list
>>>>>>> of collectively-detected dead processes will be the same at all
>>>>>>> processes where the call completed successfully.
>>>>>> 
>>>>>> 
>>>>>>> Or is there a trick I'm missing?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> -d
>>>>>>> 
>>>>>>> [1] https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization#CollectiveValidationOperations
>>>>>> 
>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> mpi3-ft mailing list
>>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> ------------------------------------
>>>>>> Joshua Hursey
>>>>>> Postdoctoral Research Associate
>>>>>> Oak Ridge National Laboratory
>>>>>> http://users.nccs.gov/~jjhursey
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> ------------------------------------
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>> 
>>> 
>>> 
>>> 
>> 
> 
> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




