[Mpi3-ft] MPI_Comm_validate_all

Joshua Hursey jjhursey at open-mpi.org
Thu Feb 17 08:59:36 CST 2011

So I think the point of the thread is getting kind of muddled for me. Let me try to restate the problem a bit, and add to the discussion. This is a bit long, but it is an important discussion which we should to take care to explain sufficiently so that we are all on the same page.

It was brought into question that the semantics of MPI_COMM_VALIDATE_ALL() may be impossible to guarantee for the MPI standard text, in particular the semantic that the collective call must "either complete successfully everywhere or return some error everywhere."

It was also highlighted that without a consistent, uniform return code (either success or error, never a mix) this function does not have much use. Further, without a fault tolerant consensus protocol it is difficult to create a fault tolerant application, for example such a function is useful/necessary in building a termination detection protocol (not termination faults, but normal program termination at the end of an algorithm) - see Barborak's paper for a general discussion of the role of consensus in fault-tolerant computing.

The purpose of the MPI_Comm_validate_all() was to provide users with a fault tolerant consensus protocol, and a uniform way to re-enable collectives on a communicator. The MPI implementation has the option to reorganize collectives at this point if it wishes, but may not need to - if say it provides a full suite of fault tolerant collectives, not just fault aware collectives. Since fault tolerant consensus protocols are difficult and delicate to implement correctly, it is important that the MPI implementation help the application by providing such a feature - and the MPI implementation can then further optimize its implementation for a particular environment.

Note that the MPI_Comm_validate_all() function is not the only function that requires uniform return codes. All of the communicator/window/file handle/topology creation functions also have this requirement so that the handle being created is not created some places and not others. All of these functions boil down to a fault tolerant consensus protocol that decides on the return code of the function - deciding the value of the 'outcount' or 'handle' is an extension of this base problem.

So back to uniform return codes, it is true that in some environments fault tolerant consensus cannot be provided - in particular a completely asynchronous environment, per FLP. The FLP paper by making no assumptions about the relative speeds of processors notes that timeout values are inappropriate for fault detection, therefore a process that is slow is indistinguishable from a process that is dead. However, we are already making the assumption that we can detect process failure, so we are already putting additional constraints on our conceptual system.

So the Dwork paper would say that we are assuming partial synchrony since timeout bounds can exist, but may not be fixed. This paper puts fault sensitivity bounds on various protocols for detecting a set of four classes of faults (though we are only concerned with the fail-stop class - which, in this paper, includes what Barborak calls a Crash Fault). One key to the algorithm is that after a process makes a decision, it continues to participate in the protocol. This allows for a peer process to ask other peers if they decided in a particular round.

Alternatively, if look to the fault detection literature we find Chandra's paper on unreliable failure detectors in which they demonstrate how unreliable failure detectors can be used to create a perfect failure detector even in an asynchronous system. It is from this paper that we get our definitions of completeness and accuracy for our fault detector assurances that we provide the application. As alternative to this those is Chandra's paper on the impossibly of group membership for asynchronous systems in which the authors caution the reader to not assume that since failure detection is possible in asynchronous systems, group membership may not be. But for this I fall back to my argument that we are actually operating in a partially synchronous environment, not an asynchronous environment so we are making further assumptions about the system than these papers allow.

Let us pull up a little bit to a concrete example of the uniform return code issue, in particular the one suggested by Darius. Assume a set of 4 processes {0,1,2,3} that wish to reach agreement on a return code. Assume that they reach agreement on the value Y, and 0 is broadcasting the final result to all processes in an ordered linear manner. Process 1 and 2 receive the value Y and continue executing (though still participating in the protocol). Process 0 fails before sending to 3. 3 detects the failure of 0, and asks 1 and 2 if they know what was decided.

If there are not other failures, 3 will decide with 1 and 2 that Y is the correct return code. If however both 1 and 2 fail after they returned Y, but before responding to 3's query, what is 3 to do? Since it is the only alive process left in the system it is free to decide either Y or some other value, say X. This is where we get to the crux of the problem -  if we look at the original question of the semantic that the call must "either complete successfully everywhere or return some error everywhere." 1 and 2 returned Y, and 3 does not know what to return - it does know that it is 'undecided' and unable to 'decide'. This is where the blocking vs. non-blocking 2/3-phase commit protocol discussion comes in with databases.

According to the semantic, as written, if it is to return from the function, it must return Y since 1 and 2 returned Y before failing. Unfortunately, 3 does not have a mechanism to figure this out. Since 3 cannot decide, is it acceptable for it to self terminate since it cannot provide the semantic? - noticing that the protocol did fail after a majority of the participants failed. This seems like a harsh way to maintain the semantic, but might be all we are left with. And in an assumed faulty environment may be an acceptable loss to provide the necessary consistency guarentees that the application is desiring from this function.

If we were to loosen the semantics, how do we propose that we do so while still providing the application with a usable set of semantics for consistent execution? It may help to think about consistently returning from MPI_Comm_create() instead of MPI_Comm_validate_all().

There is an escape clause for MPI implementations operating in an environment in which fault tolerant consensus is impossible - the MPI implementation does not have to provide the MPI_Comm_validate_all() (and other similar) functionality, and applications must deal without it. This is similar to how certain environments do not provide comm_spwan and friends for environments in which they are not supported/appropriate. So in essence, if you cannot provide fault detection capabilities, and are unable to provide fault tolerant consensus then do not provide MPI fault tolerance semantics. However, most environments will be able to do so with various degrees of effort.

Of course, as may often be the case, I can be incorrect in my assessment of the situation. This is now I have been thinking about the literature as it applies to the proposal, and recognize that I am not an expert in the field of fault tolerant agreement or consensus protocols so there may be gaps in my understanding or assessment of the situation. But I should note that the particular point of this thread is critical to the success or failure of the acceptance of this proposal to the MPI standardization body. So making sure we are all on the same page, with the same understanding is important.

-- Josh

P.S. If you are interested in the Paxos algorithm, I would suggest reading Chandra's 2007 paper describing their work with it for Google. It is an interesting read for folks that are trying to actually implement correct consensus algorithms at scale.

Referenced Papers (in order of appearance):
Barborak: "The consensus problem in fault-tolerant computing" - 1993.
FLP: "Impossibility of distributed consensus with one faulty process" - 1985.
Dwork: "Consensus in the Presence of Partial Synchrony" - 1988.
Chandra: "Unreliable failure detectors for reliable distributed systems" - 1996.
Chandra: "On the impossibility of group membership" - 1996.
Chandra: "Paxos made live: and engineering perspective" - 2007.

On Feb 16, 2011, at 4:43 PM, Solt, David George wrote:

> In our implementation, we can't guarantee consensus (since it is not possible), but the successful ranks are aware of which ranks may have reached an incompatible conclusion and so they proactively "break" the virtual connections to those processes so that when they known failed ranks attempt to use their wrong communicator, they will get failures and not hang.   I couldn't come up with a way however to ensure that all ranks have perfect consensus in the presence of arbitrary failures.   
> For example, it is possible that 0,1,2,3 call regroup.  Due to a late failure during the algorithm, 0 thinks the group is {0,1}, 1 thinks the group is {0,1} and 2 thinks the group is {1,2} and 3 thinks it is not part of the new group.   In this case, rank 0-1 will close the virtual connections to ranks 2 and 3, so rank 2 will not hang when it tries to use its invalid group.  
> Our assumption is that once a rank is excluded from the group, it cannot be part of the comm ever again.  (It could use connect/accept/comm_merge to join the other processes using a new communicator, but it cannot attempt to regroup the original communicator again).  
> I agree that a return code of "the regroup had problems, please try again" makes no sense and cannot be useful.
> Dave  
> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Thomas Herault
> Sent: Wednesday, February 16, 2011 3:35 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
> If we allow the call to return successfully at some nodes, and an error at others, we defeat the reason of existence of this call.
> If some of them detect the failure, and others don't, some will enter the call (let say A detected the failure, and entered validate again, to acknowledge it), others (B) will enter other communications, e.g. mpi_recv(A), which will never return an error, because communication with A is legitimate, but A is not doing the send, it's trying to revalidate the communicator, which it cannot, because B does not enter the call. The MPI application is erroneous, but could not have been correct: consensus semantics on at least one collective operation is required to allow for a collective repair.
> Thomas
> Le 16 févr. 2011 à 16:24, Bronevetsky, Greg a écrit :
>> Actually, I think Darius has a point. The exact guarantee in impossible in the general case because its reducible to the consensus problem. Unfortunately, the spec has to assume the general case, while databases don't need to and can assume synchronous communication or bounds on message delivery times. I think it'll be safer to use Darius' suggestion: guaranteed to return the same thing on processes where it does return something.
>> Greg Bronevetsky
>> Lawrence Livermore National Lab
>> (925) 424-5756
>> bronevetsky at llnl.gov
>> http://greg.bronevetsky.com 
>>> -----Original Message-----
>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>>> bounces at lists.mpi-forum.org] On Behalf Of Joshua Hursey
>>> Sent: Wednesday, February 16, 2011 1:17 PM
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>> Subject: Re: [Mpi3-ft] MPI_Comm_validate_all
>>> It is a challenging guarantee to provide, but possible. Databases need to
>>> make decisions like this all time with transactions (commit=success, or
>>> abort=failure). Though database transaction protocols are a good place to
>>> start, we can likely loosen some of the restrictions since we are applying
>>> them to a slightly different environment.
>>> Look at a two-phase commit protocol that includes a termination protocol
>>> (Grey), or a three-phase commit protocol (Skeen). The trick is that you
>>> really want what the literature calls a 'nonblocking' commit protocol,
>>> meaning that it will not block in an undecided state waiting for the
>>> recovery of a peer process that might be able to decide from a recovery
>>> log. There are a few other more scalable approaches out there, but are
>>> challenging to implement correctly.
>>> -- Josh
>>> Gray: Notes on Data Base Operating Systems (note this describes a protocol
>>> without the termination protocol, but a databases text should be able to
>>> fill in that part) - 1979
>>> Skeen: Nonblocking commit protocols - 1981
>>> On Feb 16, 2011, at 3:49 PM, Darius Buntinas wrote:
>>>> MPI_Comm_validate_all, according to the proposal at [1], must "either
>>> complete successfully everywhere or return some error everywhere."  Is this
>>> possible to guarantee?  What about process failures during the call?
>>> Consider the last message sent in the protocol.  If the process sending
>>> that message dies just before sending it, the receiver will not know
>>> whether to return success or failure.
>>>> I think that the best we can do is say that the outcount and list of
>>> collectively-detected dead processes will be the same at all processes
>>> where the call completed successfully.
>>>> Or is there a trick I'm missing?
>>>> Thanks,
>>>> -d
>>>> [1] https://svn.mpi-forum.org/trac/mpi-forum-
>>> web/wiki/ft/run_through_stabilization#CollectiveValidationOperations
>>>> _______________________________________________
>>>> mpi3-ft mailing list
>>>> mpi3-ft at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>> ------------------------------------
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>> _______________________________________________
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory

More information about the mpiwg-ft mailing list