[Mpi3-ft] Run-Through Stabilization Users Guide

Joshua Hursey jjhursey at open-mpi.org
Tue Feb 8 09:21:42 CST 2011

On Feb 7, 2011, at 5:13 PM, Bronevetsky, Greg wrote:

>> I think we should steer clear of even hinting to the application that we
>> are going to kill processes so that we can save memory. Now the process
>> could fail because MPI tried to allocate memory and ran out - at which
>> point the MPI should return an error, and shutdown services - Maybe a good
>> use case for CANNOT_CONTINUE. That situation is slightly different from
>> proactively killing processes because we don't want to/cannot track their
>> state. That sounds like a surefire way to make users flee from the MPI to
>> something else.
> In that case, how do we deal with network partitions? As I mentioned in my email to Toon, the current semantics effectively force us to kill all the processes on one side of a network partition once the partition is healed. The same is true for any process that is erroneously judged to have failed because of a transient network problem. What do we do then?

Note that the following is one possible implementation approach that I have often seen in the literature; there are likely others.

For network partitions, the decision to terminate is (often) determined by a majority vote. The group containing the majority of alive processes (determined from the last known count of alive processes) wins, and any group with less than a majority self-terminates. If there is no clear majority (e.g., we divide into three even groups), then all groups self-terminate and the application is terminated. So the application is never left in a split-brain scenario.
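The majority-vote rule above can be sketched as a small predicate. This is an illustrative helper, not part of any MPI API; `last_known_alive` and `partition_may_continue` are hypothetical names I chose for the sketch.

```c
#include <stdbool.h>

/* Sketch of the majority-vote rule: 'last_known_alive' is the number
 * of processes believed alive before the partition; 'my_partition_size'
 * is how many processes this side of the partition can still reach.
 * Returns true if this partition may continue, false if it must
 * self-terminate. */
static bool partition_may_continue(int my_partition_size,
                                   int last_known_alive)
{
    /* A strict majority is required; ties (an even split, or three
     * equal groups) mean no side continues. */
    return 2 * my_partition_size > last_known_alive;
}
```

With 10 processes last known alive, a 6-process side continues, a 4-process side self-terminates, and a 5/5 split terminates both sides, which is exactly the "no clear majority" case above.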

The MPI implementation may choose (and should choose) to mask transient failures that it has not already escalated to permanent failures. If the MPI implementation has not yet reported the transient process failure to the application when it receives notification that the process is still alive, then it must decide how to proceed. This is the classic problem of building a perfect failure detector from an unreliable one. How an MPI implementation handles transient failures that have not been reported to the application is a quality-of-implementation concern. If the MPI implementation has reported the failure to the application, then it must ensure that the process is fail-stop - either by killing it or by masking its influence for the duration of the job. This must eventually be guaranteed across the entire job - eventual consistency.

There may be a period of time during which one alive process sees a residual impression of a failed process due to network buffering. We cannot avoid this, but the MPI library should manage how it is relayed to the application, depending on whether the library has yet decided that the process has failed.

As for how to terminate the processes in the minority partition (which may be what you were getting at), that is probably something we should clarify. An implementation may choose to shut down the MPI library, effectively isolating the process. The MPI library would then return CANNOT_CONTINUE for all subsequent actions, and the application should clean up and terminate as normal. Alternatively, the MPI implementation can forcibly terminate the processes in the minority group by calling MPI_Abort(COMM_SELF) in each of those processes.
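The "shut down the library" option amounts to a one-way state flip guarding every entry point. The sketch below is hypothetical (the enum values, `isolate_self`, and `lib_call` are names I invented; CANNOT_CONTINUE itself was a proposal-era error class, not a standardized constant):

```c
/* Illustrative sketch of the "isolate the process" option: once the
 * library enters the shut-down state, every subsequent MPI-like call
 * returns a CANNOT_CONTINUE-style error instead of communicating. */
enum { LIB_OK = 0, LIB_SHUT_DOWN = 1 };
enum { ERR_SUCCESS = 0, ERR_CANNOT_CONTINUE = 1 };

static int lib_state = LIB_OK;

/* Invoked when this process finds itself in a minority partition
 * (or has been aborted by a peer). */
static void isolate_self(void) { lib_state = LIB_SHUT_DOWN; }

/* Guard placed at the top of every communication routine; the
 * application sees the error and can clean up and exit on its own. */
static int lib_call(void)
{
    if (lib_state == LIB_SHUT_DOWN)
        return ERR_CANNOT_CONTINUE;
    return ERR_SUCCESS;
}
```

The design point is that the transition is irreversible for the rest of the job, which is what keeps the isolated process from influencing the surviving majority.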

I think the first solution is a bit kinder to the application, and it seems to fit the wording of the current proposal. It gives the application control over how it shuts down, and lets it save state if necessary.

By extension, this scenario leads to the question of how we determine what it means to "abort all tasks in the group of comm" for MPI_Abort(comm). We could either shut down the MPI library and return CANNOT_CONTINUE (or maybe MPI_ERROR_YOU_HAVE_BEEN_ABORTED_BY_A_PEER), or have the MPI library forcibly terminate the other processes in the group (by calling MPI_Abort(COMM_SELF) if possible).

I don't really have a strong leaning on the MPI_Abort() issue; maybe we just leave it up to the implementation to define this scenario. If the MPI implementation forcibly terminates a process, that is effectively an unexpected fail-stop from the application's perspective, and applications are already set up to handle that. On the flip side, giving the application more control, even though it is isolated to operating by itself, is a bit kinder. The environment may dictate what is possible to provide in these scenarios. ... I dunno, what do others think?

>> As far as the application is concerned, it only needs to allocate an array
>> of MPI_Rank_info objects the size of that returned by MPI_Comm_validate_all.
>> Which should be small, but is inclusive of all the known failures
>> (recognized and unrecognized).
>> We could add a flag to the local MPI_Comm_validate that would return a list
>> of the unrecognized failures, instead of all failures. Or, a bit more
>> specifically, a 'state' key argument to the function, and it will return
>> only those processes that match that state. This would allow the
>> application to manage smaller sets of failures without having to iterate
>> over previously recognized failures. What do you think about that?
> I like both ideas very much. The use of a flag for MPI_Comm_validate can also be applied to MPI_Comm_validate_all, and it avoids the need for two collectives. Also, having MPI return a list would be a big improvement. The set of failed processes will be sparse in the overall set of MPI ranks, so forcing developers to allocate and iterate over arrays will be very expensive. Lists solve this problem and are generally a more scalable way to communicate information about a few ranks from the set of all possible ranks.

We talked about this a bit more during the FT WG session at the MPI Forum and came up with a set of API solutions that we should discuss further. Since I was not able to hear the full conversation as accurately as I needed to, I asked that someone in the room post the set of solutions to the list for further discussion. But I think we are trending toward something along these lines.
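The state-key filtering idea from the quoted exchange could look something like the following. This is purely a sketch of the concept, not the proposal's actual API; the struct layout, `filter_by_state`, and the enum names are my own stand-ins.

```c
#include <stddef.h>

/* Hypothetical per-rank state record, loosely modeled on the
 * MPI_Rank_info idea from the thread. */
enum rank_state { STATE_OK, STATE_FAILED, STATE_NULL };

struct rank_info { int rank; enum rank_state state; };

/* Copy into 'out' (capacity 'cap') every entry of 'all' whose state
 * matches 'key', and return the number of matches.  A validate call
 * taking a state key could behave like this, letting the application
 * retrieve only unrecognized failures rather than the full table. */
static size_t filter_by_state(const struct rank_info *all, size_t n,
                              enum rank_state key,
                              struct rank_info *out, size_t cap)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (all[i].state == key && m < cap)
            out[m++] = all[i];
    return m;
}
```

Since failed processes are sparse among all ranks, returning only the matching entries keeps the application's working set proportional to the number of new failures, which is the scalability point Greg raises above.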

>> Ah, I see what you are saying now. So add one more state to the
>> taxonomy:
> ... 
>> Globally cleared really only has meaning at the full communicator level
>> (used to determine if collectives are enabled/disabled). So wouldn't it be
>> better to have a way to query the communicator to see if it was globally
>> cleared or not, in order for the application to decide if collectives can
>> be used or not. We talked a while back about adding an attribute to the
>> communicator to indicate this, but I don't think it ever made it into the
>> proposal. I'll make a note about this, and maybe we can discuss what it
>> should look like next week.
> Ah, now I see what you're saying! If any rank in a communicator is not globally cleared, then the only remedy is to clear the entire communicator. As such, the only useful API is one that checks the global-cleared status of communicators, not individual ranks. Yeah, that makes sense. Also, I think you should change the terminology from "globally" cleared to something more connected to communicators. For example, a rank may be "locally cleared" and a communicator can be "cleared", but a rank can't be "globally cleared".

Getting a clear terminology for this is important as we move closer to standard language. So this is good.

How about:
An individual rank can be:
 - alive (MPI_RANK_STATE_OK)
 - unrecognized failed (MPI_RANK_STATE_FAILED)
 - locally cleared (MPI_RANK_STATE_NULL)
A communicator can be (and by extension file handles and windows):
 - collectively cleared
 - collectively uncleared

So we would use phrases like:
Collectives: "the associated communicator must be collectively cleared before calling this function"
P2P: "The associated rank must be alive or locally cleared before calling this function"

I think that propagating the word 'collectively' with the state of the communicator (though redundant) helps clarify that clearing ranks locally does not lead to a cleared communicator for collectives; the application must call MPI_Comm_validate_all.
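The distinction between locally clearing a rank and collectively clearing a communicator can be modeled as below. The enum values mirror the MPI_RANK_STATE_* names from the taxonomy; the `comm` struct, `local_clear`, and `validate_all` are hypothetical stand-ins for the proposal's machinery.

```c
#include <stdbool.h>
#include <stddef.h>

enum rank_state {
    RANK_STATE_OK,      /* alive */
    RANK_STATE_FAILED,  /* unrecognized failed */
    RANK_STATE_NULL     /* locally cleared */
};

struct comm {
    enum rank_state *states;     /* per-rank states */
    size_t size;
    bool collectively_cleared;   /* set only by the collective clear */
};

/* Locally clearing a failed rank re-enables P2P with the rest of the
 * group but does NOT collectively clear the communicator. */
static void local_clear(struct comm *c, size_t rank)
{
    if (c->states[rank] == RANK_STATE_FAILED)
        c->states[rank] = RANK_STATE_NULL;
}

/* Stand-in for MPI_Comm_validate_all: after the collective agreement,
 * all known failures are cleared and collectives are re-enabled. */
static void validate_all(struct comm *c)
{
    for (size_t i = 0; i < c->size; i++)
        if (c->states[i] == RANK_STATE_FAILED)
            c->states[i] = RANK_STATE_NULL;
    c->collectively_cleared = true;
}
```

Note how `local_clear` never touches `collectively_cleared`: that flag flips only in the collective call, which is the whole point of keeping "collectively" in the terminology.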

Thanks for all the feedback, keep it coming :)

-- Josh

> Greg Bronevetsky
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky at llnl.gov
> http://greg.bronevetsky.com
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
