[Mpi3-ft] simplified FT proposal

Sur, Sayantan sayantan.sur at intel.com
Tue Jan 17 17:15:26 CST 2012


Hi Josh and Rich,

Thanks for responding to my email in detail! Since we have a teleconf scheduled tomorrow, we can discuss each point. I want to clarify the motivations behind the rough proposal I made to start the thread.

Broadly, my view is that FT is a tough problem and any real solution - especially one that scales - requires work from process launcher/runtime, MPI library, and Applications/other libraries. The manner in which we divide the work between all these components is key.

My thoughts are that we should simplify the role of MPI to the extent possible, and have FT-libraries take up some work of providing better semantics to applications. Since there is enough research/debate on-going about ABFT techniques, we might want to simply *enable* fault-tolerant libraries to be built on top of MPI, and only standardize them when there are well-established use cases.

The sketch of the model I proposed does not attempt to make every MPI program fault tolerant. Rather, it PERMITS a fault-tolerant version, thereby splitting the work between MPI and its consumers. Let me clarify this with the MPI_ANY_SOURCE master-worker example.

"What happens when you comm_check/validate a communicator, post MPI_Recv(ANY) and the rank supposed to send you the message dies? Alternatively, what happens when you have MPI_Recv(ANY) and all other processes have died?"

In this model - nothing. The program hangs. The app chose to block wait for any process, and all have died. Tough luck. IMHO, this is a simple behavior. MPI is not trying to be any smarter here than it needs to.

However, consider a FT-library which uses a non-blocking receive. The master process keeps a timer of when it last heard from a process. If the timer is larger than expected, try to do a "ping-pong" with the process. If the remote process has indeed failed, then the MPI library returns an error (p2p coordination required from MPI not global). This is a simple example of application-driven heartbeat protocol.

Well, the question naturally arises that why make the app/lib do all this work and not do it within MPI? I would argue that if you had to implement the notification-type failure as in the current proposal, it would require you to provide some kind of process-manager backbone that did exactly the same thing in the background. You would need to periodically collect heart-beats from everyone JUST in case someone dies. At scale, you would need to do this constantly and communicate with everyone.

The model that I am proposing is actually very similar to yours, it just relaxes the guarantees and semantics. My hope is that the semantics we lose can be provided by not-very-complicated libraries on top of MPI.

PS: I agree that MPI_Comm_check() is very similar to the existing MPI_Comm_validate(). I chose a different name as not to confuse the two.

===
Sayantan Sur, Ph.D.
Intel Corp.

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
Sent: Tuesday, January 17, 2012 12:52 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] simplified FT proposal

Sayantan,

Thanks for presentation of an alternative proposal. Alternative proposals are invaluable in helping the group work towards a more refined final proposal for the forum.

I do have some questions, comments, and concerns about your proposal detailed below by section.

Thanks,
Josh

(1) Consider a communicator with 1 million processes and 10 die. Would you rather reason about 10 dead or (1M-10) alive? At some point it does not matter much, but for the common case where the number of failures is much less than the size of the MPI universe then it is computationally easier to reason about the set of failed. What is the exact meaning of "is_ok"?

(2) The phrase "collectively active/inactive" is bad - as illustrated by the reaction in the meeting. For the discussion consider a MPI_Reduce(). The root may be connected to alive processes in a tree structure, and is waiting patiently for a result. Somewhere in the tree a process fails before entering the MPI_Reduce(). Is it valid for the root to wait forever - locally observed behavior is a hang? or must the processes in the tree propagate the error - locally observed behavior is a transitive error? So considering that situation what must we say in the MPI standard about collective operations?

(3) I'm fine with removing most of the failure detector statement if that is what the group wants. It served its purpose as a method to ground the semantics. I feel it is useful to keep in (it is precise and well grounded in theory), but if it has become distracting then we can consider removing it. I'm concerned with the language that "requests may complete with MPI_ERR_PROC_FAIL_STOP" as a way to allude to the option of MPI being able to delay notification. The wording also leaves open the opportunity for an implementation to wait forever to deliver the failure notification thus hanging the program. That is one of the nice aspects of stating explicitly that eventually every process -may- know of the failure, and that communication will complete (as stated in each of the subchapters). But that conversation tends to come back around to well we should make an explicit statement like we have in the document now...

(4) Yeah I think Failure Handlers are distracting the conversation, and we should consider bringing them in separately. It would be useful to have a notification of process failure -not- tied to the current MPI call context. Which is what the failure handlers were meant to do. However the following question is what can a user do inside a failure handler, which is what lead to the discussion of the different modes, calling order, ... There is a strong use case for these from UTK to which they can elaborate better than I.

(5) So you are proposing the MPI_ANY_SOURCE receives -never- return in error, even if the only process alive in the communicator is the calling process? You suggest that an application should call MPI_Comm_check/validate before calling the MPI_Recv(ANY)? What if a failure occurs between these two operations that would have changed the decision of the application to post the MPI_Recv(ANY) - e.g., the caller is the only process left in the communicator? There remains problems with threading, but much of those problems are derived from the above questions.

(6) ok

(7) I do not really follow why MPI_Comm_check is better than _validate? Can you elaborate? How does an Allreduce provide you the same semantic guarantees as a MPI_Comm_check/validate? A process might have failed in the tree during the result distribution phase causing dependent processes to fail the Allreduce operation. The point is that MPI_Allreduce is not a fault tolerant agreement algorithm, MPI_Comm_check/validate is. They may have the same failure free complexity, but they are semantically different.

(8) The reason we want to allow for the option of reusing a communicator with holes in the membership is for application indexing. This was something that applications requested as many of them use the rank in the communicator to calculate offsets into data structures or participation in the operation. So keeping their rank was seen as important. An application that wants to can create a new communicator with only the alive ranks, but is not required to do so. A different tact, MPI_COMM_WORLD will have holes in it upon process failure, so is it valid to call a collective operation (like MPI_Comm_create) over it to get the new shrunken communicator?

(9) Applications may want to keep track of failed processes for application indexing, as just mentioned, but also for later recovery (you need to know who failed if you are going to recover them).

(10) I don't think the semantics for MPI_Comm_size are odd at all. MPI_Comm_size returns the total size of the communicator, a value that is often used to setup data structures of appropriate size and to reason about the extent to names in the communicator. So, since communicators have a static size, MPI_Comm_size should always return that static size. If you only track the alive processes, how do you know from which you should recover?

(11) So communicator creation should still be 'transactional' - meaning that it returns uniformly at all processes?


On Fri, Jan 13, 2012 at 4:41 PM, Sur, Sayantan <sayantan.sur at intel.com<mailto:sayantan.sur at intel.com>> wrote:
FT WG,

I would like to thank Josh for enduring the marathon plenary presentation! It was truly commendable.

Based on the Forum feedback and vote, it is apparent that there are some significant issues. Primarily due to several new concepts and terms, that the larger Forum does not believe to be required, OR present implementation challenges for the rest of MPI library.

I would like to argue for a simplified version of the proposal that covers a large percentage of use-cases and resists adding new "features" for the full-range of ABFT techniques. It is good if we have a more pragmatic view and not sacrifice the entire FT proposal for the 1% fringe cases. Most apps just want to do something like this:

for(... really long time ...) {
   MPI_Comm_check(work_comm, &is_ok, &alive_group);
   if(!is_ok) {
       MPI_Comm_create_group(alive_group, ..., &new_comm);
      // re-balance workload and use new_comm in rest of computation
       MPI_Comm_free(work_comm); // get rid of old comm
       work_comm = new_comm;
   } else {
     // continue computation using work_comm
     // if some proc failed in this iteration, roll back work done in this iteration, go back to loop
   }
}

Here are some modifications I would like to propose to the current chapter (in order as these concepts/terms appear in the text):


1.       Remove concept of "recognized failed" processes. As was pointed out in the meeting, we don't really care about the failed processes, rather the alive ones. Accordingly, rename MPI_Comm(win/file)_validate() to MPI_Comm(win/file)_check(MPI_Comm comm, int * is_ok, MPI_Group * alive_group);

2.       Remove concept of "collectively inactive/active".  This doesn't really bring anything to the table, rather conflicts with existing definition of collectives. MPI defines collectives as being equivalent of a series of point-to-point calls. As per that definition, if the point-to-point calls succeed (i.e. the corresponding processes are alive), then as locally observed, collective call has also succeeded. As far as the application is concerned as long as the local part of collective is complete successfully, it is OK. If they want to figure out global status, they can always call MPI_Comm_check() or friends.

3.       Eventually perfect failure detector/strongly complete/strongly accurate/etc: We replace this discussion (even remove much of 17.3) with a much more straight-forward requirement - "Communication with a process completes with either success or error. In case of communication with failed processes, communication calls and requests may complete with MPI_ERR_PROC_FAILSTOP." Note that MPI standard requires all communication to complete before calling MPI_Finalize - therefore, the first part of this requirement is nothing new. The second part indicates that there is no guarantee that communication with a failed process *will* fail. Messages may have been internally buffered before the real failure may still be delivered per existing MPI semantics.

a.       This does raise the question from implementers: "When do I mark requests as MPI_ERR_PROC_FAILSTOP? How long do I wait?" The answer completely depends on the implementation. Obviously, there is some requirement to deal with process launcher runtime. In some implementations with connected mode may be able to leverage hw or os techniques to detect connections that have gone down. MPI implementations using connection-less transports may need additional work. However, *none* of this is new work/concepts. As far as possible, we should talk minimally about what the MPI implementation might do to achieve this.

4.       Remove process failure handlers - 17.5.1, 17.5.2, 17.5.3, 17.5.4. The only way to find out if something failed is to call MPI_Comm_check() and friends. This removes a whole lot of complexity with failure handlers. Fail handlers can be emulated over this interface as a library. We may consider them for MPI-3.1 (or 4).

5.       Point-to-point communication: Remove the concept of MPI_ERR_ANY_SOURCE_DISABLED and corresponding calls to re-enable any source. The concept of disabling ANY_SOURCE is counter-intuitive. When an app/lib posts a recv with ANY_SOURCE, it is specifically telling the MPI library that *any* source is OK and implicitly means that if some senders are unable to send, application/lib does not care! Master/slave type of applications wishing to use FT features can periodically call MPI_Comm_check(). Additionally, if the master tries to send to the dead process, it may get an error. My guess is that master/slave type of apps are among the most resilient, and some even work with the current standard (MPI_ERRORS_RETURN). A benefit of removing this restriction is that we no longer have the threading complexities of re-enabling any source using reader/writer locks :) Therefore, we can remove 17.6.3.

6.       Retain MPI_Comm_drain() and MPI_Comm_idrain() as they provide useful functionality.

7.       Collective communication: Rename comm_validate() to comm_check() as per discussion above. We can keep comm_check_multiple() as it provides useful functionality for overlapping communicators by reducing overhead to check them. We can retain much of 17.7.2 while removing references to "collectively inactive". If the output of collective depends on contribution from a failed process, then obviously, the collective fails. This is in keeping with point-to-point semantics - one cannot receive any data from a failed process. Keep in mind if the contribution from failed process may have arrived before it failed - and that is OK (not flagged as failure). Some collectives, such as MPI_Bcast, may succeed even if processes down the bcast tree have failed as sends may simply be buffered. The app/lib will only know if a collective was a global success by either performing an Allreduce after the collective OR calling comm_check(). In any case, it is left to app/lib and not MPI to report failures of processes the library didn't try to communicate with during this op.

8.       I am proposing that once a collective fails with MPI_ERR_PROC_FAIL_STOP, all subsequent collectives on that comm fail immediately with MPI_ERR_PROC_FAIL_STOP. App/lib needs to use MPI_Comm_create_group() to fork off a new comm of live procs and continue with it. This is a deviation from the current proposal that allows collectives on bad comms (after re-enabling collectives) and keeps 0s as contributions. I am aware that this might not fully satisfy all use cases (although at this point of time, I cannot think of any), but in a broader view, we could think this of as a compromise to reduce complexity.

9.       Example 17.4 changes only slightly to call comm_check() and then split off the new communicator. Why keep failed procs in the communicator anyways?

10.   Note that this change in semantics allows us to bypass the question raised: "Why does comm_size() on a communicator with failed procs still return the old value- alive_ranks + failed_ranks?" As I mentioned before, this is odd, and we should encourage app/lib to only deal with known alive ranks. The current proposal does the reverse - forces app to keep track of "known failed". This causes confusion!

11.   Process topologies 17.9 - should change to say that we can only use communicators with live ranks. i.e. if you know your comm was bad, split off a new comm with live ranks. During the op, some ranks may fail - and that is OK since MPI_ERR_PROC_FAIL_STOP will be raised. This is mentioned in the current proposal.

12.   Similar changes in semantics to windows and files.

Please let me know if I overlooked some corner cases or I have mis-interpreted the text of the current chapter. I gave it some thought, but WG knows best!


Thanks!

===
Sayantan Sur, Ph.D.
Intel Corp.


_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org<mailto:mpi3-ft at lists.mpi-forum.org>
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft



--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-ft/attachments/20120117/4a3b069b/attachment-0001.html>


More information about the mpiwg-ft mailing list