<HTML>

<HEAD>

<TITLE>Re: [Mpi3-ft] Point-to-point Communications recovery</TITLE>

</HEAD>

<BODY>

<BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>> <BR>

</SPAN></FONT></BLOCKQUOTE><FONT FACE="Calibri, Verdana, Helvetica, Arial"><SPAN STYLE='font-size:11pt'>> <BR>

> On Oct 23, 2008, at 5:33 PM, Richard Graham wrote:<BR>

> <BR>

> > (I am changing title to help keep track of discussions)<BR>

> ><BR>

> >   We have not yet started a specific discussion on what to do with <BR>

> > communications when a process fails, so now seems as good a time as <BR>

> > ever.  We also need to discuss what happens when we restart a <BR>

> > process (i.e. do we call MPI_Init()?), but we should do this in a <BR>

> > separate thread.<BR>

> ><BR>

> > Keeping the focus on a single process failure, where we have a <BR>

> > source and destination (failed process), I see two cases:<BR>

> ><BR>

> >       • we do not restore the failed process<BR>

> >       • we do restore the failed process<BR>

> ><BR>

> > The first case seems to be trivial:<BR>

> >   The destination is gone, so nothing to do there.<BR>

> >   Source: “flush” all communications associated with the process that has <BR>

> > failed.<BR>

> <BR>

> I think there are a few unanswered questions for this case that make <BR>

> it a bit more complex (at least for me).<BR>

> <BR>

> First, we are requiring an MPI implementation to detect failures. What <BR>

> kind of statement do we want in the standard about the consistency of <BR>

> the failure detector across the remaining, assumed alive processes? If <BR>

> one process fails from the perspective of one process, but not another <BR>

> (loss of connection to the peer), do we require that the failure <BR>

> detector involve a global consensus protocol, or may it operate locally <BR>

> and possibly make the wrong assessment of the state of the process?<BR>

<BR>

We are early on in the process at this stage, and are doing our best to<BR>

keep information local.  Global protocols tend to scale poorly, so we<BR>

are trying to see how far we can go with this approach.  We may (and will)<BR>

get into a situation where different processes have different views of <BR>

the world, but the question is: does this matter, as long as the affected processes<BR>

have the correct information when they decide to participate in some<BR>

communication or group operation?  We have kept the option for all<BR>

processes to get "immediate" information, if requested, but this is not<BR>

the proposed default behavior.<BR>

<BR>

> <BR>

> Secondly, we left the working group meeting with the case in which a <BR>

> process P is waiting in an MPI_Recv() from a process Q. While P is waiting, <BR>

> process Q fails. How do we wake up process P to notify it of the <BR>

> failure of process Q without requiring an asynchronous failure <BR>

> detection mechanism? We should think about ways of implementing this <BR>

> requirement. What if process P is in an MPI_Recv on MPI_ANY_SOURCE and <BR>

> process Q is in the communicator, do we signal an error or continue <BR>

> waiting?<BR>

<BR>

This is an implementation detail, though an important one.  We have agreed<BR>

that for an MPI_ANY_SOURCE receive, the failure of any process in that communicator<BR>

causes the receive to return an error.<BR>
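
<BR>

As a rough illustration of that rule, here is a toy model in Python (not MPI code).<BR>

The names ANY_SOURCE and recv_outcome are made up for this sketch, and the<BR>

named-source case assumes the "error on direct communication with a failed process"<BR>

behaviour from the earlier summary.<BR>

<PRE>
```python
# Toy model of the receive rule discussed above; this is not MPI code.
# ANY_SOURCE and recv_outcome are hypothetical names for illustration.
ANY_SOURCE = -1


def recv_outcome(source, comm_ranks, failed_ranks):
    """Return "error" or "keep waiting" for a pending receive."""
    failed_in_comm = set(comm_ranks).intersection(failed_ranks)
    if source == ANY_SOURCE:
        # Any failure in the communicator fails a wildcard receive.
        return "error" if failed_in_comm else "keep waiting"
    # A receive from a named peer fails only if that peer has failed.
    return "error" if source in failed_in_comm else "keep waiting"


comm = [0, 1, 2, 3]
print(recv_outcome(ANY_SOURCE, comm, {2}))  # error
print(recv_outcome(1, comm, {2}))           # keep waiting
print(recv_outcome(2, comm, {2}))           # error
```
</PRE>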

<BR>

> <BR>

> Finally, we must address the issue of the state of the communicators <BR>

> in MPI once a process fails. I think we were close to a proposal <BR>

> motivated by the FT-MPI semantics. We were discussing leaving a 'hole' <BR>

> in the communicator by default. When a process tries to communicate <BR>

> with a failed process, the MPI implementation returns an error, and <BR>

> the process decides how to deal with the hole in the communicator. It <BR>

> may choose to throw away the communicator. Or it may choose to convert <BR>

> the hole to MPI_PROC_NULL, and allow the repaired communicator to be <BR>

> used in all MPI functions. This forces MPI collectives, for instance, <BR>

> to be rewritten in such a way as to work around holes in the <BR>

> communicator.<BR>

<BR>

We are taking a decision that is different than what FT-MPI decided, in<BR>

that we are proposing not to discard data, based on user input.  Maybe<BR>

you are pointing out a hole, in that we currently have not thought<BR>

about the case that the user would want to throw away all data associated<BR>

with a particular communicator.<BR>

<BR>

The other decision we are leaning towards is that we will allow<BR>

gaps (MPI_PROC_NULL) in a communicator, but will not shrink the<BR>

communicator.<BR>
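
<BR>

To make the "gap, not shrink" idea concrete, here is a small toy model in Python<BR>

(not MPI code).  PROC_NULL, leave_gap and send are hypothetical names; the only MPI<BR>

fact assumed is that communication with MPI_PROC_NULL completes with no effect.<BR>

<PRE>
```python
# Toy model of a communicator that keeps a gap instead of shrinking.
# PROC_NULL, leave_gap and send are hypothetical names; this is not MPI code.
PROC_NULL = None


def leave_gap(members, failed_rank):
    """Return a copy of the member list with the failed rank replaced by a gap."""
    repaired = list(members)
    repaired[failed_rank] = PROC_NULL
    return repaired


def send(members, dest_rank, payload, mailboxes):
    """Deliver payload unless dest_rank is a gap, in which case do nothing."""
    if members[dest_rank] is PROC_NULL:
        return False  # like a send to MPI_PROC_NULL: completes, no effect
    mailboxes.setdefault(dest_rank, []).append(payload)
    return True


members = ["p0", "p1", "p2", "p3"]
repaired = leave_gap(members, 2)
mailboxes = {}
print(len(repaired))                      # 4: the size is unchanged
print(repaired.index("p3"))               # 3: survivors keep their ranks
print(send(repaired, 2, "x", mailboxes))  # False: the gap swallows the send
print(send(repaired, 3, "x", mailboxes))  # True
```
</PRE>

Because the survivors keep their ranks, rank-based application logic does not need<BR>

to be renumbered after a failure.<BR>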

<BR>

> <BR>

> Additionally the interaction of process failure with global operations <BR>

> such as collectives (or some of the communicator operations, or I/O) <BR>

> is a complex question that we must try to address.<BR>

<BR>

Agreed.  That is next.  <BR>

<BR>

> <BR>

> I think there are a number of proposals that need to be explored both <BR>

> in wording and in implementation before we can try to fully address <BR>

> anything more complex, such as process replacement (though I <BR>

> encourage exploring this concurrently as well), since a process <BR>

> replacement solution will be based on the decisions made when covering <BR>

> this foundational case.<BR>

> <BR>

<BR>

The goal is to have a first cut at an API for at least part of the problem<BR>

by the end of the forum meeting in Dec, so that we can go away and start<BR>

prototyping.  We have been doing quite a bit of talking over the last year,<BR>

and have made enough progress in narrowing down what we will<BR>

address initially that we need to move to the prototyping stage.<BR>

<BR>

Rich<BR>

<BR>

> <BR>

> ><BR>

> ><BR>

> > The second case:<BR>

> >   The destination just re-initialized its communications<BR>

> >   Source: “flush” all outstanding traffic to the destination (part <BR>

> > of the repair() call, so negotiation can happen with the <BR>

> > destination), and reset the internal point-to-point sequence number <BR>

> > so that new traffic can be matched at the destination.<BR>

> ><BR>

> > This implies that layered s/w (application or middleware) would be <BR>

> > responsible for regenerating any lost traffic, if this is needed.  <BR>

> > Perhaps it would make sense to provide the ability for upper layers <BR>

> > to register a recovery function that could be called after repair() <BR>

> > is done with restoring MPI internal state.<BR>

> ><BR>
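
<BR>

A minimal sketch of what such a registration hook could look like, as a toy Python<BR>

model rather than an MPI interface.  RepairHooks, register and repair are<BR>

hypothetical names; nothing here is a proposed API.<BR>

<PRE>
```python
# Toy model of recovery hooks run after repair(); this is not MPI code.
# RepairHooks, register and repair are hypothetical names.


class RepairHooks:
    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        """Remember a function to call once internal state is restored."""
        self._callbacks.append(callback)

    def repair(self, failed_rank):
        """Stand-in for repair(): restore state, then notify upper layers."""
        # (library-internal restoration would happen here)
        return [callback(failed_rank) for callback in self._callbacks]


hooks = RepairHooks()
hooks.register(lambda rank: "resend to %d" % rank)
print(hooks.repair(2))  # ['resend to 2']
```
</PRE>

Callbacks run in registration order, so a middleware layer registered first would<BR>

get to regenerate its traffic before the application's own hook runs.<BR>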

> > Greg, is the piggy-back capability you are asking for intended to <BR>

> > help getting the application level communications reset to the state <BR>

> > just before failure, so that the application can continue ?<BR>

> ><BR>

> > Rich<BR>

> ><BR>

> ><BR>

> > On 10/23/08 11:49 AM, "Greg Bronevetsky" &lt;<a href="mailto:bronevetsky1@llnl.gov">bronevetsky1@llnl.gov</a>&gt; <BR>

> > wrote:<BR>

> ><BR>

> >> There is one caveat here that we should be aware of. There is no <BR>

> >> efficient way to implement this if we want the sender of a message <BR>

> >> to be informed that the receiver has been reset because then every <BR>

> >> send becomes a send-receive, which will significantly reduce <BR>

> >> performance. However, if we're willing to wait until the process <BR>

> >> receives data from the reset process either directly or via some <BR>

> >> dependence through other processes, then all this can be <BR>

> >> implemented efficiently.<BR>

> >><BR>

> >> Also, we should keep in mind that for some protocols we need both <BR>

> >> piggybacking and non-blocking collectives. The latter is to avoid <BR>

> >> race conditions where a process has begun a blocking collective <BR>

> >> call but needs to be informed of something having to do with the <BR>

> >> communication.<BR>

> >><BR>

> >> Greg Bronevetsky<BR>

> >> Post-Doctoral Researcher<BR>

> >> 1028 Building 451<BR>

> >> Lawrence Livermore National Lab<BR>

> >> (925) 424-5756<BR>

> >> <a href="mailto:bronevetsky1@llnl.gov">bronevetsky1@llnl.gov</a><BR>

> >><BR>

> >>> If, as part of ft mpi, some piggy-back support is provided to the <BR>

> >>> application,<BR>

> >>> then i don't think this behavior would need to be implemented in the<BR>

> >>> mpi library.<BR>

> >>><BR>

> >>> Howard<BR>

> >>><BR>

> >>> Richard Graham wrote:<BR>

> >>>> Can someone think of a reason to have the library do this over <BR>

> >>>> the app ?  I can see that letting the library do this will avoid <BR>

> >>>> potential race conditions that could arise if we let the app do <BR>

> >>>> this - basically out of band with respect to the expected <BR>

> >>>> communications traffic.<BR>

> >>>><BR>

> >>>> Rich<BR>

> >>>><BR>

> >>>><BR>

> >>>> On 10/21/08 11:52 PM, "Thomas Herault" &lt;<a href="mailto:herault.thomas@gmail.com">herault.thomas@gmail.com</a>&gt; <BR>

> >>>> wrote:<BR>

> >>>><BR>

> >>>><BR>

> >>>><BR>

> >>>> On 21 Oct 2008, at 22:06, Howard Pritchard wrote:<BR>

> >>>><BR>

> >>>> > Hello Rich,<BR>

> >>>> ><BR>

> >>>> > I thought it was also agreed that if process A communicates with<BR>

> >>>> > failed process B<BR>

> >>>> > which had been restarted by another process C, and this was the<BR>

> >>>> > first communication<BR>

> >>>> > from A to B since the restart of B, A would receive the <BR>

> >>>> equivalent<BR>

> >>>> > of a ECONNRESET error.<BR>

> >>>> > This was in the context of a case where option 5 below is not <BR>

> >>>> being<BR>

> >>>> > used by the app.<BR>

> >>>> ><BR>

> >>>> > Howard<BR>

> >>>> ><BR>

> >>>><BR>

> >>>> Hello Howard,<BR>

> >>>><BR>

> >>>> there were still some discussions about this at the end of the <BR>

> >>>> session.<BR>

> >>>><BR>

> >>>> The argument is that the application could do as well as the <BR>

> >>>> library<BR>

> >>>> to enforce this detection if this is needed: when a process is<BR>

> >>>> launched to replace another one, it could define a new revision/<BR>

> >>>> epoch/<BR>

> >>>> restart number and tag each communication with this number to<BR>

> >>>> implement the check. If this can be done as efficiently by the<BR>

> >>>> application as it would be done by the library, asking the <BR>

> >>>> application<BR>

> >>>> to do it itself would help the library to avoid the additional cost<BR>

> >>>> (i.e. piggybacking an integer to each message) when the application<BR>

> >>>> does not need that functionality.<BR>

> >>>><BR>
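
<BR>

The epoch-tagging idea above can be sketched as a toy Python model (not MPI code).<BR>

Peer, restart, stamp and accept are hypothetical names.  The receiver only learns of<BR>

a restart when it next receives a message from the restarted process, and the first<BR>

message ever seen from a sender just sets the baseline.<BR>

<PRE>
```python
# Toy model of application-level epoch tagging; this is not MPI code.
# Peer, restart, stamp and accept are hypothetical names for illustration.


class Peer:
    def __init__(self, rank):
        self.rank = rank
        self.epoch = 0   # bumped each time this rank is replaced
        self.known = {}  # last epoch seen from each sender

    def restart(self):
        """Model a replacement process taking over this rank."""
        self.epoch += 1
        self.known = {}

    def stamp(self, payload):
        """Piggyback (sender rank, epoch) onto an outgoing message."""
        return (self.rank, self.epoch, payload)

    def accept(self, message):
        """Return (payload, restarted); restarted is True when the sender's
        epoch differs from the one last seen from that sender."""
        sender, epoch, payload = message
        previous = self.known.get(sender, epoch)
        self.known[sender] = epoch
        return payload, epoch != previous


a, b = Peer(0), Peer(1)
print(a.accept(b.stamp("hello")))  # ('hello', False)
b.restart()
print(a.accept(b.stamp("again")))  # ('again', True)
print(a.accept(b.stamp("more")))   # ('more', False)
```
</PRE>

The cost is one extra integer per message, which is the overhead an application<BR>

that does not need this check would avoid by not tagging at all.<BR>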

> >>>> It was suggested that the library could provide a generic means to<BR>

> >>>> piggyback this kind of information onto each message, in a way <BR>

> >>>> similar<BR>

> >>>> to what is being discussed for piggyback/message-logging-based fault<BR>

> >>>> tolerance.<BR>

> >>>><BR>

> >>>> Thomas<BR>

> >>>><BR>

> >>>> > Richard Graham wrote:<BR>

> >>>> >><BR>

> >>>> >> Here is a summary of what I think that we agreed to today.  <BR>

> >>>> Please<BR>

> >>>> >> correct any errors, and add what I am missing.<BR>

> >>>> >><BR>

> >>>> >>      • We need to be able to restore MPI_COMM_WORLD (and its<BR>

> >>>> >> derivatives) to a usable state when a process fails.<BR>

> >>>> >>      • Restoration may involve having MPI_PROC_NULL replace <BR>

> >>>> the lost<BR>

> >>>> >> process, or may replace the lost process with a new process<BR>

> >>>> >> (we have not specified how this would happen).<BR>

> >>>> >>      • Processes communicating directly with the failed <BR>

> >>>> processes will<BR>

> >>>> >> be notified of the failure via a returned error code.<BR>

> >>>> >>      • When a process is notified of the failure, <BR>

> >>>> comm_repair() must be<BR>

> >>>> >> called.  Comm_repair() is not a collective call, and is what <BR>

> >>>> will<BR>

> >>>> >> initiate the communicator repair associated with the failed <BR>

> >>>> process.<BR>

> >>>> >>      • If a process wants to be notified of process failure <BR>

> >>>> even if it<BR>

> >>>> >> is not communicating directly with this process, it must <BR>

> >>>> register<BR>

> >>>> >> for this notification.<BR>

> >>>> >>      • We don’t have enough information to know how to <BR>

> >>>> continue with<BR>

> >>>> >> support for checkpoint/restart.<BR>

> >>>> >>      • We need to discuss what needs to be done with respect to <BR>

> >>>> failure of<BR>

> >>>> >> collective communications.<BR>

> >>>> >><BR>

> >>>> >> There are several issues that came up with respect to these, <BR>

> >>>> which<BR>

> >>>> >> will be detailed later on.<BR>

> >>>> >><BR>

> >>>> >> Rich<BR>

> >>>> >><BR>

> >>>> >> _______________________________________________<BR>

> >>>> >> mpi3-ft mailing list<BR>

> >>>> >> <a href="mailto:mpi3-ft@lists.mpi-forum.org">mpi3-ft@lists.mpi-forum.org</a><BR>

> >>>> >> <a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</a><BR>

> >>>> >><BR>

> >>>> ><BR>

> >>>> ><BR>

> >>>> > --<BR>

> >>>> ><BR>

> >>>> > Howard Pritchard<BR>

> >>>> > Cray Inc.<BR>

> >>>> ><BR>


> >>>><BR>

> >>>><BR>


> >>>><BR>

> >>><BR>

> >>><BR>


> <BR>

> <BR>

</SPAN></FONT>

</BODY>

</HTML>