[Mpi3-ft] Point-to-point Communications recovery

Richard Graham rlgraham at ornl.gov
Thu Oct 23 16:33:25 CDT 2008

(I am changing the title to help keep track of discussions)

  We have not yet started a specific discussion on what to do with
communications when a process fails, so now seems as good a time as any.
We also need to discuss what happens when we restart a process (i.e. do we
call MPI_Init()?), but we should do this in a separate thread.

Keeping the focus on a single process failure, where we have a source and
destination (failed process), I see two cases:

* we do not restore the failed process
* we do restore the failed process

The first case seems to be trivial:
  The destination is gone, so nothing to do there.
  Source: "flush" all communications associated with the process that has failed.

The second case:
  The destination just re-initialized its communications
  Source: "flush" all outstanding traffic to the destination (part of the
repair() call, so negotiation can happen with the destination), and reset
the internal point-to-point sequence number so that new traffic can be
matched at the destination.
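
As a rough illustration of the source-side steps in the second case, here is
a minimal Python sketch. The names PeerChannel, flush(), reset_sequence() and
this toy repair() are hypothetical and are not part of any MPI API; the model
simply assumes one sequence counter per destination, starting at 0.

```python
from dataclasses import dataclass, field


@dataclass
class PeerChannel:
    """Toy model of one source-to-destination point-to-point channel."""
    dest: int
    next_seq: int = 0                                 # sequence number for the next send
    outstanding: list = field(default_factory=list)   # sends not yet completed

    def send(self, payload):
        # Tag the payload with the current sequence number and queue it.
        msg = (self.next_seq, payload)
        self.outstanding.append(msg)
        self.next_seq += 1
        return msg

    def flush(self):
        """Drop all outstanding traffic to the failed destination.

        Returns the dropped messages so an upper layer could regenerate them.
        """
        dropped, self.outstanding = self.outstanding, []
        return dropped

    def reset_sequence(self):
        """Restart numbering so a re-initialized destination can match new traffic."""
        self.next_seq = 0


def repair(channel):
    """Hypothetical source-side part of repair(): flush, then reset."""
    dropped = channel.flush()
    channel.reset_sequence()
    return dropped


ch = PeerChannel(dest=3)
ch.send("a")
ch.send("b")
lost = repair(ch)          # both in-flight messages are discarded
first_new = ch.send("c")   # numbering restarts for the restored destination
```

The dropped messages are returned rather than silently discarded only to make
the point that regenerating them is left to whoever sits above this layer.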

This implies that layered s/w (application or middleware) would be
responsible for regenerating any lost traffic, if this is needed.  Perhaps
it would make sense to provide the ability for upper layers to register a
recovery function that could be called after repair() is done with restoring
MPI internal state.
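
One possible shape for such a registration hook, sketched in Python. The
class RepairRegistry and its methods are invented for illustration only; the
step that restores MPI internal state is a placeholder, since nothing about
it has been specified yet.

```python
class RepairRegistry:
    """Toy model: upper layers register recovery functions run after repair."""

    def __init__(self):
        self._callbacks = []
        self.internal_state_restored = False

    def register_recovery(self, fn):
        # fn takes the rank of the failed process as its only argument.
        self._callbacks.append(fn)

    def repair(self, failed_rank):
        # Placeholder for restoring the library's internal state.
        self.internal_state_restored = True
        # Only after that do the upper layers get a chance to regenerate traffic.
        for fn in self._callbacks:
            fn(failed_rank)


log = []
registry = RepairRegistry()
registry.register_recovery(lambda rank: log.append(("middleware", rank)))
registry.register_recovery(lambda rank: log.append(("application", rank)))
registry.repair(failed_rank=3)
```

In this sketch the callbacks run in registration order, which is an
assumption; a real interface would have to say what ordering, if any, is
guaranteed.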

Greg, is the piggy-back capability you are asking for intended to help
get the application-level communications reset to the state just before
failure, so that the application can continue?
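
For reference, the revision/epoch tagging idea from the quoted thread below
can be sketched roughly as follows in Python. EpochTracker and on_receive()
are hypothetical names, not a proposed interface. The sketch assumes each
message carries the sender's restart epoch as one piggybacked integer, and
that a peer learns about a reset only when it next receives from the
restarted process.

```python
class EpochTracker:
    """Toy receiver-side check for a piggybacked restart epoch."""

    def __init__(self):
        self.known_epoch = {}   # peer rank -> highest epoch seen so far

    def on_receive(self, src, epoch, payload):
        """Return (payload, peer_was_reset) for one incoming message."""
        # The first message from a peer establishes its baseline epoch.
        last = self.known_epoch.get(src, epoch)
        self.known_epoch[src] = max(last, epoch)
        # A larger epoch means the peer was restarted since we last heard from it.
        return payload, epoch > last


tracker = EpochTracker()
r1 = tracker.on_receive(src=1, epoch=0, payload="x")   # original process 1
r2 = tracker.on_receive(src=1, epoch=1, payload="y")   # first message after restart
r3 = tracker.on_receive(src=1, epoch=1, payload="z")   # reset already reported
```

The reset flag is raised once, on the first message from the new epoch, which
plays the role of the ECONNRESET-like notification mentioned in the thread.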


On 10/23/08 11:49 AM, "Greg Bronevetsky" <bronevetsky1 at llnl.gov> wrote:

> There is one caveat here that we should be aware of. There is no efficient
> way to implement this if we want the sender of a message to be informed that
> the receiver has been reset because then every send becomes a send-receive,
> which will significantly reduce performance. However, if we're willing to wait
> until the process receives data from the reset process either directly or via
> some dependence through other processes, then all this can be implemented
> efficiently. 
> Also, we should keep in mind that for some protocols we need both piggybacking
> and non-blocking collectives. The latter is to avoid race conditions where a
> process has begun a blocking collective call but needs to be informed of
> something having to do with the communication.
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov
>> If, as part of ft mpi, some piggy-back support is provided to the
>> application,
>> then I don't think this behavior would need to be implemented in the
>> mpi library.
>> Howard
>> Richard Graham wrote:
>>> Can someone think of a reason to have the library do this over the app?  I
>>> can see that letting the library do this will avoid potential race
>>> conditions that could arise if we let the app do this - basically out of
>>> band with respect to the expected communications traffic.
>>> Rich
>>> On 10/21/08 11:52 PM, "Thomas Herault" <herault.thomas at gmail.com> wrote:
>>> On 21 Oct 08 at 22:06, Howard Pritchard wrote:
>>>> > Hello Rich,
>>>> >
>>>> > I thought it was also agreed that if process A communicates with
>>>> > failed process B
>>>> > which had been restarted by another process C, and this was the
>>>> > first communication
>>>> > from A to B since the restart of B, A would receive the equivalent
>>>> > of an ECONNRESET error.
>>>> > This was in the context of a case where option 5 below is not being
>>>> > used by the app.
>>>> >
>>>> > Howard
>>>> >
>>> Hello Howard,
>>> there was still some discussion about this at the end of the session.
>>> The argument is that the application could do as well as the library
>>> to enforce this detection if this is needed: when a process is
>>> launched to replace another one, it could define a new revision/epoch/
>>> restart number and tag each communication with this number to
>>> implement the check. If this can be done as efficiently by the
>>> application as it would be done by the library, asking the application
>>> to do it itself would help the library to avoid the additional cost
>>> (i.e. piggybacking an integer to each message) when the application
>>> does not need that functionality.
>>> It was suggested that the library could provide a generic means to
>>> piggyback this kind of information to each message, in a way similar
>>> to what is discussed about piggyback/message logging-based fault
>>> tolerance.
>>> Thomas
>>>> > Richard Graham wrote:
>>>>> >>
>>>>> >> Here is a summary of what I think that we agreed to today.  Please
>>>>> >> correct any errors, and add what I am missing.
>>>>> >>
>>>>> >>      * We need to be able to restore MPI_COMM_WORLD (and its
>>>>> >> derivatives) to a usable state when a process fails.
>>>>> >>      * Restoration may involve having MPI_PROC_NULL replace the lost
>>>>> >> process, or may replace the lost processes with a new process
>>>>> >> (have not specified how this would happen)
>>>>> >>      * Processes communicating directly with the failed processes will
>>>>> >> be notified via a returned error code about the failure.
>>>>> >>      * When a process is notified of the failure, comm_repair() must be
>>>>> >> called.  Comm_repair() is not a collective call, and is what will
>>>>> >> initiate the communicator repair associated with the failed process.
>>>>> >>      * If a process wants to be notified of process failure even if it
>>>>> >> is not communicating directly with this process, it must register
>>>>> >> for this notification.
>>>>> >>      * We don't have enough information to know how to continue with
>>>>> >> support for checkpoint/restart.
>>>>> >>      * We need to discuss what needs to be done with respect to failure of
>>>>> >> collective communications.
>>>>> >>
>>>>> >> There are several issues that came up with respect to these, which
>>>>> >> will be detailed later on.
>>>>> >>
>>>>> >> Rich
>>>>> >>
>>>>> >> _______________________________________________
>>>>> >> mpi3-ft mailing list
>>>>> >> mpi3-ft at lists.mpi-forum.org
>>>>> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>> >>
>>>> >
>>>> >
>>>> > --
>>>> >
>>>> > Howard Pritchard
>>>> > Cray Inc.
>>>> >
