[Mpi3-ft] Point-to-point Communications recovery

Richard Graham rlgraham at ornl.gov
Mon Oct 27 16:31:52 CDT 2008


> 
> On Oct 23, 2008, at 5:33 PM, Richard Graham wrote:
> 
> > (I am changing title to help keep track of discussions)
> >
> >   We have not yet started a specific discussion on what to do with
> > communications when a process fails, so now seems as good a time as
> > ever.  We also need to discuss what happens when we restart a
> > process (i.e., do we call MPI_Init()?), but we should do this in a
> > separate thread.
> >
> > Keeping the focus on a single process failure, where we have a
> > source and destination (failed process), I see two cases:
> >
> >       • we do not restore the failed process
> >       • we do restore the failed process
> >
> > The first case seems to be trivial:
> >   The destination is gone, so nothing to do there.
> >   Source: “flush” all communications associated with the process
> > that has failed.
> 
> I think there are a few unanswered questions for this case that make
> it a bit more complex (at least for me).
> 
> First, we are requiring an MPI implementation to detect failures. What
> kind of statement do we want in the standard about the consistency of
> the failure detector across the remaining, assumed-alive processes? If
> a process appears failed from the perspective of one process but not
> another (e.g., loss of connection to the peer), do we require that the
> failure detector involve a global consensus protocol, or may it operate
> locally and possibly make a wrong assessment of the state of the
> process?

We are early in the process at this stage, and are doing our best to
keep information local.  Global protocols tend to scale poorly, so we
are trying to see how far we can go with this approach.  We may (and
will) get into situations where different processes have different views
of the world, but the question is: does this matter, as long as the
affected processes have the correct information when they decide to
participate in some communication or group operation?  We have kept the
option for all processes to get "immediate" information, if requested,
but this is not the proposed default behavior.
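
To make the opt-in notification idea concrete, here is a minimal sketch;
the function name MPIX_Comm_register_failure_cb and its signature are
purely hypothetical, invented only to illustrate what requesting
"immediate" information could look like (nothing with this name exists
in MPI or in the proposal):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical, illustration-only registration call: neither the name
     * nor the signature is part of MPI or of the working group proposal. */
    typedef void (*failure_cb_t)(MPI_Comm comm, int failed_rank);
    extern int MPIX_Comm_register_failure_cb(MPI_Comm comm, failure_cb_t cb);

    static void on_failure(MPI_Comm comm, int failed_rank)
    {
        /* Invoked as soon as the local process learns of the failure,
         * instead of waiting until it next communicates with that rank. */
        fprintf(stderr, "rank %d reported as failed\n", failed_rank);
    }

    void opt_in_to_failure_events(MPI_Comm comm)
    {
        /* Processes that do not register learn of failures lazily, through
         * error codes on communication calls (the proposed default). */
        MPIX_Comm_register_failure_cb(comm, on_failure);
    }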

> 
> Secondly, we left the working group meeting with the case in which a
> process P is waiting in an MPI_Recv() from a process Q, and while it is
> waiting, process Q fails. How do we wake up process P to notify it of
> the failure of process Q without requiring an asynchronous failure
> detection mechanism? We should think about ways of implementing this
> requirement. What if process P is in an MPI_Recv on MPI_ANY_SOURCE and
> process Q is in the communicator: do we signal an error or continue
> waiting?

This is an implementation detail - an important one.  We have agreed
that for an MPI_ANY_SOURCE receive, any failure in that communicator
results in that communication returning an error.
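
A minimal sketch of how an application might observe that behavior,
assuming only that the failure surfaces as a non-MPI_SUCCESS return code
on the wildcard receive (the specific error class has not been defined
yet); everything used here is standard MPI:

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: a wildcard receive that returns an error when some process
     * in the communicator fails, rather than aborting the job. */
    int recv_any(MPI_Comm comm, int *buf)
    {
        MPI_Status status;
        int rc;

        /* Errors must be returned to the caller, not handled fatally. */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        rc = MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &status);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "wildcard receive failed: %s\n", msg);
            /* The application now decides how to repair or abandon comm. */
        }
        return rc;
    }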

> 
> Finally, we must address the issue of the state of the communicators
> in MPI once a process fails. I think we were close to a proposal
> motivated by the FT-MPI semantics. We were discussing leaving a 'hole'
> in the communicator by default. When a process tried to communicate
> with a failed process, the MPI implementation returns an error, and
> the process decides how to deal with the hole in the communicator. It
> may choose to throw away the communicator. Or it may choose to convert
> the hole to MPI_PROC_NULL, and allow the repaired communicator to be
> used in all MPI functions. This forces MPI collectives, for instance,
> to be rewritten in such a way as to work around holes in the
> communicator.

We are taking a decision that differs from what FT-MPI decided, in that
we are proposing not to discard data unless the user asks for that.
Maybe you are pointing out a hole in our thinking, in that we have not
yet considered the case where the user would want to throw away all data
associated with a particular communicator.

The other decision we are leaning towards is that we will allow gaps
(MPI_PROC_NULL) in a communicator, but will not shrink the communicator.
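
As an illustration of the gaps-not-shrinking idea, here is a hedged
sketch. It assumes a hypothetical query, is_rank_failed(), that the
library or application would somehow provide, and otherwise relies only
on the standard semantics of MPI_PROC_NULL (a send or receive involving
MPI_PROC_NULL completes immediately as a no-op), so existing loops over
ranks keep working without renumbering:

    #include <mpi.h>

    /* Hypothetical query (not part of any current API) reporting whether
     * a given rank in comm has failed and been left as a "gap". */
    extern int is_rank_failed(MPI_Comm comm, int rank);

    /* Send a value to every other rank, skipping holes via MPI_PROC_NULL
     * so the communicator keeps its original size and rank numbering. */
    void send_to_all(MPI_Comm comm, int value)
    {
        int size, me, r;
        MPI_Comm_size(comm, &size);
        MPI_Comm_rank(comm, &me);

        for (r = 0; r < size; r++) {
            if (r == me)
                continue;
            /* A failed rank is treated as MPI_PROC_NULL: the send completes
             * immediately and no data is transferred. */
            int dest = is_rank_failed(comm, r) ? MPI_PROC_NULL : r;
            MPI_Send(&value, 1, MPI_INT, dest, 0, comm);
        }
    }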

> 
> Additionally the interaction of process failure with global operations
> such as collectives (or some of the communicator operations, or I/O)
> is a complex question that we must try to address.

Agreed.  That is next.

> 
> I think there are a number of proposals that need to be explored both
> in wording and in implementation before we can try to fully address
> anything more complex, such as process replacement (though I encourage
> exploring this concurrently as well), since a process replacement
> solution will be based on the decisions made when covering this
> foundational case.
> 

The goal is to have a first cut at an API for at least part of the problem
by the end of the forum meeting in December, so that we can go away and
start prototyping.  We have been doing quite a bit of talking over the
last year, and have made enough progress narrowing down what we will
address initially that we need to move to the prototyping stage.

Rich

> 
> >
> >
> > The second case:
> >   The destination just re-initialized its communications
> >   Source: “flush” all outstanding traffic to the destination (part
> > of the repair() call, so negotiation can happen with the
> > destination), and reset the internal point-to-point sequence number
> > so that new traffic can be matched at the destination.
> >
> > This implies that layered s/w (application or middleware) would be
> > responsible for regenerating any lost traffic, if this is needed.
> > Perhaps it would make sense to provide the ability for upper layers
> > to register a recovery function that could be called after repair()
> > is done with restoring MPI internal state.
> >
> > Greg, is the piggy-back capability you are asking for intended to
> > help get the application-level communications reset to the state just
> > before the failure, so that the application can continue?
> >
> > Rich
> >
> >
> > On 10/23/08 11:49 AM, "Greg Bronevetsky" <bronevetsky1 at llnl.gov>
> > wrote:
> >
> >> There is one caveat here that we should be aware of. There is no
> >> efficient way to implement this if we want the sender of a message
> >> to be informed that the receiver has been reset because then every
> >> send becomes a send-receive, which will significantly reduce
> >> performance. However, if we're willing to wait until the process
> >> receives data from the reset process either directly or via some
> >> dependence through other processes, then all this can be
> >> implemented efficiently.
> >>
> >> Also, we should keep in mind that for some protocols we need both
> >> piggybacking and non-blocking collectives. The latter is to avoid
> >> race conditions where a process has begun a blocking collective
> >> call but needs to be informed of something having to do with the
> >> communication.
> >>
> >> Greg Bronevetsky
> >> Post-Doctoral Researcher
> >> 1028 Building 451
> >> Lawrence Livermore National Lab
> >> (925) 424-5756
> >> bronevetsky1 at llnl.gov
> >>
> >>> If, as part of FT MPI, some piggy-back support is provided to the
> >>> application,
> >>> then I don't think this behavior would need to be implemented in the
> >>> MPI library.
> >>>
> >>> Howard
> >>>
> >>> Richard Graham wrote:
> >>>> Can someone think of a reason to have the library do this over
> >>>> the app?  I can see that letting the library do this will avoid
> >>>> potential race conditions that could arise if we let the app do
> >>>> this - basically out of band with respect to the expected
> >>>> communications traffic.
> >>>>
> >>>> Rich
> >>>>
> >>>>
> >>>> On 10/21/08 11:52 PM, "Thomas Herault" <herault.thomas at gmail.com>
> >>>> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Oct 21, 2008, at 10:06 PM, Howard Pritchard wrote:
> >>>>
> >>>> > Hello Rich,
> >>>> >
> >>>> > I thought it was also agreed that if process A communicates with
> >>>> > failed process B, which had been restarted by another process C,
> >>>> > and this was the first communication from A to B since the
> >>>> > restart of B, A would receive the equivalent of an ECONNRESET
> >>>> > error.  This was in the context of a case where option 5 below is
> >>>> > not being used by the app.
> >>>> >
> >>>> > Howard
> >>>> >
> >>>>
> >>>> Hello Howard,
> >>>>
> >>>> there was still some discussion about this at the end of the
> >>>> session.
> >>>>
> >>>> The argument is that the application could enforce this detection
> >>>> as well as the library could, if it is needed: when a process is
> >>>> launched to replace another one, it could define a new
> >>>> revision/epoch/restart number and tag each communication with this
> >>>> number to implement the check. If this can be done as efficiently
> >>>> by the application as it would be done by the library, asking the
> >>>> application to do it itself would help the library avoid the
> >>>> additional cost (i.e. piggybacking an integer on each message) when
> >>>> the application does not need that functionality.
> >>>>
> >>>> It was suggested that the library could provide a generic means to
> >>>> piggyback this kind of information on each message, in a way
> >>>> similar to what is discussed for piggyback/message-logging-based
> >>>> fault tolerance.
> >>>>
> >>>> Thomas
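
A minimal sketch of the epoch-tagging scheme Thomas describes,
implemented entirely at the application level; the message layout, the
table size, and the way the restart epoch is obtained are all
assumptions made only for illustration:

    #include <mpi.h>

    /* Application-level epoch tagging: each message carries the sender's
     * restart epoch so a receiver can detect that a peer was restarted. */
    typedef struct {
        int epoch;    /* incremented each time this process is (re)started */
        int payload;  /* the application data */
    } epoch_msg_t;

    static int my_epoch;           /* assumed to be set at (re)start time    */
    static int peer_epoch[4096];   /* last epoch seen per peer (assumed max) */

    void send_with_epoch(MPI_Comm comm, int dest, int tag, int value)
    {
        epoch_msg_t msg = { my_epoch, value };
        /* Two contiguous ints; a derived datatype would be more portable. */
        MPI_Send(&msg, 2, MPI_INT, dest, tag, comm);
    }

    /* Returns 1 if the peer's epoch changed since we last heard from it,
     * i.e. the application-level equivalent of an ECONNRESET indication. */
    int recv_with_epoch(MPI_Comm comm, int src, int tag, int *out)
    {
        epoch_msg_t msg;
        MPI_Recv(&msg, 2, MPI_INT, src, tag, comm, MPI_STATUS_IGNORE);

        int reset = (peer_epoch[src] != 0 && peer_epoch[src] != msg.epoch);
        peer_epoch[src] = msg.epoch;
        *out = msg.payload;
        return reset;
    }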
> >>>>
> >>>> > Richard Graham wrote:
> >>>> >>
> >>>> >> Here is a summary of what I think we agreed to today.  Please
> >>>> >> correct any errors, and add what I am missing.
> >>>> >>
> >>>> >>      • We need to be able to restore MPI_COMM_WORLD (and its
> >>>> >> derivatives) to a usable state when a process fails.
> >>>> >>      • Restoration may involve having MPI_PROC_NULL replace the
> >>>> >> lost process, or may replace the lost process with a new process
> >>>> >> (we have not specified how this would happen).
> >>>> >>      • Processes communicating directly with the failed process
> >>>> >> will be notified via a returned error code about the failure.
> >>>> >>      • When a process is notified of the failure, comm_repair()
> >>>> >> must be called.  Comm_repair() is not a collective call, and is
> >>>> >> what will initiate the communicator repair associated with the
> >>>> >> failed process.
> >>>> >>      • If a process wants to be notified of process failure even
> >>>> >> if it is not communicating directly with this process, it must
> >>>> >> register for this notification.
> >>>> >>      • We don’t have enough information to know how to continue
> >>>> >> with support for checkpoint/restart.
> >>>> >>      • We need to discuss what needs to be done with respect to
> >>>> >> the failure of collective communications.
> >>>> >>
> >>>> >> There are several issues that came up with respect to these,
> >>>> >> which will be detailed later on.
> >>>> >>
> >>>> >> Rich
> >>>> >>
> >>>> >
> >>>> >
> >>>> > --
> >>>> >
> >>>> > Howard Pritchard
> >>>> > Cray Inc.
> >>>> >
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> > _______________________________________________
> > mpi3-ft mailing list
> > mpi3-ft at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 
> 
