[Mpi3-ft] Summary of today's meeting

Richard Graham rlgraham at ornl.gov
Thu Oct 23 07:30:19 CDT 2008


Can someone think of a reason to have the library do this rather than the
app? I can see that letting the library do it would avoid potential race
conditions that could arise if the app did it, basically out of
band with respect to the expected communication traffic.

Rich


On 10/21/08 11:52 PM, "Thomas Herault" <herault.thomas at gmail.com> wrote:

> 
> 
> On 21 Oct 2008, at 22:06, Howard Pritchard wrote:
> 
>> Hello Rich,
>>
>> I thought it was also agreed that if process A communicates with a
>> failed process B that has been restarted by another process C, and
>> this is the first communication from A to B since the restart of B,
>> A would receive the equivalent of an ECONNRESET error.
>> This was in the context of a case where option 5 below is not being
>> used by the app.
>>
>> Howard
>>
> 
> Hello Howard,
> 
> There was still some discussion about this at the end of the session.
> 
> The argument is that the application could do as well as the library
> at enforcing this detection if it is needed: when a process is
> launched to replace another one, it could define a new
> revision/epoch/restart number and tag each communication with this
> number to implement the check. If the application can do this as
> efficiently as the library would, asking the application to do it
> itself lets the library avoid the additional cost (i.e. piggybacking
> an integer on each message) when the application does not need that
> functionality.
> 
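The application-level check described above can be sketched in a few lines. This is only a toy model in plain Python, not MPI code: the names `Endpoint`, `tag`, `accept`, and `PeerRestarted` are hypothetical, and it assumes the restart is detected on the receiving side, the first time a message carrying a new epoch arrives from a given rank.

```python
from dataclasses import dataclass, field


class PeerRestarted(Exception):
    """Hypothetical error, standing in for an ECONNRESET-like report."""


@dataclass
class Endpoint:
    rank: int
    epoch: int = 0                            # bumped each time this rank is relaunched
    seen: dict = field(default_factory=dict)  # peer rank -> last epoch observed

    def tag(self, payload):
        # Piggyback our own epoch on every outgoing message.
        return (self.rank, self.epoch, payload)

    def accept(self, message):
        src, epoch, payload = message
        last = self.seen.setdefault(src, epoch)
        if epoch != last:
            self.seen[src] = epoch            # report each restart only once
            raise PeerRestarted(f"rank {src} restarted: epoch {last} -> {epoch}")
        return payload


a = Endpoint(rank=0)
b = Endpoint(rank=1)
first = a.accept(b.tag("hello"))              # epoch 0 is recorded for rank 1

b = Endpoint(rank=1, epoch=1)                 # rank 1 is replaced by a new process
try:
    a.accept(b.tag("hello again"))
    detected = False
except PeerRestarted:
    detected = True

after = a.accept(b.tag("resumed"))            # later traffic passes the check
```

The per-message cost here is the one extra integer mentioned above, which is what an application that does not need the check would avoid paying.
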
> It was suggested that the library could provide a generic means to
> piggyback this kind of information on each message, similar to what
> is being discussed for piggyback/message-logging-based fault
> tolerance.
> 
> Thomas
> 
>> Richard Graham wrote:
>>>
>>> Here is a summary of what I think that we agreed to today.  Please
>>> correct any errors, and add what I am missing.
>>>
>>> * We need to be able to restore MPI_COMM_WORLD (and its
>>>   derivatives) to a usable state when a process fails.
>>> * Restoration may involve having MPI_PROC_NULL replace the lost
>>>   process, or may replace the lost process with a new process
>>>   (we have not specified how this would happen).
>>> * Processes communicating directly with the failed process will
>>>   be notified of the failure via a returned error code.
>>> * When a process is notified of the failure, comm_repair() must be
>>>   called. comm_repair() is not a collective call; it is what
>>>   initiates the communicator repair associated with the failed process.
>>> * If a process wants to be notified of a process failure even if it
>>>   is not communicating directly with that process, it must register
>>>   for this notification.
>>> * We don't have enough information to know how to continue with
>>>   support for checkpoint/restart.
>>> * We need to discuss what needs to be done with respect to failure of
>>>   collective communications.
>>>
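The notification and repair flow in the list above can be modelled roughly as follows. This is a toy simulation in plain Python, not an MPI interface: `ToyComm`, `register`, `fail`, `send`, `repair`, and the two return codes are all hypothetical names, and the sketch assumes the repair path that substitutes an MPI_PROC_NULL-like placeholder for the lost process.

```python
PROC_NULL = None                    # stand-in for MPI_PROC_NULL in this toy model
SUCCESS, ERR_PROC_FAILED = 0, 1     # hypothetical return codes


class ToyComm:
    def __init__(self, size):
        self.slots = list(range(size))  # slot i holds rank i, or PROC_NULL
        self.failed = set()
        self.watchers = {}              # rank -> callback registered for notification

    def register(self, rank, callback):
        # A process not talking to the failed rank must opt in to hear about it.
        self.watchers[rank] = callback

    def fail(self, rank):
        self.failed.add(rank)
        for watcher, callback in self.watchers.items():
            if watcher != rank:
                callback(rank)

    def send(self, src, dest):
        if self.slots[dest] is PROC_NULL:
            return SUCCESS              # like MPI_PROC_NULL: the send is a no-op
        if dest in self.failed:
            return ERR_PROC_FAILED      # direct communicators learn via the error code
        return SUCCESS

    def repair(self, caller, dead):
        # Non-collective: a single caller triggers the repair.
        self.slots[dead] = PROC_NULL
        self.failed.discard(dead)


comm = ToyComm(4)
notified = []
comm.register(3, notified.append)       # rank 3 registers for failure notification
comm.fail(2)
before = comm.send(0, 2)                # rank 0 is told through the return code
comm.repair(0, 2)                       # rank 0 alone initiates the repair
after_repair = comm.send(0, 2)
```

The other restoration path in the list, replacing the lost process with a new one, is left out because the summary does not say how it would happen.
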
>>> There are several issues that came up with respect to these, which
>>> will be detailed later on.
>>>
>>> Rich
>>>
>>> _______________________________________________
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>
>>
>> --
>> Howard Pritchard
>> Cray Inc.
>>
> 
> 
