[Mpi3-ft] Summary of today's meeting

Thomas Herault herault.thomas at gmail.com
Tue Oct 21 22:52:53 CDT 2008


On 21 Oct 2008, at 22:06, Howard Pritchard wrote:

> Hello Rich,
>
> I thought it was also agreed that if process A communicates with  
> failed process B
> which had been restarted by another process C, and this was the  
> first communication
> from A to B since the restart of B, A would receive the equivalent
> of an ECONNRESET error.
> This was in the context of a case where option 5 below is not being  
> used by the app.
>
> Howard
>

Hello Howard,

There was still some discussion about this at the end of the session.

The argument is that the application could enforce this detection as
well as the library could, if it is needed: when a process is launched
to replace another one, it could define a new revision/epoch/restart
number and tag each communication with this number to implement the
check. If the application can do this as efficiently as the library
would, leaving it to the application spares the library the additional
cost (i.e. piggybacking an integer on each message) when the
application does not need that functionality.
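
To make the idea concrete, here is a minimal application-level sketch
in plain Python (no MPI calls). It assumes one possible reading of the
scheme: each process tags its outgoing messages with its own restart
number, and the receiver compares that number with the last one it saw
from the same source. The names Envelope, EpochTracker and
PeerRestarted are hypothetical and were not part of the discussion.

```python
from dataclasses import dataclass
from typing import Any, Dict


class PeerRestarted(Exception):
    """Hypothetical analogue of ECONNRESET: the sender has been restarted."""


@dataclass
class Envelope:
    source: int   # rank of the sending process
    epoch: int    # sender's restart number, piggybacked on every message
    payload: Any


class EpochTracker:
    """Receiver-side check of the epoch carried by each incoming message."""

    def __init__(self) -> None:
        # Last epoch observed for each source rank.
        self.last_seen: Dict[int, int] = {}

    def check(self, msg: Envelope) -> Any:
        known = self.last_seen.get(msg.source)
        if known is None or msg.epoch == known:
            # First contact, or same incarnation as before: deliver.
            self.last_seen[msg.source] = msg.epoch
            return msg.payload
        if msg.epoch > known:
            # Newer incarnation: record it and report the restart once.
            self.last_seen[msg.source] = msg.epoch
            raise PeerRestarted(f"rank {msg.source} restarted (epoch {msg.epoch})")
        # Older epoch: message from a previous incarnation.
        raise ValueError(f"stale message from rank {msg.source}")
```

In this sketch only the first message received after a restart raises
PeerRestarted, and that message is not delivered; later messages with
the same epoch pass through normally. The cost is the one integer per
message mentioned above.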

It was suggested that the library could provide a generic means of
piggybacking this kind of information on each message, similar to
what is being discussed for piggyback/message-logging-based fault
tolerance.

Thomas

> Richard Graham wrote:
>>
>> Here is a summary of what I think that we agreed to today.  Please  
>> correct any errors, and add what I am missing.
>>
>> 	• We need to be able to restore MPI_COMM_WORLD (and its
>> derivatives) to a usable state when a process fails.
>> 	• Restoration may involve having MPI_PROC_NULL replace the lost
>> process, or may replace the lost process with a new process
>> (how this would happen has not been specified).
>> 	• Processes communicating directly with the failed processes will  
>> be notified via a returned error code about the failure.
>> 	• When a process is notified of the failure, comm_repair() must be  
>> called.  Comm_repair() is not a collective call, and is what will  
>> initiate the communicator repair associated with the failed process.
>> 	• If a process wants to be notified of process failure even if it  
>> is not communicating directly with this process, it must register  
>> for this notification.
>> 	• We don’t have enough information to know how to continue with  
>> support for checkpoint/restart.
>> 	• We need to discuss what needs to be done with respect to
>> failure of collective communications.
>>
>> There are several issues that came up with respect to these, which  
>> will be detailed later on.
>>
>> Rich
>>
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>
>
>
> -- 
>
> Howard Pritchard
> Cray Inc.
>




