[Mpi3-ft] Summary of today's meeting

Richard Graham rlgraham at ornl.gov
Tue Oct 21 21:02:51 CDT 2008


Here is a summary of what I think we agreed to today.  Please correct
any errors, and add anything I am missing.

* We need to be able to restore MPI_COMM_WORLD (and its derivatives) to a
usable state when a process fails.
* Restoration may involve having MPI_PROC_NULL replace the lost process, or
may replace the lost process with a new process (we have not specified how
this would happen).
* Processes communicating directly with the failed processes will be
notified via a returned error code about the failure.
* When a process is notified of the failure, comm_repair() must be called.
comm_repair() is not a collective call; it is what initiates the
communicator repair associated with the failed process.
* If a process wants to be notified of process failure even if it is not
communicating directly with this process, it must register for this
notification. 
* We don't have enough information to know how to continue with support for
checkpoint/restart.
* We need to discuss what we need to do with respect to failure of
collective communications.

There are several issues that came up with respect to these, which will be
detailed later on.
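
To make the flow in the list above concrete, here is a minimal toy model of
it. It is only a sketch under stated assumptions, not a proposed interface:
it does not use MPI at all, and every name in it (ToyComm, kill,
register_failure_notification, ERR_PROC_FAILED, the process ids) is
hypothetical. Only comm_repair() and the MPI_PROC_NULL replacement idea come
from the summary itself. The replacement-process path is left as an optional
argument because how it would happen has not been specified.

```python
from dataclasses import dataclass, field

PROC_NULL = -1        # stand-in for MPI_PROC_NULL
SUCCESS = 0           # stand-in for MPI_SUCCESS
ERR_PROC_FAILED = 1   # hypothetical error code for "peer has failed"


@dataclass
class ToyComm:
    """Toy communicator: slots[i] is the process currently at rank i."""
    slots: list
    failed: set = field(default_factory=set)      # ranks that have failed
    watchers: dict = field(default_factory=dict)  # rank -> callback

    def register_failure_notification(self, rank, callback):
        # Opt-in notification for a rank that is not talking to the
        # failed process directly (hypothetical interface).
        self.watchers[rank] = callback

    def kill(self, rank):
        # Simulate a process failure and notify registered survivors.
        self.failed.add(rank)
        for watcher, callback in self.watchers.items():
            if watcher not in self.failed:
                callback(rank)

    def send(self, src, dest, payload):
        # A send to PROC_NULL completes successfully and does nothing,
        # mirroring MPI_PROC_NULL semantics.
        if self.slots[dest] == PROC_NULL:
            return SUCCESS
        # Direct communication with a failed process returns an error.
        if dest in self.failed:
            return ERR_PROC_FAILED
        return SUCCESS

    def comm_repair(self, failed_rank, replacement=None):
        # Non-collective: a single caller initiates the repair.
        # Default is to put PROC_NULL in the lost process's slot.
        self.slots[failed_rank] = (
            PROC_NULL if replacement is None else replacement
        )
        self.failed.discard(failed_rank)


comm = ToyComm(slots=[100, 101, 102, 103])  # hypothetical process ids
notified = []
comm.register_failure_notification(3, notified.append)

comm.kill(1)                                # rank 1 fails
rc = comm.send(src=0, dest=1, payload="x")  # rank 0 sees the error code
if rc == ERR_PROC_FAILED:
    comm.comm_repair(1)                     # rank 0 alone starts the repair
rc_after = comm.send(src=0, dest=1, payload="x")
```

In this toy run, rank 0 learns of the failure through the error code returned
by its send, rank 3 learns of it only because it registered, and after the
repair a send to rank 1 behaves like a send to MPI_PROC_NULL. Collective
operations and checkpoint/restart are deliberately left out, since those are
still open.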

Rich
