Richard Graham rlgraham at ornl.gov
Tue Oct 21 21:02:51 CDT 2008

Here is a summary of what I think that we agreed to today.  Please correct
any errors, and add what I am missing.

* We need to be able to restore MPI_COMM_WORLD (and its derivatives) to a
usable state when a process fails.
* Restoration may involve having MPI_PROC_NULL replace the lost process, or
may replace the lost process with a new process (we have not specified how
this would happen).
* Processes communicating directly with the failed processes will be
notified via a returned error code about the failure.
* When a process is notified of the failure, comm_repair() must be called.
Comm_repair() is not a collective call, and is what will initiate the
communicator repair associated with the failed process.
* If a process wants to be notified of process failure even if it is not
communicating directly with this process, it must register for this
* We don't have enough information to know how to continue with support for
* We need to discuss what needs to be done with respect to failure of collective
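
As a rough illustration of the point-to-point flow in the bullets above (an
error code returned to the process that talks to the failed peer, followed by
a local, non-collective comm_repair() call that substitutes MPI_PROC_NULL),
here is a toy single-process model in C. It does not use MPI at all: every
toy_* name and the error code are made up for this sketch, and nothing here
reflects an agreed interface. It only assumes the MPI_PROC_NULL replacement
option, not the "new process" option.

```c
#include <assert.h>

#define TOY_NPROCS 4
#define TOY_PROC_NULL (-1)       /* stand-in for MPI_PROC_NULL */
#define TOY_SUCCESS 0
#define TOY_ERR_PROC_FAILED 1    /* hypothetical error code, not from MPI */

/* One process's local view of a communicator (hypothetical layout). */
typedef struct {
    int peer[TOY_NPROCS];   /* rank -> process slot, or TOY_PROC_NULL after repair */
    int alive[TOY_NPROCS];  /* 1 while the process behind that rank is running */
} toy_comm;

/* Start with every rank mapped to a live process. */
void toy_comm_init(toy_comm *c) {
    for (int r = 0; r < TOY_NPROCS; r++) {
        c->peer[r] = r;
        c->alive[r] = 1;
    }
}

/* Point-to-point send: only a caller that actually targets the failed
 * rank sees the error. A rank already replaced by TOY_PROC_NULL behaves
 * like a send to MPI_PROC_NULL, i.e. it succeeds as a no-op. */
int toy_send(const toy_comm *c, int dest) {
    if (c->peer[dest] == TOY_PROC_NULL)
        return TOY_SUCCESS;
    if (!c->alive[dest])
        return TOY_ERR_PROC_FAILED;
    return TOY_SUCCESS;
}

/* Non-collective repair: the notified process calls this on its own and
 * the failed rank is replaced by TOY_PROC_NULL in its local view. */
void toy_comm_repair(toy_comm *c, int failed_rank) {
    assert(failed_rank >= 0 && failed_rank < TOY_NPROCS);
    c->peer[failed_rank] = TOY_PROC_NULL;
}
```

In this model, a caller that gets TOY_ERR_PROC_FAILED back from toy_send()
calls toy_comm_repair() for that rank and can then keep using the
communicator; later sends to the repaired rank complete as no-ops.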

There are several issues that came up with respect to these, which will be
detailed later on.


