[Mpi3-ft] Summary of today's meeting

Erez Haba erezh at MICROSOFT.com
Wed Oct 22 20:52:45 CDT 2008


Thanks for capturing this.

My comments inline...

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Richard Graham
Sent: Tuesday, October 21, 2008 9:03 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: [Mpi3-ft] Summary of today's meeting

Here is a summary of what I think that we agreed to today.  Please correct any errors, and add what I am missing.

 *   We need to be able to restore MPI_COMM_WORLD (and its derivatives) to a usable state when a process fails.
[erezh] I think that we discussed this with reference to the comment that MPI is not usable once it has returned an error. We need to address that in the current standard. (I think that this should be the first item on the list.)
[erezh] As I recall, the second item on the list is returning errors per call site (per the Error Reporting Rules proposal).
[erezh] As for this specific item, I think that the wording should be "repair" rather than "restore" (where repair is either making a "hole" in the communicator or "filling" the hole with a new process).

 *   Restoration may involve having MPI_PROC_NULL replace the lost process, or may replace the lost process with a new process (we have not specified how this would happen).
[erezh] again I would replace "restoration" with "repair"
[erezh] We said that we can use MPI_PROC_NULL for making a "hole", i.e., the communicator will not be in the error state anymore (thus you can receive from MPI_ANY_SOURCE or use a collective); however, any direct communication with the "hole" rank is like using MPI_PROC_NULL.
[erezh] We also said that replacing the lost process with a new one only applies to MPI_COMM_WORLD.
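
To make the "hole" semantics above concrete, here is a minimal sketch as a pure-Python toy model. It is not MPI code and not a proposed API: ToyComm, CommError, mark_failed, repair_with_hole and recv_any_source are hypothetical names invented for illustration, and PROC_NULL is only a stand-in for MPI_PROC_NULL. As a simplification, the model rejects every operation while the communicator is unrepaired.

```python
# Toy model only: none of these names are real MPI calls.
PROC_NULL = -1  # stand-in for MPI_PROC_NULL in this model


class CommError(Exception):
    """Raised while the toy communicator is still in the error state."""


class ToyComm:
    def __init__(self, size):
        self.size = size
        self.failed = set()  # ranks known to be dead but not yet repaired
        self.holes = set()   # ranks that were repaired as "holes"
        self.mailboxes = {rank: [] for rank in range(size)}

    def mark_failed(self, rank):
        self.failed.add(rank)

    def repair_with_hole(self, rank):
        # The rank keeps its number but from now on behaves like PROC_NULL.
        self.failed.discard(rank)
        self.holes.add(rank)
        self.mailboxes[rank].clear()

    def _check_usable(self):
        if self.failed:
            raise CommError("communicator needs repair")

    def send(self, source, dest, payload):
        self._check_usable()
        target = PROC_NULL if dest in self.holes else dest
        if target == PROC_NULL:
            return False  # silently dropped, like a send to MPI_PROC_NULL
        self.mailboxes[target].append((source, payload))
        return True

    def recv_any_source(self, rank):
        # Usable again after repair, mirroring a receive from MPI_ANY_SOURCE.
        self._check_usable()
        box = self.mailboxes[rank]
        return box.pop(0) if box else None
```

In this model, a send to the repaired rank returns False without raising, while traffic between surviving ranks and any-source receives proceed normally once the hole has been made.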

 *   Processes communicating directly with the failed processes will be notified via a returned error code about the failure.
 *   When a process is notified of the failure, comm_repair() must be called.  Comm_repair() is not a collective call, and is what will initiate the communicator repair associated with the failed process.
[erezh] We also discussed a "generation" or "revision" of a process rank to identify whether a process was recycled. I think that we ended up saying that it is not really required, and that it is the application's responsibility to identify a restored process where there might be a dependency on previous communication (with other ranks).
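
One way an application could take on that responsibility itself is to carry its own incarnation identifier in its messages and compare it on receipt. The sketch below is a pure-Python illustration under that assumption; PeerTracker and observe are hypothetical names, and nothing here is part of MPI or of the comm_repair() proposal.

```python
class PeerTracker:
    """Application-level bookkeeping of which incarnation of each rank was last seen.

    Assumes (hypothetically) that every process attaches an incarnation id,
    chosen by the application at startup, to the messages it sends.
    """

    def __init__(self):
        self.last_seen = {}  # rank -> incarnation id last observed

    def observe(self, rank, incarnation):
        """Record the incarnation; return True if the rank looks recycled."""
        previous = self.last_seen.get(rank)
        self.last_seen[rank] = incarnation
        return previous is not None and previous != incarnation
```

A receiver would call observe(source_rank, incarnation) for each incoming message and discard any state that depended on earlier communication with that rank when it returns True.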

 *   If a process wants to be notified of process failure even if it is not communicating directly with this process, it must register for this notification.
 *   We don't have enough information to know how to continue with support for checkpoint/restart.
[erezh] We discussed system-level checkpoint/restart versus application-aware checkpoint/restart.

 *   We need to discuss what needs to be done with respect to failure of collective communications.
[erezh] We raised the issue of identifying an asymmetric view of the communicator after a "hole" repair and its impact on collectives (e.g., the link between ranks 2 and 3 is broken, but both can still communicate with rank 1). Furthermore, we explored a possible solution: adding information to the collective message(s) to identify that the communicator view is consistent. (We said that this requires further exploration.)
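
The asymmetric-view problem can be illustrated with a small pure-Python sketch. It assumes, hypothetically, that the extra information attached to a collective message is a canonical tag of the set of ranks the sender treats as holes; view_tag and views_consistent are names made up for this example, and the real mechanism was left open at the meeting.

```python
def view_tag(holes):
    """Canonical, comparable form of one rank's view of the holes."""
    return tuple(sorted(holes))


def views_consistent(views):
    """views maps each participating rank to the set of ranks it treats as holes.

    Returns True only if every participant carries the same tag.
    """
    tags = {view_tag(holes) for holes in views.values()}
    return len(tags) <= 1


# The example from the discussion: ranks 2 and 3 cannot reach each other,
# so each treats the other as a hole, while rank 1 sees both as alive.
asymmetric = {1: set(), 2: {3}, 3: {2}}

# A symmetric case: every survivor agrees that rank 2 is a hole.
symmetric = {0: {2}, 1: {2}, 3: {2}}
```

Here views_consistent(asymmetric) is False and views_consistent(symmetric) is True, so a collective that compared tags could at least detect the mismatch before proceeding.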

There are several issues that came up with respect to these, which will be detailed later on.

Rich