[Mpi3-ft] Summary of today's meeting

Narasimhan, Kannan kannan.narasimhan at hp.com
Thu Oct 23 14:52:57 CDT 2008

Some more notes from our discussion on the topic of MPI standard support for "checkpoint/restart":

We grouped C/R under two categories: application-directed and system-level.
System-level C/R can be accomplished via many techniques: intercepting every level of system stack, using virtualization techniques, etc.
Application-directed C/R will still require some quiescence hooks from the MPI layer (e.g., asynchronous progression by the MPI layer). There was some discussion on this.
The MPI requirements for system-level checkpointing cannot be formulated until we get more data to define a "quiet state".

I queried Mike Hefner on the semantics of freeze/unfreeze in their (Evergrid/Librato) transparent C/R approach, and here is his response:

Question 1: What is your definition of a quiet state (after the freeze call)? Do you expect the MPI to unpin memory? Free resources? Or just quiet the message traffic? We need to explicitly state the semantics here ...

We defined it as a state that will provide a consistent state of the application across all processes. From the MPI standpoint, this would mean a state whereby all processes in the "freeze" state would be able to continue communication if a restart were invoked.

In terms of particular resources, our CP/R software manages storing all application and, optionally, all MPI memory. This includes memory that has been allocated by either a malloc(3) call or a mmap/mremap call. If that memory has been pinned by the IB driver, we will store it to disk as well. We also store the primary process resources in use: IPCs, shared memory, file handles and file rollback state, etc.

These memory regions and other resources are recorded after each process returns from the freeze API.

Question 2: The same goes for restore. What is expected to be there, and what is expected to be supplied as the context?

On a restore all of the memory (and other resources such as IPCs, open files, etc.) will be recreated and reloaded with the state that was recorded at checkpoint time *before* the restart API is called. On the restart, it is expected that the MPI stack reinitialize the interconnect card, recreate necessary handles for fabric communication, and re-pin all previously pinned memory regions in use by the fabric's card.
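The restart-time responsibilities described above might look like the following sketch. This is pseudocode: `mpi_restart_hook`, `fabric_reinit`, and `recreate_endpoints` are invented placeholders, while `ibv_reg_mr` is the actual libibverbs call used to pin memory for an InfiniBand HCA:

```
/* Pseudocode sketch of the MPI stack's restart hook, per the
 * description above. All names invented except ibv_reg_mr. */
void mpi_restart_hook(void)
{
    fabric_reinit();        /* reinitialize the interconnect card   */
    recreate_endpoints();   /* recreate handles for fabric comm.    */
    for (each previously pinned region r)
        ibv_reg_mr(pd, r.addr, r.len, r.access);  /* re-pin for the HCA */
}
```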


From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Erez Haba
Sent: Wednesday, October 22, 2008 8:53 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Summary of today's meeting

Thanks for capturing this.

My comments inline...

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Richard Graham
Sent: Tuesday, October 21, 2008 9:03 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: [Mpi3-ft] Summary of today's meeting

Here is a summary of what I think we agreed to today.  Please correct any errors, and add anything I am missing.

 *   We need to be able to restore MPI_COMM_WORLD (and its derivatives) to a usable state when a process fails.
[erezh] I think that we discussed this with reference to the comment that MPI is not usable once it has returned an error. We need to address that in the current standard. (I think that this should be the first item on the list.)
[erezh] As I recall, the second item on the list is returning errors per call site (per the Error Reporting Rules proposal).
[erezh] As for this specific item, I think that the wording should be "repair" rather than "restore" (where repair is either making a "hole" in the communicator or "filling" the hole with a new process).

 *   Restoration may involve having MPI_PROC_NULL replace the lost process, or may replace the lost process with a new process (we have not specified how this would happen)
[erezh] again I would replace "restoration" with "repair"
[erezh] We said that we can use MPI_PROC_NULL for making a "hole"; i.e., the communicator will not be in the error state anymore (thus you can receive from MPI_ANY_SOURCE or use a collective), but any direct communication with the "hole" rank is like using MPI_PROC_NULL.
[erezh] We also said that replacing the lost process with a new one only applies to MPI_COMM_WORLD.

 *   Processes communicating directly with the failed processes will be notified via a returned error code about the failure.
 *   When a process is notified of the failure, comm_repair() must be called.  comm_repair() is not a collective call, and it initiates the communicator repair associated with the failed process.
[erezh] we also discussed a "generation" or "revision" of a process rank to identify if a process was recycled. I think that we ended up saying that it is not really required and it is the application's responsibility to identify a restored process where there might be a dependency on previous communication (with other ranks)

 *   If a process wants to be notified of process failure even if it is not communicating directly with this process, it must register for this notification.
 *   We don't have enough information to know how to continue with support for checkpoint/restart.
[erezh] we discussed system-level checkpoint/restart versus application-aware checkpoint/restart

 *   We need to discuss what needs to be done with respect to failure of collective communications.
[erezh] we raised the issue of identifying an asymmetric view of the communicator after a "hole" repair and its impact on collectives (e.g., the link between ranks 2 and 3 is broken but they can both communicate with rank 1). Furthermore, we explored a possible solution of adding information to the collective message(s) to identify that the communicator view is consistent. (We said that it requires further exploration.)

There are several issues that came up with respect to these, which will be detailed later on.
