[Mpi3-ft] system-level C/R requirements

Greg Bronevetsky bronevetsky1 at llnl.gov
Mon Oct 27 12:24:20 CDT 2008

I think that solution (b) can be done without explicit coordination
help from the communication library if the only protocol we care about
is sync-and-stop. You're right, though: it would be easier if the
communication library invoked the sequential checkpointer rather than
the other way around. This would also take care of protocols where we
need to force checkpoints. One thing that we'll need to be careful
about is situations where multiple communication libraries are
involved, such as hybrid MPI/OpenMP applications. In theory it is
possible to use complex protocols at both levels of parallelism, and
we need to be careful not to make some protocol combinations
impossible.
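
To make the sync-and-stop case concrete, here is a minimal sketch in
which the communication layer, not the checkpointer, drives the
protocol: it stops sending, drains its channels, and only then invokes
a sequential (single-process) checkpointer. Everything in it is an
assumption for illustration: threads stand in for MPI processes, and
CommLayer, sync_and_stop and sequential_checkpointer are hypothetical
names, not part of any MPI library.

```python
import copy
import queue
import threading


class CommLayer:
    """Toy stand-in for a communication library (hypothetical, not an MPI API)."""

    def __init__(self, nranks, checkpointer):
        self.nranks = nranks
        self.inbox = [queue.Queue() for _ in range(nranks)]
        self.barrier = threading.Barrier(nranks)
        self.checkpointer = checkpointer  # sequential, per-process checkpointer

    def send(self, dst, msg):
        # Delivery is immediate in this toy: the message lands in dst's inbox.
        self.inbox[dst].put(msg)

    def sync_and_stop(self, rank, state):
        # 1. Sync: once every rank reaches this barrier, nobody is sending.
        self.barrier.wait()
        # 2. Drain: pull everything still sitting in this rank's channel
        #    into local state so that no message is lost by the snapshot.
        while True:
            try:
                state["received"].append(self.inbox[rank].get_nowait())
            except queue.Empty:
                break
        # 3. Stop: channels are empty, so the library calls the checkpointer.
        self.checkpointer(rank, state)
        # 4. Resume only after every rank has saved its state.
        self.barrier.wait()


def run_demo(nranks=3):
    checkpoints = {}
    lock = threading.Lock()

    def sequential_checkpointer(rank, state):
        # Saves one process only; it knows nothing about the other ranks.
        with lock:
            checkpoints[rank] = copy.deepcopy(state)

    comm = CommLayer(nranks, sequential_checkpointer)

    def rank_main(rank):
        state = {"rank": rank, "received": []}
        comm.send((rank + 1) % nranks, rank)  # one message to the right neighbour
        comm.sync_and_stop(rank, state)

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints, comm
```

Calling run_demo() returns one saved state per rank plus the
communication layer, whose inboxes are empty at that point. The
checkpointer never has to coordinate anything itself, which is why
this direction of invocation is the easier one.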

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov

At 07:18 AM 10/27/2008, Thomas Herault wrote:

>On 27 Oct. 08, at 14:59, Mike Heffner wrote:
>>I was referring to the MPI stack supporting the PREPARE and RESTORE
>>calls. Ideally we would like to take existing MPI applications and
>>add C/R support without any modifications to their source.
>>Similarly, if we can separate the requirements of C/R from
>>the application design, this allows the developer to focus purely on
>>the problem their application is trying to solve and not on C/R.
>As a foreword, I agree with Josh that we already discussed this in
>Chicago, and we decided to first try to support application-based
>fault tolerance, and see later if we can/should help
>system-level/transparent fault tolerance.
>As has already been said on the list, transparent checkpointing
>(i.e. checkpointing without modification of the application code) can
>be done without an official MPI routine exposed by the MPI library. It
>depends on how you stack the checkpointing mechanism and the
>communication library.
>There are at least two ways to do system-level checkpointing with MPI:
>a) If the communication library is using (i.e. calling) a
>checkpointing mechanism, it can implement transparent system-level
>checkpointing. It will ensure network quiescence when/if needed,
>since it is handling the network communications anyway. In this case,
>the checkpointing mechanism saves only one process, and it's the role
>of the MPI library to build a distributed snapshot from the collection
>of per-process checkpoints.
>b) If, on the other hand, you want to put the checkpointing mechanism
>below the MPI library (so, the MPI library does not use the
>checkpointing mechanism), you cannot ensure the coherency of the
>different checkpoints without help from the communication library.
>Hence, you need PREPARE/RESTORE calls exposed from the MPI library to
>the lower level.
>With solution a), we are simply library/checkpoint mechanism
>dependent. With solution b), we require every MPI library to
>implement a pair of calls that is not easy to define, and it seems to
>me that the real semantics of these calls will be system dependent, as
>already pointed out on the list.
>So, my question is: what are the benefits of solution b) as compared
>to solution a)?
>>However, if you intend to support C/R in a way that is completely
>>separate from the MPI use of the application then you will have
>>scenarios in which a checkpoint is initiated while the application is
>>blocked in an MPI_Recv(), MPI_Barrier(), or any other blocking
>>operation. Therefore, for the most value the PREPARE and RESTORE
>>operations should be able to be invoked asynchronously during the
>>application's MPI communication.
>>Supalov, Alexander wrote:
>>>Thanks. What stack do you mean here?
>>>-----Original Message-----
>>>From: mpi3-ft-bounces at lists.mpi-forum.org
>>>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Mike
>>>Sent: Saturday, October 25, 2008 2:48 AM
>>>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>Subject: Re: [Mpi3-ft] system-level C/R requirements
>>>Supalov, Alexander wrote:
>>>>Thanks. I think the word "how" below is decisive.
>>>>The definitions of MPI_Init and MPI_Finalize do not say "how"
>>>>are created, and still, they work. Likewise, as soon as we can
>>>>the expected outcome of the proposed calls, we can offload the "how" to
>>>>the system - in this case, the CR system.
>>>>Now we come to the expected outcome. Imagine we guarantee that there is
>>>>no MPI communication between the PREPARE and RESTORE calls, and no
>>>>messages stuck on the wire or in the buffers. What can be stored in the
>>>>system memory covered by CR will be stored there. The rest will be
>>>>restored by the RESTORE call once it gets control over this memory
>>>>back. This may include reinitialization of the networking hardware,
>>>>reestablishment of connections, reopening of the files, etc.
>>>>What other guarantees do CR people want?
>>>If the stack supported these calls asynchronously during MPI
>>>communication -- either from a signal handler or from a second
>>>thread --
>>>then I think that definition would go a fair way towards what would
>>>be required.
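
The asynchronous case discussed above (a checkpoint requested from a
second thread while the application is blocked in a receive) can be
sketched roughly as follows. This is an illustration under stated
assumptions, not a proposed definition of the calls: ToyComm and its
prepare()/restore() methods are hypothetical names, a Python condition
variable stands in for the network, and setting a flag stands in for
tearing down and re-establishing connections.

```python
import threading
from collections import deque


class ToyComm:
    """Hypothetical communication layer with PREPARE/RESTORE-style hooks."""

    def __init__(self):
        self._cv = threading.Condition()
        self._queue = deque()
        self._quiesced = False
        self.connected = True

    def send(self, msg):
        with self._cv:
            # No new traffic may enter between prepare() and restore().
            self._cv.wait_for(lambda: not self._quiesced)
            self._queue.append(msg)
            self._cv.notify_all()

    def recv(self):
        with self._cv:
            # A blocked receive stays parked while the layer is quiesced.
            self._cv.wait_for(lambda: bool(self._queue) and not self._quiesced)
            return self._queue.popleft()

    def prepare(self):
        # Callable from another thread while recv() is blocked.
        with self._cv:
            self._quiesced = True
            self.connected = False    # stands in for releasing network state
            return list(self._queue)  # buffered messages belong to the snapshot

    def restore(self):
        with self._cv:
            self.connected = True     # stands in for re-establishing connections
            self._quiesced = False
            self._cv.notify_all()


def demo():
    comm = ToyComm()
    result = []
    app = threading.Thread(target=lambda: result.append(comm.recv()))
    app.start()                       # the application blocks in recv()
    snapshot = comm.prepare()         # checkpoint requested from a second thread
    sender = threading.Thread(target=comm.send, args=("hello",))
    sender.start()
    sender.join(timeout=0.2)
    held = sender.is_alive()          # the send is held back until restore()
    comm.restore()
    sender.join()
    app.join()
    return snapshot, held, result
```

In demo(), the snapshot taken by prepare() contains no in-flight
messages, the send issued during the quiesced window is held until
restore(), and the blocked receive completes normally afterwards. A
real implementation would also have to deal with partially transferred
messages and signal-handler safety, which this sketch ignores.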
>>mpi3-ft mailing list
>>mpi3-ft at lists.mpi-forum.org
>>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
