[Mpi3-ft] system-level C/R requirements

Thomas Herault herault.thomas at gmail.com
Mon Oct 27 09:18:51 CDT 2008

On 27 Oct 2008, at 14:59, Mike Heffner wrote:

> I was referring to the MPI stack supporting the PREPARE and RESTORE
> calls. Ideally we would like to take existing MPI applications and
> add C/R support without any modifications to their source.
> Similarly, if we can separate the C/R requirements from
> the application design, this allows the developer to focus purely on
> the problem their application is trying to solve and not on C/R
> requirements.

As a foreword, I agree with Josh that we already discussed this in
Chicago, and we decided to first try to support application-based
fault tolerance, and to see later whether we can/should help with
system-level/transparent fault tolerance.

As has already been said on the list, transparent checkpointing
(i.e. checkpointing without modification of the application code) can  
be done without an official MPI routine exposed by the MPI library. It  
depends on how you stack the checkpointing mechanism and the  
communication library.

There are at least two ways to do system-level checkpointing with MPI:

a) If the communication library is using (i.e. calling) a
checkpointing mechanism, it can implement transparent system-level
checkpointing. It will ensure network quiescence when/if needed,
since it is handling the network communications anyway. In this case,
the checkpointing mechanism saves only one process, and it is the role
of the MPI library to build a distributed snapshot from the collection
of per-process checkpoints.

b) If, on the other hand, you want to put the checkpointing mechanism  
below the MPI library (so, the MPI library does not use the  
checkpointing mechanism), you cannot ensure the coherency of the  
different checkpoints without help from the communication library.  
Hence, you need PREPARE/RESTORE calls exposed from the MPI library to  
the lower level.
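
To make the two stackings concrete, here is a toy Python sketch. It is
not MPI code: ToyComm, save_process, external_checkpoint and the
prepare/restore methods are hypothetical names of mine, and the "wire"
is just a queue standing in for messages in flight. It only illustrates
why a single-process checkpointer needs the communication layer to
quiesce the network before the per-process images are taken.

```python
from collections import deque


class ToyComm:
    """Toy stand-in for a communication library (hypothetical, not the MPI API)."""

    def __init__(self, nprocs):
        self.state = [0] * nprocs  # per-process memory, visible to the checkpointer
        self.wire = deque()        # in-flight (dst, value) messages, in no process image
        self.paused = False

    def send(self, dst, value):
        if self.paused:
            raise RuntimeError("no communication between prepare() and restore()")
        self.wire.append((dst, value))

    def drain(self):
        # Deliver everything in flight so that nothing is left on the wire.
        while self.wire:
            dst, value = self.wire.popleft()
            self.state[dst] += value

    def checkpoint(self, save_one):
        # Stacking a): the library calls the single-process checkpointer itself.
        self.drain()
        return [save_one(self, r) for r in range(len(self.state))]

    def prepare(self):
        # Stacking b): hook called by an external checkpointer before saving.
        self.drain()
        self.paused = True

    def restore(self):
        # Stacking b): hook called after saving (or after a restart).
        self.paused = False


def save_process(comm, rank):
    """Single-process checkpointer: sees one process's memory, never the wire."""
    return comm.state[rank]


def external_checkpoint(comm):
    """Stacking b): a checkpointer outside the library drives the two hooks."""
    comm.prepare()
    try:
        return [save_process(comm, r) for r in range(len(comm.state))]
    finally:
        comm.restore()
```

In this model, saving each process right after send(1, 5) without
draining yields [0, 0], because the message on the wire is lost. Both
comm.checkpoint(save_process) and external_checkpoint(comm) yield
[0, 5]; the only difference is which layer is in charge.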

With solution a), we simply depend on the particular library and
checkpointing mechanism. With solution b), we require every MPI
library to implement a pair of calls that is not that easy to define,
and it seems to me that the real semantics of these calls will be
system dependent, as already pointed out on the list.
So, my question is: what are the benefits of solution b) as compared
to solution a)?


> However, if you intend to support C/R in a way that is completely  
> separate from the MPI use of the application then you will have  
> scenarios in which a checkpoint is initiated while the application is
> blocking on an MPI_Recv(), MPI_Barrier(), or any other blocking  
> operation. Therefore, for the most value the PREPARE and RESTORE  
> operations should be able to be invoked asynchronously during the  
> application's MPI communication.
> Mike
> Supalov, Alexander wrote:
>> Thanks. What stack do you mean here?
>> -----Original Message-----
>> From: mpi3-ft-bounces at lists.mpi-forum.org
>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Mike Heffner
>> Sent: Saturday, October 25, 2008 2:48 AM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: Re: [Mpi3-ft] system-level C/R requirements
>> Supalov, Alexander wrote:
>>> Thanks. I think the word "how" below is decisive.
>>> The definitions of MPI_Init and MPI_Finalize do not say "how" processes
>>> are created, and still, they work. Likewise, as soon as we can define
>>> the expected outcome of the proposed calls, we can offload the "how" to
>>> the system - in this case, the CR system.
>>> Now we come to the expected outcome. Imagine we guarantee that there's
>>> no MPI communication between the PREPARE and RESTORE calls, and no
>>> messages stuck in the wire or in the buffers. What can be stored in the
>>> system memory covered by CR will be stored there. The rest will be
>>> restored by the RESTORE call once it gets control over this memory image
>>> back. This may include reinitialization of the networking hardware,
>>> reestablishment of connections, reopening of the files, etc.
>>> What other guarantees do CR people want?
>> If the stack supported these calls asynchronously during MPI
>> communication -- either from a signal handler or from a second thread --
>> then I think that definition would go a fair way towards what would
>> be required.
>> Mike
