[Mpi3-ft] system-level C/R requirements

Thomas Herault herault.thomas at gmail.com
Mon Oct 27 13:50:51 CDT 2008

On 27 Oct 2008, at 17:16, Joseph Ruscio wrote:

> On Oct 27, 2008, at 7:18 AM, Thomas Herault wrote:
>> So, my question is: what are the benefits of solution b) as  
>> compared to solution a)?
> If you go with solution b), the only thing the MPI implementation  
> needs to worry about is PREPAREing and RESTOREing the network state.  
> All of the checkpoint-specific issues and complexity, such as queuing  
> system integration, checkpoint file storage options like redundant  
> local storage, etc., are left to the checkpointer.

Except that, as pointed out before, what PREPARE and RESTORE must do  
depends heavily on the system and on the checkpoint mechanism.

> In case a), MPI implementations need to concern themselves with  
> these issues AND support the non-standard checkpoint invocation  
> APIs for every different desired system checkpointer.

Agreed. The point of case a) is that we modify the MPI library knowing  
exactly what the checkpoint mechanism saves. Since we are handling the  
network specific parts, we can find workarounds for things like pinned  
pages, without too much impact on the network state (like having to  
completely close the network interface, to re-open it in RESTORE).
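
To make the trade-off concrete, here is a toy sketch in Python. It is not MPI code: `ToyNetwork`, `checkpoint_generic` and `checkpoint_tailored` are hypothetical names, and the assumption that the checkpointer only has trouble with pinned pages is made up for the illustration. It only shows why knowing what the checkpoint mechanism saves can avoid a full close and re-open of the interface.

```python
class ToyNetwork:
    """Stand-in for a network interface with registered (pinned) memory."""

    def __init__(self):
        self.is_open = True
        self.pinned_pages = {0x1000, 0x2000}  # hypothetical page addresses
        self.reopen_count = 0                 # how often we paid for a re-open

    def close(self):
        self.is_open = False
        self.pinned_pages.clear()

    def reopen(self):
        self.is_open = True
        self.reopen_count += 1


def checkpoint_generic(net):
    """No knowledge of the checkpointer: tear the interface down completely."""
    net.close()
    snapshot = {"pinned": set()}  # nothing network-related survives
    net.reopen()                  # connections and registrations must be rebuilt
    return snapshot


def checkpoint_tailored(net):
    """Case a): assumes the checkpointer only chokes on pinned pages,
    so the library unpins them and keeps the interface open."""
    saved = set(net.pinned_pages)
    net.pinned_pages.clear()      # memory is now saveable
    snapshot = {"pinned": saved}  # remember what to re-register
    net.pinned_pages |= saved     # re-pin once the image has been taken
    return snapshot


if __name__ == "__main__":
    for checkpoint in (checkpoint_generic, checkpoint_tailored):
        net = ToyNetwork()
        checkpoint(net)
        print(checkpoint.__name__, "re-opens:", net.reopen_count)
```

The tailored path is only valid for the one checkpointer whose behaviour it assumes, which is exactly the cost of case a).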

> So that's a question of whether the responsibility for system-level  
> checkpointer integration lies with the MPI implementor or the  
> individual checkpoint implementors.
> Going with b) gives MPI an opportunity to minimally specify an  
> integration mechanism with system-level checkpointers. If the MPI  
> implementor does not wish to worry about these classes of CP/R  
> systems, they just don't implement the minimal set of calls. If they  
> do want the support, they implement the calls.

I agree that the point of case b) would be to relieve the MPI library  
of the data movement/saving/restoring part. However, one could argue  
that MPI is also well-placed to do this part (using MPI-IO for  
example). I am not  
convinced that this will give an opportunity to *minimally* specify  
the integration mechanism. This would be minimal in the number of  
calls to implement, but the cost of these calls would be nothing but  

> A single implementation of PREPARE and RESTORE would support most  
> system-level checkpointers. For example our checkpointer that sits  
> completely in user-land and BLCR that sits completely in the OS  
> would have the same set of requirements. VMM level checkpointers  
> have been suggested as being different. Wouldn't they either have  
> the same requirements, or be completely transparent to the MPI stack  
> i.e. snapshotting the OS, communication device, etc and allowing the  
> protocol to sort out lost messages?

Moreover, for an integration mechanism to support most system-level  
checkpointers, it seems to me that you imply that PREPARE/RESTORE  
would be generic. Then we need to define more precisely what they  
would do. Alexander's current proposal is to have a vague definition  
(like PREPARE puts MPI in a checkpointable state, RESTORE restores  
communications, and MPI is authorized to do something that would not  
be checkpointable), and to let implementations find a way to implement  
this specification. We would end up with implementation /  
checkpointing system pairs, as Alexander said. So I don't really see  
the benefit compared to case a) (where we would also have  
implementation / checkpointing system pairs).
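
As an illustration of how loose that definition is, here is a toy Python sketch of the PREPARE / checkpoint / RESTORE sequence. Everything in it is hypothetical (`ToyMpiLibrary`, `system_checkpoint`, `restart_from`), and the choice that PREPARE drains in-flight messages and closes the network is only one possible reading; a different checkpointer could need something else, which is the point being argued.

```python
import copy


class ToyMpiLibrary:
    """Stand-in for an MPI library; not a real MPI binding."""

    def __init__(self):
        self.network_open = True             # state the checkpointer cannot save
        self.in_flight = ["msg-0", "msg-1"]  # messages still on the wire
        self.delivered = []

    def prepare(self):
        """Hypothetical PREPARE: put the library in a checkpointable state."""
        # Drain the network so that no message exists only on the wire.
        self.delivered.extend(self.in_flight)
        self.in_flight.clear()
        # Release what this particular checkpointer cannot capture.
        self.network_open = False

    def restore(self):
        """Hypothetical RESTORE: re-establish communications."""
        self.network_open = True


def system_checkpoint(lib):
    """Stand-in for an external checkpointer that only sees process memory."""
    if lib.network_open or lib.in_flight:
        raise RuntimeError("not checkpointable; call prepare() first")
    return copy.deepcopy(vars(lib))


def restart_from(image):
    """Rebuild a library instance from a saved image, then RESTORE it."""
    lib = ToyMpiLibrary()
    vars(lib).update(copy.deepcopy(image))
    lib.restore()
    return lib


if __name__ == "__main__":
    lib = ToyMpiLibrary()
    lib.prepare()
    image = system_checkpoint(lib)
    lib.restore()
    restarted = restart_from(image)
    print(restarted.delivered, restarted.network_open)
```

Note that `system_checkpoint` encodes what "checkpointable" means for one mechanism only; swapping the checkpointer changes that test, and with it what `prepare` has to do.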

> cheers,
> Joe

