[Mpi3-ft] system-level C/R requirements

Greg Bronevetsky bronevetsky1 at llnl.gov
Fri Oct 24 15:38:01 CDT 2008

My problem is that while this has clean semantics for an 
application-level checkpointer, the same is not true for a 
system-level checkpointer. In the latter case the checkpointer can 
capture only an unspecified subset of MPI state (e.g., main memory 
state but not network card state). As such, 
MPI_PREPARE_FOR_CHECKPOINT would essentially tell MPI to pull all of 
its state into the subset that can be checkpointed by the 
system-level checkpointer. However, the subset depends closely on the 
type of checkpointing being performed, which itself is very 
system-specific. As such, I don't know how to provide semantics for 
this call without using low-level language.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov

At 01:32 PM 10/24/2008, Supalov, Alexander wrote:
>Hi everybody,
>I'm afraid we're overcomplicating things a little here. What we need are
>basically two collective calls:
>The former is (almost) like MPI_Finalize, the latter is (almost) like
>MPI_Init. What they mean is up to the implementation, with one
>condition: it must be possible to do actual checkpoint/restart in
>I cannot exclude that the exact meaning of the calls and the notice will
>be influenced by implementation details like memory registration, the
>checkpoint/restart system used, the network involved, etc.
>These collective calls may be complemented by individual, non-collective
>calls if needed. They will be suitable for individual
>checkpoint/restart, and the user will have to make sure no bad things
>happen, like messages trying to reach a process, the memory of which is
>currently being dumped.
>Best regards.
> >Agreed -- specifying an explicit list of platforms or OS or even
> >resource specifics is not the way to go in a standard.
> >
> >My suggestion would be to explore if we can define abstract,
> >higher-level resources to define a "state", and specify high-level
> >actions. For instance, pinning/unpinning memory is very specific to
> >RDMA, but maybe a "disconnect virtual connection" operation may
> >abstract it. But this puts us into the realm of virtualizing MPI
> >internal components/concepts ...
> >
> >Maybe there is a more elegant way ...
>The thing that worries me is that an MPI implementation may have a
>fair amount of state sitting on the network card. This state is
>unreachable by a user- or kernel-level checkpointer but may be
>reachable by a VMM-level checkpointer. How do we differentiate the
>level at which we're working? System-level checkpointers working at
>different levels need MPI state to be flushed to different levels of
>abstraction and it seems that we'll need to be very low-level in
>order to define what it means to operate at a given level of
