[Mpi3-ft] system-level C/R requirements

Greg Bronevetsky bronevetsky1 at llnl.gov
Fri Oct 24 15:38:01 CDT 2008


My problem is that while this proposal has clean semantics for an
application-level checkpointer, the same is not true for a
system-level checkpointer. A system-level checkpointer captures only
a subset of MPI state, and which subset is not defined in general
(e.g., main memory state but not network card state). As such,
MPI_PREPARE_FOR_CHECKPOINT would essentially tell MPI to pull all of
its state into the subset that the system-level checkpointer can
capture. However, that subset depends closely on the type of
checkpointing being performed, which is itself very system-specific.
As such, I don't know how to provide semantics for this call without
resorting to low-level language.
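
To make the concern concrete, here is a minimal sketch. It is a toy
model, not a proposed API: the Location names, the three capture sets,
the example state items, and the prepare_for_checkpoint function are
all hypothetical, and the assumption that a VMM-level checkpointer can
reach network card state is only a possibility, not a given. The point
is that the work the call must do changes with the capture set, which
is exactly the system-specific detail we would have to name.

```python
from enum import Flag, auto


class Location(Flag):
    """Hypothetical places where a piece of MPI state may live."""
    MAIN_MEMORY = auto()
    KERNEL = auto()
    NETWORK_CARD = auto()


# Hypothetical capture sets for checkpointers working at different levels.
USER_LEVEL = Location.MAIN_MEMORY
KERNEL_LEVEL = Location.MAIN_MEMORY | Location.KERNEL
# Assumption: a VMM-level checkpointer can also reach network card state.
VMM_LEVEL = Location.MAIN_MEMORY | Location.KERNEL | Location.NETWORK_CARD


def prepare_for_checkpoint(state, capturable):
    """Pull every item the checkpointer cannot capture into main memory.

    `state` maps an item name to its current Location and is updated in
    place. Returns the names of the items that had to be moved.
    """
    moved = []
    for name, where in state.items():
        if not (where & capturable):
            state[name] = Location.MAIN_MEMORY
            moved.append(name)
    return moved


# Illustrative state items only; a real implementation will differ.
state = {
    "unexpected_message_queue": Location.MAIN_MEMORY,
    "socket_buffers": Location.KERNEL,
    "in_flight_rdma": Location.NETWORK_CARD,
}
```

Calling prepare_for_checkpoint(dict(state), USER_LEVEL) moves both the
kernel and network card items, while the same call with VMM_LEVEL moves
nothing. The call's meaning is therefore a function of the capture set.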

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov

At 01:32 PM 10/24/2008, Supalov, Alexander wrote:
>Hi everybody,
>
>I'm afraid we're overcomplicating things a little here. What we need are
>basically two collective calls:
>
>MPI_PREPARE_FOR_CHECKPOINT
>MPI_RESTART_AFTER_CHECKPOINT
>
>The former is (almost) like MPI_Finalize, the latter is (almost) like
>MPI_Init. What they mean is up to the implementation, with one
>condition: it must be possible to do actual checkpoint/restart in
>between.
>
>I cannot exclude that the exact meaning of the calls and the notice will
>be influenced by implementation details like memory registration, the
>checkpoint/restart system used, the network involved, etc.
>
>These collective calls may be complemented by individual, non-collective
>calls if needed. They will be suitable for individual
>checkpoint/restart, and the user will have to make sure no bad things
>happen, like messages trying to reach a process, the memory of which is
>currently being dumped.
>
>Best regards.
>
>Alexander
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg
>Bronevetsky
>Sent: Friday, October 24, 2008 10:21 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] system-level C/R requirements
>
>
> >Agreed -- specifying an explicit list of platforms or OS or even
> >resource specifics is not the way to go in a standard.
> >
> >My suggestion would be to explore if we can define abstract,
> >higher-level resources to define a "state", and specify high-level
> >actions. For instance, pinning/unpinning memory is very specific to
> >RDMA, but maybe a "disconnect virtual connection" operation may
> >abstract it. But this puts us into the realm of virtualizing MPI
> >internal components/concepts ...
> >
> >Maybe there is a more elegant way ...
>The thing that worries me is that an MPI implementation may have a
>fair amount of state sitting on the network card. This state is
>unreachable by a user- or kernel-level checkpointer but may be
>reachable by a VMM-level checkpointer. How do we differentiate the
>level at which we're working? System-level checkpointers working at
>different levels need MPI state to be flushed to different levels of
>abstraction and it seems that we'll need to be very low-level in
>order to define what it means to operate at a given level of
>abstraction.
>
>_______________________________________________
>mpi3-ft mailing list
>mpi3-ft at lists.mpi-forum.org
>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft



