[Mpi3-ft] system-level C/R requirements

Supalov, Alexander alexander.supalov at intel.com
Fri Oct 24 15:46:30 CDT 2008


Perhaps we should not try to define this at all, and should instead
encourage the checkpoint/restart people to work with particular MPI
implementations to make sure things work. After all, it's not about an
abstract MPI and an abstract C/R system - there will probably always be
actual pairs (or other combinations) of them that can work together.

Compare this to the situation with threads. MPI acknowledges their
existence and provides a couple of calls to request a particular level of
support, and that's all. This was good enough for starters, and it may
change in the future. The relationship between MPI and C/R may go the
same way: acknowledge first, integrate next.
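
For illustration, here is what the thread precedent looks like in code -
MPI_Init_thread and the MPI_THREAD_* levels below are standard MPI-2; the
idea that a C/R support level could be negotiated in the same style is only
an analogy at this point, not existing API:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Request the strongest thread support; the library reports what it
     * can actually provide, which may be weaker. A C/R support level
     * could conceivably be negotiated in the same request/provide style
     * (speculation, not current API). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "only thread level %d is provided\n", provided);

    MPI_Finalize();
    return 0;
}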

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org
[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg
Bronevetsky
Sent: Friday, October 24, 2008 10:38 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group;
MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] system-level C/R requirements

My problem is that while this has clean semantics for an 
application-level checkpointer, the same is not true for a 
system-level checkpointer. In the latter case the checkpointer can 
capture only a subset of the MPI state, and which subset is not well 
defined (e.g., main-memory state but not network-card state). As such, 
MPI_PREPARE_FOR_CHECKPOINT would essentially tell MPI to pull all of 
its state into the subset that the system-level checkpointer can 
capture. However, that subset depends closely on the type of 
checkpointing being performed, which is itself very system-specific. 
As such, I don't know how to specify semantics for this call without 
resorting to low-level language.
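
For concreteness, here is roughly where such a call would sit relative to
the checkpoint itself. This is only a sketch: neither routine below exists
in any MPI standard, the C spellings are assumed, and
take_system_checkpoint() is a stand-in for whatever entry point the
system-level checkpointer (BLCR, a VMM snapshot, etc.) actually provides.

#include <mpi.h>

extern int take_system_checkpoint(void);       /* hypothetical C/R hook  */
extern int MPI_Prepare_for_checkpoint(void);   /* proposed, not standard */
extern int MPI_Restart_after_checkpoint(void); /* proposed, not standard */

void checkpoint_here(void)
{
    /* Collectively quiesce MPI: drain or flush whatever state the library
     * cannot guarantee will survive the checkpoint (in-flight messages,
     * pinned memory, NIC state, ...). */
    MPI_Prepare_for_checkpoint();

    /* The system-level checkpointer captures the process image here. */
    take_system_checkpoint();

    /* Collectively rebuild the state torn down above; this runs both on
     * the original execution and after a restart from the saved image. */
    MPI_Restart_after_checkpoint();
}

The open question remains exactly what "quiesce" has to mean for a given
system-level checkpointer.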

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov

At 01:32 PM 10/24/2008, Supalov, Alexander wrote:
>Hi everybody,
>
>I'm afraid we're overcomplicating things a little here. What we need are
>basically two collective calls:
>
>MPI_PREPARE_FOR_CHECKPOINT
>MPI_RESTART_AFTER_CHECKPOINT
>
>The former is (almost) like MPI_Finalize, the latter is (almost) like
>MPI_Init. What they mean is up to the implementation, with one
>condition: it must be possible to do actual checkpoint/restart in
>between.
>
>I cannot rule out that the exact meaning of the calls and the notice will
>be influenced by implementation details such as memory registration, the
>checkpoint/restart system used, the network involved, etc.
>
>These collective calls may be complemented by individual, non-collective
>calls if needed. Those would be suitable for individual checkpoint/restart,
>and the user would have to make sure no bad things happen, such as messages
>trying to reach a process whose memory is currently being dumped.
>
>Best regards.
>
>Alexander
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg
>Bronevetsky
>Sent: Friday, October 24, 2008 10:21 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group;
>MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] system-level C/R requirements
>
>
> >Agreed -- specifying an explicit list of platforms, OSes, or even
> >resource specifics is not the way to go in a standard.
> >
> >My suggestion would be to explore whether we can define abstract,
> >higher-level resources to describe a "state", and specify high-level
> >actions. For instance, pinning/unpinning memory is very specific to
> >RDMA, but a "disconnect virtual connection" operation might abstract
> >it. That puts us into the realm of virtualizing MPI internal
> >components/concepts ...
> >
> >Maybe there is a more elegant way ...
>The thing that worries me is that an MPI implementation may have a
>fair amount of state sitting on the network card. This state is
>unreachable by a user- or kernel-level checkpointer but may be
>reachable by a VMM-level checkpointer. How do we differentiate the
>level at which we're working? System-level checkpointers working at
>different levels need MPI state to be flushed to different levels of
>abstraction, and it seems that we'll need to be very low-level in
>order to define what it means to operate at a given level of
>abstraction.
>
>Greg Bronevetsky
>Post-Doctoral Researcher
>1028 Building 451
>Lawrence Livermore National Lab
>(925) 424-5756
>bronevetsky1 at llnl.gov
>

_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft