[Mpi3-ft] system-level C/R requirements

Fri Oct 24 15:33:37 CDT 2008

PS. Better still, add a communicator argument to these calls and be
happy: both extremes, as well as anything in between, will then be
covered. How much is supported is again up to the implementation.

-----Original Message-----
From: Supalov, Alexander 
Sent: Friday, October 24, 2008 10:32 PM
To: 'MPI 3.0 Fault Tolerance and Dynamic Process Control working Group'
Subject: RE: [Mpi3-ft] system-level C/R requirements

Hi everybody,

I'm afraid we're overcomplicating things a little here. What we need are
basically two collective calls:

MPI_PREPARE_FOR_CHECKPOINT
MPI_RESTART_AFTER_CHECKPOINT

The former is (almost) like MPI_Finalize, the latter is (almost) like
MPI_Init. What they mean is up to the implementation, with one
condition: it must be possible to do actual checkpoint/restart in
between.
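
[Editorial sketch, not part of the original message.] The contract described above can be modelled as a small state machine. This is only a toy illustration under stated assumptions: `MpiModel`, `State`, `send`, `checkpoint`, and the `in_flight` counter are hypothetical names invented here, and neither call exists in any MPI standard or binding. It shows only the single condition from the proposal, that a checkpoint must be possible between the two calls.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"      # normal operation, communication allowed
    QUIESCED = "quiesced"  # between the two calls; safe to checkpoint


class MpiModel:
    """Toy stand-in for an MPI library; not a real MPI binding."""

    def __init__(self):
        self.state = State.ACTIVE
        self.in_flight = 0  # messages handed to the "network", not yet delivered

    def send(self):
        if self.state is not State.ACTIVE:
            raise RuntimeError("communication attempted while quiesced")
        self.in_flight += 1

    def prepare_for_checkpoint(self):
        # Hypothetical analogue of MPI_PREPARE_FOR_CHECKPOINT: drain traffic
        # and release whatever the implementation cannot checkpoint.
        self.in_flight = 0
        self.state = State.QUIESCED

    def checkpoint(self):
        # The one condition from the proposal: a checkpoint must be possible
        # between the two calls. This model refuses it anywhere else.
        if self.state is not State.QUIESCED:
            raise RuntimeError("checkpoint outside prepare/restart window")
        return {"in_flight": self.in_flight}

    def restart_after_checkpoint(self):
        # Hypothetical analogue of MPI_RESTART_AFTER_CHECKPOINT.
        self.state = State.ACTIVE


mpi = MpiModel()
mpi.send()                      # ordinary traffic before the checkpoint
mpi.prepare_for_checkpoint()    # every process would call this collectively
image = mpi.checkpoint()        # the external checkpointer runs here
mpi.restart_after_checkpoint()  # communication may resume afterwards
mpi.send()
```

What "drain" and "release" actually involve (memory registration, connections, the checkpointer in use) is exactly the part the proposal leaves to the implementation, so the model deliberately reduces it to a counter reset.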

I cannot exclude that the exact meaning of the calls and the notice will
be influenced by implementation details like memory registration, the
checkpoint/restart system used, the network involved, etc.

These collective calls may be complemented by individual, non-collective
calls if needed. Those will be suitable for individual
checkpoint/restart, and the user will have to make sure nothing bad
happens, such as a message trying to reach a process whose memory is
currently being dumped.

Best regards.

Alexander

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org
[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg
Bronevetsky
Sent: Friday, October 24, 2008 10:21 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] system-level C/R requirements

>Agreed -- specifying an explicit list of platforms or OS or even 
>resource specifics is not the way to go in a standard.
>
>My suggestion would be to explore if we can define abstract, 
>higher-level resources to define a "state", and specify high-level 
>actions. For instance, pinning/unpinning memory is very specific to 
>RDMA, but maybe a "disconnect virtual connection" operation may 
>abstract it. But this puts us into the realm of virtualizing MPI 
>internal components/concepts ..
>
>Maybe there is a more elegant way ...

The thing that worries me is that an MPI implementation may have a 
fair amount of state sitting on the network card. This state is 
unreachable by a user- or kernel-level checkpointer but may be 
reachable by a VMM-level checkpointer. How do we differentiate the 
level at which we're working? System-level checkpointers working at 
different levels need MPI state to be flushed to different levels of 
abstraction and it seems that we'll need to be very low-level in 
order to define what it means to operate at a given level of
abstraction.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov 
