[Mpi3-ft] Communicator Virtualization as a step forward

Greg Bronevetsky bronevetsky1 at llnl.gov
Wed Feb 18 11:42:36 CST 2009


First, this API works only for the synch-and-stop protocol. It does 
not support any other protocol, and that will cause problems in the 
future: synch-and-stop scales poorly with increasing failure rates 
because it requires all processes to participate in every single 
recovery. I can discuss the reasons for my assertion in more detail 
if you'd like.
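
As a rough illustration of the scaling concern, here is a 
back-of-the-envelope model. It rests on assumptions that are mine for 
the sake of the sketch: failures arrive independently at a fixed 
per-process rate, and the alternative is a hypothetical localized 
protocol in which only the failed process rolls back. The function 
names and the rate of 1e-4 failures per process-hour are made up.

```python
def rollbacks_per_hour(num_procs, failures_per_proc_hour, procs_per_recovery):
    """Expected number of process rollbacks per hour across the whole job."""
    job_failures_per_hour = num_procs * failures_per_proc_hour
    return job_failures_per_hour * procs_per_recovery


def compare(num_procs, failures_per_proc_hour=1e-4):
    """Return (synch-and-stop cost, localized cost) in rollbacks per hour."""
    # Synch-and-stop: every process takes part in every recovery.
    global_cost = rollbacks_per_hour(num_procs, failures_per_proc_hour, num_procs)
    # Hypothetical localized protocol: only the failed process rolls back.
    local_cost = rollbacks_per_hour(num_procs, failures_per_proc_hour, 1)
    return global_cost, local_cost


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        g, l = compare(n)
        print(f"{n:>7} procs: synch-and-stop={g:.1f}  localized={l:.1f}")
```

Under this simple model the synch-and-stop cost grows with the square 
of the process count, while the localized cost grows linearly.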

Second, this API is not compatible with application-level 
checkpointing. An application-level checkpointer can deal with 
internal MPI state in one of two ways. One option is to checkpoint it 
at a high-level by using only the existing MPI calls, performing the 
appropriate tracking of messages at calls to MPI_Send and MPI_Recv. 
The simplest example of this is to ensure that there is no 
communication going on at the time of the checkpoint, and then save 
each process' state immediately after calling MPI_Barrier(). If this 
is the option used, there is no need for any API extensions.
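
A minimal sketch of that simplest case, written as a simulation rather 
than real MPI code: threads stand in for processes, 
queue.Queue stands in for message passing, and threading.Barrier 
stands in for MPI_Barrier(). The names run_job, inbox and checkpoints 
are hypothetical. Each "rank" completes every receive that matches a 
send before reaching the barrier, so nothing is in flight when the 
state is saved.

```python
import queue
import threading


def run_job(num_procs=4, steps=3):
    """Ring exchange followed by a barrier-aligned checkpoint (simulation)."""
    inbox = [queue.Queue() for _ in range(num_procs)]  # one mailbox per rank
    barrier = threading.Barrier(num_procs)             # stand-in for MPI_Barrier()
    checkpoints = {}                                   # rank -> saved state
    lock = threading.Lock()

    def worker(rank):
        state = rank
        right = (rank + 1) % num_procs
        for _ in range(steps):
            inbox[right].put(state)        # "send" to the right neighbour
            state += inbox[rank].get()     # matching "receive" from the left
        # Every send above has been matched by a receive, so no messages
        # are outstanding once all ranks have reached the barrier.
        barrier.wait()
        with lock:
            checkpoints[rank] = state      # save state right after the barrier

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    in_flight = sum(q.qsize() for q in inbox)  # messages left undelivered
    return checkpoints, in_flight


if __name__ == "__main__":
    saved, in_flight = run_job()
    print("checkpoints:", saved, "in flight:", in_flight)
```

Note that the barrier by itself does not drain messages; the sketch 
relies on the application having completed its communication first, 
which is the condition stated above.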

The second option is for the application to somehow save MPI internal 
data structures, which is what the MPI_Prepare_for_checkpoint call is 
for: to make sure that all MPI state is in locations accessible by 
the application. Where is that? Well, it depends on the operating 
system and the hardware. Does the application need to checkpoint and 
restore MPI's open file handles or shared memory regions or is MPI 
supposed to let those go during MPI_Prepare_for_checkpoint? The point 
is, all these details cannot be defined in the MPI spec and if this 
is the case, the call will not be sufficiently well-defined to be 
used by an application-level checkpointer. However, it can be used by 
the application to initiate a system-level checkpoint since in this 
case all those details will not be the application's problem.

Finally, it does not support system-initiated checkpointing. If the 
system wants to start a checkpoint in the middle of an MPI_Send() 
call, this API provides no help. If we care about this use-case we 
need to allow checkpoints to be taken at all times or provide a way 
to register a callback that tells the system when a checkpoint can 
next be taken.
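
To make the callback idea concrete, here is a small single-threaded 
sketch. Everything in it is hypothetical: CheckpointGate, enter_call, 
leave_call and request_checkpoint are illustrative names, not part of 
MPI or of any proposal. The library marks when it is inside a call, 
and a checkpoint requested during that window is deferred until the 
call returns.

```python
class CheckpointGate:
    """Hypothetical tracker of when a checkpoint may safely be taken."""

    def __init__(self):
        self._in_call = 0    # nesting depth of library calls in progress
        self._pending = []   # callbacks waiting for the next safe point

    def enter_call(self):
        """Library side: a communication call (e.g. a send) has started."""
        self._in_call += 1

    def leave_call(self):
        """Library side: the call finished; run any deferred callbacks."""
        self._in_call -= 1
        if self._in_call == 0:
            pending, self._pending = self._pending, []
            for callback in pending:
                callback()

    def request_checkpoint(self, callback):
        """System side: run callback now if safe, else at the next safe point."""
        if self._in_call == 0:
            callback()
        else:
            self._pending.append(callback)


if __name__ == "__main__":
    gate = CheckpointGate()
    events = []
    gate.enter_call()                                   # in the middle of a send
    gate.request_checkpoint(lambda: events.append("checkpoint"))
    events.append("send finished")
    gate.leave_call()                                   # safe point reached
    print(events)
```

A real implementation would also have to deal with threads, 
nonblocking operations and messages still in flight, none of which 
this sketch attempts.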

The bottom line given the above issues is that the only benefit this 
API does have is providing a common set of names for functions that 
have poor semantics and limited functionality. This benefit does not 
pass the bar.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov
http://greg.bronevetsky.com

At 09:27 AM 2/18/2009, Supalov, Alexander wrote:
>I see. Let me explain what I think those two calls would do to 
>facilitate checkpointing.
>
>The program piece would look as follows:
>
>CALL MPI_PREPARE_FOR_CHECKPOINT
>
>! Whatever calls are necessary to do the actual checkpointing,
>! including, in the case of a resident monitoring checkpointer, none.
>
>CALL MPI_RESTORE_AFTER_CHECKPOINT
>
>The first call will bring MPI into a state that is acceptable to the 
>checkpointer involved. This may include quiescence, etc.
>
>Then the checkpointer will do its job. Note that this may be an 
>application-level checkpointer that saves two important arrays. Or 
>this may be a system checkpointer that dumps the memory, etc. 
>Finally, this can be a checkpointer that is notified when the 
>MPI_PREPARE_FOR_CHECKPOINT is about to leave and does the job in 
>some way without the program calling it.
>
>The second call will restore the MPI state to what it was, 
>functionally, after the checkpoint. "Functionally" here means that 
>this may include running on a different set of nodes, etc.
>
>Is it this scenario that you think does not pass the bar? If so, in what way?
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org 
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
>Sent: Wednesday, February 18, 2009 6:20 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working 
>Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
>My point is that each MPI implementation will have to negotiate with
>each checkpointer what this call will mean on each platform, meaning
>that it is up to the implementation-specific call to do all that
>magic. It doesn't save the application any real effort to call this
>routine by its implementation-specific name rather than a unified
>name given that the other two major problems are not solved:
>- Providing a uniform way for checkpointers and MPI implementations
>to interact (THE major reason for having this proposal in the first
>place) and
>- Support for more than the synch-and-stop checkpointing protocol,
>which scales poorly with rising failure rates.
>
>My point here is that covering one of the three major issues is not
>good enough to modify the MPI specification and I don't think that
>we'll ever be able to do more than that. We're pretty much agreed in
>the working group that the first point above cannot be done and I
>don't think that the second can be done without providing callbacks
>to be called for each incoming/outgoing message.
>
>Greg Bronevetsky
>Post-Doctoral Researcher
>1028 Building 451
>Lawrence Livermore National Lab
>(925) 424-5756
>bronevetsky1 at llnl.gov
>http://greg.bronevetsky.com
>
>At 08:40 AM 2/18/2009, Supalov, Alexander wrote:
> >Thanks. How will the checkpointer-specific call talk to the MPI?
> >
> >-----Original Message-----
> >From: mpi3-ft-bounces at lists.mpi-forum.org
> >[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
> >Sent: Wednesday, February 18, 2009 5:29 PM
> >To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
> >Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> >Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> >
> >
> > >Thanks. How will you let the MPI know the checkpoint is coming, to
> > >give it a fair chance to prepare for this and then recover after the
> > >checkpoint? This is akin to the MPI_Finalize/MPI_Init in some sense,
> > >midway thru the job, hence the analogy.
> >
> >Just use the checkpointer-specific call. The call is going to have
> >checkpointer-specific semantics, so why not give it a
> >checkpointer-specific name? I understand that there is some use to
> >allowing applications to use the same name across all checkpointers
> >but the bar should be higher than that for adding something to the
> >standard. Also, right now the whole approach inherently only supports
> >one checkpointing protocol: synch-and-stop. If we can work out a more
> >generic API that supports other protocols I think that it may have
> >enough value to be included in the spec. Right now it still hasn't
> >passed the bar.
> >
> >Greg Bronevetsky
> >Post-Doctoral Researcher
> >1028 Building 451
> >Lawrence Livermore National Lab
> >(925) 424-5756
> >bronevetsky1 at llnl.gov
> >http://greg.bronevetsky.com
> >
> >_______________________________________________
> >mpi3-ft mailing list
> >mpi3-ft at lists.mpi-forum.org
> >http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >---------------------------------------------------------------------
> >Intel GmbH
> >Dornacher Strasse 1
> >85622 Feldkirchen/Muenchen Germany
> >Sitz der Gesellschaft: Feldkirchen bei Muenchen
> >Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
> >Registergericht: Muenchen HRB 47456 Ust.-IdNr.
> >VAT Registration No.: DE129385895
> >Citibank Frankfurt (BLZ 502 109 00) 600119052
> >
> >This e-mail and any attachments may contain confidential material for
> >the sole use of the intended recipient(s). Any review or distribution
> >by others is strictly prohibited. If you are not the intended
> >recipient, please contact the sender and delete all copies.



