[Mpi3-ft] Communicator Virtualization as a step forward
Supalov, Alexander
alexander.supalov at intel.com
Wed Feb 25 08:18:54 CST 2009
I want to be able to tell, in a standard manner, where in my MPI application the checkpointer may kick in, in any of the many ways that exist. That's what standard interfaces are about. What they do under the hood is a different matter.
When checkpointers are ready for this, they will define their own standard interface to use, and provide a set of requirements for those interfaces to work with the MPI checkpointing calls. The MPI implementations will probably comply in due time.
This is like calling MPI_Init when I'm ready for a process to become part of the MPI job. I don't care how this happens; I just need it to happen here. Again, a standard interface provides this capability.
I don't think that the intrusive interface that you mentioned will fly. The start/stop one may.
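For concreteness, here is a minimal C sketch of how an application might bracket a checkpoint with the two start/stop calls discussed further down in this thread. MPI_Prepare_for_checkpoint and MPI_Restore_after_checkpoint are proposed names only, not part of any MPI standard, and save_state stands in for whatever the checkpointing step actually does:

    #include <stdio.h>

    /* The two calls under discussion -- proposed names only, NOT part
       of any MPI standard; prototypes assumed here for illustration. */
    int MPI_Prepare_for_checkpoint(void);
    int MPI_Restore_after_checkpoint(void);

    /* Stand-in for whatever the checkpointing step actually does. */
    static void save_state(const double *state, int n, const char *path)
    {
        FILE *f = fopen(path, "wb");
        if (f) {
            fwrite(state, sizeof(double), (size_t)n, f);
            fclose(f);
        }
    }

    void take_checkpoint(const double *state, int n, const char *path)
    {
        /* Bring MPI into a state acceptable to the checkpointer
           involved (quiescence etc.; implementation-defined). */
        MPI_Prepare_for_checkpoint();

        /* Whatever is needed to do the actual checkpointing; with a
           resident monitoring checkpointer this block may be empty. */
        save_state(state, n, path);

        /* Restore MPI to a functionally equivalent state, possibly
           on a different set of nodes after a restart. */
        MPI_Restore_after_checkpoint();
    }
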
-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
Sent: Thursday, February 19, 2009 5:25 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
I can think of two APIs. One is something very intrusive, where the
application participates in logging every single message and
non-deterministic event (there are different ways to do so, depending
on the protocol). This would require nasty things such as a callback
for every incoming and outgoing low-level message.
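To make the intrusiveness concrete, a registration interface for such callbacks might look like the sketch below; every name in it is invented for illustration, and nothing like it exists in any MPI:

    #include <mpi.h>

    /* Hypothetical, invented for illustration only: the application
       would have to be handed every low-level message so it can log
       payloads and nondeterministic delivery orders itself. */
    typedef int (*MPIX_Msg_cb)(const void *buf, int count,
                               MPI_Datatype type, int peer, int tag,
                               void *user_data);

    int MPIX_Register_send_callback(MPIX_Msg_cb cb, void *user_data);
    int MPIX_Register_recv_callback(MPIX_Msg_cb cb, void *user_data);

    /* This logger would run on every single message -- which is
       exactly why this option looks too intrusive to fly. */
    static int log_message(const void *buf, int count, MPI_Datatype type,
                           int peer, int tag, void *user_data)
    {
        /* append (peer, tag, payload) to a local log here */
        return MPI_SUCCESS;
    }
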
The other API is to just let MPI take care of it with no support or
interference from the application. This is, frankly, the only
realistic option. In your response to Thomas' email you explained in
great detail how an API for synch-and-stop checkpointing would work
and how the MPI implementation would negotiate the relevant details
with the checkpointer, but what you didn't explain is why we need a
standardized application-level API to begin with. Why can't MPI and
the checkpointer do everything under the covers? We already have a
number of MPI implementations and checkpointers that do exactly this. What's wrong with them?
What is wrong is that this is hard for checkpointing vendors to do
and they want MPI to provide them with a simple quiescence interface
(this was the main motivation for the checkpointing API). However, as
we've already determined, this cannot be specified inside the MPI
spec. And so I ask again, what's the point of standardizing the two
calls you wish to standardize?
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov
http://greg.bronevetsky.com
At 03:31 AM 2/19/2009, Supalov, Alexander wrote:
>Thanks. What would be the best API for this message logging
>approach? It looks a little bit like a journaling file system, doesn't it?
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
>Sent: Wednesday, February 18, 2009 8:10 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
>Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
>
> >First, this API works only for the synch-and-stop protocol. It does
> >not support any other protocol and this will lead into bad places in
> >the future because this protocol scales poorly with increasing
> >failure rates since it requires all processes to participate in every
> >single recovery. I can discuss the reasons for my assertion in more
> >detail if you'd like.
> >
> >AS> What other protocols are you talking about? I did not say
> >anywhere whether these calls synchronize anything, or whether they
> >are collective, or whether the checkpointer is called on all nodes.
> >This remains to be defined. If you would please define other
> >scenarios, including those that scale well, I bet we'd be able to
> >map them to these calls one way or another.
>OK, let's look at message logging. Each process checkpoints
>independently but logs all non-deterministic events and the data of
>sent or received messages (details and overheads depend on the protocol).
>When a process checkpoints, we can erase all the logged non-determinism
>or message data that relates to events before this checkpoint. If
>some process fails, it rolls back to its last checkpoint and all of
>the other processes resend to it 1. the data of all messages they sent
>to it since that checkpoint and 2. the outcomes of all the
>nondeterministic events that it experienced since that checkpoint.
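>
>To make the recovery step concrete, here is a rough sketch in C; all
>names are invented for illustration (this is not MPICH-V's actual
>interface):
>
>    enum entry_kind { MSG_DATA, NONDET_EVENT };
>
>    struct log_entry {        /* minimal log record, hypothetical */
>        enum entry_kind kind;
>        int dest;             /* destination rank of a logged message */
>        int owner;            /* rank that experienced a logged event */
>        /* payload or event outcome would live here */
>    };
>
>    /* hypothetical helpers */
>    void resend(const struct log_entry *e);
>    void send_outcome(int rank, const struct log_entry *e);
>
>    /* When rank r fails, it rolls back to its last checkpoint and
>       every other rank replays what r needs from its local log.
>       Entries older than r's last checkpoint were already erased
>       when that checkpoint completed. */
>    void recover_peer(int r, const struct log_entry *log, int nlog)
>    {
>        for (int i = 0; i < nlog; i++) {
>            if (log[i].kind == MSG_DATA && log[i].dest == r)
>                resend(&log[i]);            /* 1. resend message data */
>            else if (log[i].kind == NONDET_EVENT && log[i].owner == r)
>                send_outcome(r, &log[i]);   /* 2. replay event outcomes */
>        }
>    }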
>
>I know how to implement this either inside MPI or above MPI with the
>aid of the fault notification API. These things already exist, for
>example in MPICH-V. I can also imagine an API that lets the
>application ask MPI to run this protocol on its behalf. However, this
>API will not require quiescence or stable states, or anything like
>that. In particular, quiescence is completely incompatible with
>message logging. Realistically speaking, this protocol can only be
>provided to the application without any direct interaction with the
>application. This is fine, but it also suggests that we don't need
>much of an API to support it. The same applies to other protocols
>that need to communicate before, after and/or during the checkpoint.
>
>
>As for your other points, Thomas' email covers roughly the same
>points that I would have made.
>
>Greg Bronevetsky
>Post-Doctoral Researcher
>1028 Building 451
>Lawrence Livermore National Lab
>(925) 424-5756
>bronevetsky1 at llnl.gov
>http://greg.bronevetsky.com
>
> >Second, this API is not compatible with application-level
> >checkpointing. An application-level checkpointer can deal with
> >internal MPI state in one of two ways. One option is to checkpoint it
> >at a high level, using only the existing MPI calls and performing the
> >appropriate tracking of messages at calls to MPI_Send and MPI_Recv.
> >The simplest example of this is to ensure that there is no
> >communication going on at the time of the checkpoint, and then save
> >each process' state immediately after calling MPI_Barrier(). If this
> >is the option used, there is no need for any API extensions.
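> >
> >A minimal sketch of that simplest case, using only standard MPI
> >calls (the save routine and file naming are illustrative):
> >
> >    #include <stdio.h>
> >    #include <mpi.h>
> >
> >    /* Save each process's state at a common quiescent point. The
> >       application must itself ensure no messages are in flight;
> >       MPI_Barrier alone does not drain communication in general. */
> >    void checkpoint_at_barrier(const double *state, int n, int rank)
> >    {
> >        char path[64];
> >        FILE *f;
> >
> >        MPI_Barrier(MPI_COMM_WORLD);    /* everyone reaches the point */
> >
> >        snprintf(path, sizeof path, "ckpt.%d", rank);
> >        f = fopen(path, "wb");          /* save local state */
> >        if (f) {
> >            fwrite(state, sizeof(double), (size_t)n, f);
> >            fclose(f);
> >        }
> >
> >        MPI_Barrier(MPI_COMM_WORLD);    /* all saved before resuming */
> >    }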
> >
> >AS> One option for application-level checkpointing is to save the
> >application's specific state (those two vectors) without saving the
> >state of the MPI at all. This will work if the MPI state is quiet
> >(no connections, nothing), the application can be sure that its
> >state will be good, and, when it restarts, the MPI hitting the
> >MPI_restore_after_checkpoint will know how to re-establish all
> >connections, etc. I think this is more than an MPI_Barrier, isn't it?
> >
> >The second option is for the application to somehow save MPI's internal
> >data structures, which is what the MPI_Prepare_for_checkpoint call is
> >for: to make sure that all MPI state is in locations accessible by
> >the application. Where is that? Well, it depends on the operating
> >system and the hardware. Does the application need to checkpoint and
> >restore MPI's open file handles or shared memory regions or is MPI
> >supposed to let those go during MPI_Prepare_for_checkpoint? The point
> >is, all these details cannot be defined in the MPI spec and if this
> >is the case, the call will not be sufficiently well-defined to be
> >used by an application-level checkpointer. However, it can be used by
> >the application to initiate a system-level checkpoint since in this
> >case all those details will not be the application's problem.
> >
> >AS> As for the place where all relevant MPI data is to be stored at
> >this point in time, we should not care - the MPI implementation and
> >the checkpointer involved will agree on this when the MPI is
> >configured to use this particular checkpointer, either statically or
> >dynamically. The checkpointer will know that the data is in the
> >"right" place when it gets control after the
> >MPI_prepare_for_checkpoint call.
> >
> >Finally, it does not support system-initiated checkpointing. If the
> >system wants to start a checkpoint in the middle of an MPI_Send()
> >call, this API provides no help. If we care about this use-case we
> >need to allow checkpoints to be taken at all times or provide a way
> >to register a callback that tells the system when a checkpoint can
> >next be taken.
> >
> >AS> Then let's simply add this callback registration call, say,
> >MPI_REGISTER_CHECKPOINT_CALLBACK, that will define the call to be
> >made when MPI_Prepare_for_checkpoint is about to return control to
> >the application. I guess this will address your concern, won't it?
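> >
> >In C, that registration might look like the following sketch; the
> >name and signature are proposals for discussion, not standard MPI:
> >
> >    /* Proposed, not standard MPI: register a callback that the
> >       implementation invokes when a checkpoint can next be taken,
> >       i.e. as MPI_Prepare_for_checkpoint is about to return. */
> >    typedef void (*MPI_Checkpoint_callback)(void *user_data);
> >
> >    int MPI_Register_checkpoint_callback(MPI_Checkpoint_callback cb,
> >                                         void *user_data);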
> >
> >The bottom line, given the above issues, is that the only benefit
> >this API has is providing a common set of names for functions that
> >have poor semantics and limited functionality. This benefit does not
> >pass the bar.
> >
> >AS> See above. I think you raise valid points, but all of them can
> >be adequately addressed given the desire to provide the
> >checkpoint-restart capability.
> >
> >If I'm allowed to digress here and refer back to the fault
> >notification, note that the checkpoint/restart capability is
> >something that does guarantee fault tolerance, even in the case when
> >fatal, unrecoverable errors occur. You "simply" fix the machine and
> >restart from an earlier checkpoint.
> >
> >As such, this widely used and proven capability may be a safe first
> >bet for the FT WG to introduce as something people will like and
> >will use. Much more likely, in fact, than the notification stuff
> >that has yet to be properly defined and that, as currently
> >envisioned, with all that uncertain communicator restoration stuff,
> >will require substantial application rethinking and reprogramming
> >to be used.
> >
> >After all, this is only a matter of priorities. Going for
> >low-hanging fruit like checkpoint/restart may be unromantic, but it's
> >definitely very pragmatic and sound. And in any case, I call upon
> >the same set of criteria to be used for all things that we consider.
> >Usability in the sense that a feature will be used by commercial
> >apps is definitely one of them. Or so I think.
> >
> >Greg Bronevetsky
> >Post-Doctoral Researcher
> >1028 Building 451
> >Lawrence Livermore National Lab
> >(925) 424-5756
> >bronevetsky1 at llnl.gov
> >http://greg.bronevetsky.com
> >
> >At 09:27 AM 2/18/2009, Supalov, Alexander wrote:
> > >I see. Let me explain what I think those two calls would do to
> > >facilitate checkpointing.
> > >
> > >The program piece would look as follows:
> > >
> > >CALL MPI_PREPARE_FOR_CHECKPOINT
> > >
> > >! Whatever calls are necessary to do the actual checkpointing,
> > >including, in the case of a resident monitoring checkpointer, none.
> > >
> > >CALL MPI_RESTORE_AFTER_CHECKPOINT
> > >
> > >The first call will bring MPI into a state that is acceptable to the
> > >checkpointer involved. This may include quiescence, etc.
> > >
> > >Then the checkpointer will do its job. Note that this may be an
> > >application-level checkpointer that saves two important arrays. Or
> > >this may be a system checkpointer that dumps the memory, etc.
> > >Finally, this can be a checkpointer that is notified when
> > >MPI_PREPARE_FOR_CHECKPOINT is about to return and does the job in
> > >some way without the program calling it.
> > >
> > >The second call will restore the MPI state to what it was,
> > >functionally, after the checkpoint. "Functionally" here means that
> > >this may include running on a different set of nodes, etc.
> > >
> > >Is this the scenario that you think does not pass the bar? If so,
> > >in what way?
> > >
> > >-----Original Message-----
> > >From: mpi3-ft-bounces at lists.mpi-forum.org
> > >[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
> > >Sent: Wednesday, February 18, 2009 6:20 PM
> > >To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
> > >Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> > >Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> > >
> > >My point is that each MPI implementation will have to negotiate with
> > >each checkpointer what this call will mean on each platform, meaning
> > >that it is up to the implementation-specific call to do all that
> > >magic. It doesn't save the application any real effort to call this
> > >routine by a unified name rather than its implementation-specific
> > >name, given that the other two major problems are not solved:
> > >- Providing a uniform way for checkpointers and MPI implementations
> > >to interact (THE major reason for having this proposal in the first
> > >place) and
> > >- Support for more than the synch-and-stop checkpointing protocol,
> > >which scales poorly with rising failure rates.
> > >
> > >My point here is that covering one of the three major issues is not
> > >good enough to modify the MPI specification and I don't think that
> > >we'll ever be able to do more than that. We're pretty much agreed in
> > >the working group that the first point above cannot be done and I
> > >don't think that the second can be done without providing callbacks
> > >to be called for each incoming/outgoing message.
> > >
> > >Greg Bronevetsky
> > >Post-Doctoral Researcher
> > >1028 Building 451
> > >Lawrence Livermore National Lab
> > >(925) 424-5756
> > >bronevetsky1 at llnl.gov
> > >http://greg.bronevetsky.com
> > >
> > >At 08:40 AM 2/18/2009, Supalov, Alexander wrote:
> > > >Thanks. How will the checkpointer-specific call talk to the MPI?
> > > >
> > > >-----Original Message-----
> > > >From: mpi3-ft-bounces at lists.mpi-forum.org
> > > >[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg
> Bronevetsky
> > > >Sent: Wednesday, February 18, 2009 5:29 PM
> > > >To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
> > > >Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> > > >Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> > > >
> > > >
> > > > >Thanks. How will you let the MPI know the checkpoint is coming, to
> > > > >give it a fair chance to prepare for this and then recover after the
> > > > >checkpoint? This is akin to the MPI_Finalize/MPI_Init in some sense,
> > > > >midway through the job, hence the analogy.
> > > >
> > > >Just use the checkpointer-specific call. The call is going to have
> > > >checkpointer-specific semantics, so why not give it a
> > > >checkpointer-specific name? I understand that there is some use in
> > > >allowing applications to use the same name across all checkpointers,
> > > >but the bar should be higher than that for adding something to the
> > > >standard. Also, right now the whole approach inherently only supports
> > > >one checkpointing protocol: synch-and-stop. If we can work out a more
> > > >generic API that supports other protocols, I think it may have
> > > >enough value to be included in the spec. Right now it still hasn't
> > > >passed the bar.
> > > >
> > > >Greg Bronevetsky
> > > >Post-Doctoral Researcher
> > > >1028 Building 451
> > > >Lawrence Livermore National Lab
> > > >(925) 424-5756
> > > >bronevetsky1 at llnl.gov
> > > >http://greg.bronevetsky.com