[Mpi3-ft] system-level C/R requirements

Josh Hursey jjhursey at open-mpi.org
Mon Oct 27 07:44:03 CDT 2008


Just to throw my two cents in here.

I think the question of how to support checkpoint/restart in MPI is  
important. One question that I proposed for further thought is:
  Why can't an application call into a system level C/R system  
directly when running with MPI? Why does it need MPI's support here?
As a related, but maybe slightly less complex question:
  What is hindering application level checkpoint/restart mechanisms  
when running with MPI?

The answer to the former question is a combination of possibly saving  
internal state of the MPI, interconnect support, and message tracking  
(and more). This is a complex issue, per the discussion on and off the  
list so far. I like the spirit of some of the proposals, particularly  
a simple, non-binding API to signal the intentions of the application.  
I think these need more refinement, and I don't have a clear idea for  
a solution yet.

The answer to the latter is, in my opinion, not much. Applications  
already take their own internal checkpoints and can restart from them  
during a relaunch of MPI. If we rephrase the question to be:
  How can we better support application level checkpoint/restart with  
MPI?

Then we have something we can explore. At large scale, relaunching the
entire application from an application level checkpoint just to
recover from the failure of a small subset of processes may be too
time consuming (and largely overkill). Instead we should think about
how we define the MPI state after a process failure so the application
can continue running, and about mechanisms that allow the application
to replace processes. In the meeting in Chicago last week, I suggested
that we may want to look at Spawn as a way to reintroduce processes to  
a job. I'm sure there are other options to explore here.

Though I am really enjoying the conversation on the list so far with
regard to system-level C/R, I think we should not forget to direct
some of our energy toward immediate, broad-reaching goals. I'm not
proposing that we stop the conversation on C/R and MPI; I encourage it
to continue. What I am trying to suggest is that we first focus on the
foundation, and develop solid proposals there before we turn our
attention solely to the more involved and controversial proposals on
the table.

Some foundational questions might include:
  - How does an application signal to the MPI implementation that it
wishes to enable the fault tolerance features of MPI (vs.
MPI_ERRORS_RETURN or MPI_ERRORS_ARE_FATAL)?
  - What is the state of MPI after a process failure with FT enabled?
  - What is the state of communicators?
  - If we leave a 'hole' in the communicator how do we allow the  
application to interact with such a sparse communicator?
  - What is the state of all of the MPI interfaces when an error occurs?
  - Do we want to discuss how to re-introduce a process to an already  
existing communicator?
  - Do we want to discuss the ability to grow and shrink a communicator?
  - What are the theoretical bounds of any proposal on the table? We
need to state them clearly (e.g. we cannot detect failure 100% of the
time, but we can get close).
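
To make the 'hole' and shrink questions above a bit more concrete, here
is a toy model in plain Python. It is not MPI code and does not reflect
any existing or proposed MPI interface; the class name SparseComm and
all of its methods are hypothetical, made up purely for illustration.
It only contrasts two options: keeping the original rank numbering with
holes, versus renumbering the survivors densely.

```python
from dataclasses import dataclass, field


@dataclass
class SparseComm:
    """Toy stand-in for a communicator that keeps holes for failed ranks.

    Hypothetical illustration only; nothing here is an MPI API.
    """
    size: int
    failed: set = field(default_factory=set)

    def mark_failed(self, rank: int) -> None:
        # Record a failure; the rank number stays reserved as a hole.
        if not 0 <= rank < self.size:
            raise ValueError(f"rank {rank} outside communicator of size {self.size}")
        self.failed.add(rank)

    def is_alive(self, rank: int) -> bool:
        return 0 <= rank < self.size and rank not in self.failed

    def live_ranks(self) -> list:
        # Sparse view: survivors keep their original rank numbers.
        return [r for r in range(self.size) if r not in self.failed]

    def shrink(self) -> dict:
        # Dense view: map each surviving old rank to a new contiguous rank.
        return {old: new for new, old in enumerate(self.live_ranks())}

    def replace(self, rank: int) -> None:
        # Fill a hole in place, modelling re-introduction of a process.
        self.failed.discard(rank)
```

For example, if rank 2 of a 4-process communicator fails, the sparse
view still addresses the survivors as 0, 1 and 3, while shrink() would
renumber old rank 3 as new rank 2. Which of these the application
should see (and whether it gets to choose) is exactly the kind of
question I think we need to answer first.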

My impression is that proposals directed at these fundamental  
questions are of immediate interest to the forum. Other topics such as  
piggybacking, C/R, replication (to name a few) must build on these  
foundational topics. So while we are trying to address these questions  
I would encourage those interested in the C/R proposal to continue  
discussion. This way once the foundation is established we come out of  
the C/R discussion with either a solid proposal for the group, or a  
list of open research questions that need to be addressed before such  
a proposal is possible.

-- Josh

On Oct 25, 2008, at 2:03 PM, Greg Bronevetsky wrote:

> We're trying to come up with a good set of semantics that would  
> serve a variety of system-level C/R vendors. I'm sure that if we  
> target the needs of EverGrid or BLCR on Linux then we will succeed.  
> However, that is not our goal. Our goal is to target user-level,  
> kernel-level and VMM-level C/R (or hybrids) on the full range of  
> platforms that may be supported by MPI. Focusing on Linux for the  
> moment, let's look at the difference between user-level C/R and
> VMM-level C/R. VMM-level C/R sees a very large fraction of the system,
> meaning that it will likely be acceptable for the MPI_Prepare call
> to pull all message state into CPU or network card memory and just
> ensure that there is no message data on the wires and the switches.
> User-level C/R is much less capable, meaning that all MPI state must  
> be not just in CPU RAM, but specifically in application-accessible  
> memory space and may not have any problematic kernel-level  
> attributes such as pinned/unpinned status. When interfacing a given  
> C/R tool with MPI these details become important and we must specify  
> them explicitly. If we do not, we'll devolve into the current  
> situation where each C/R tool must interface with each MPI vendor to  
> get things to work.
>
> Moving to other platforms that may not have the same definitions of  
> user-level/kernel-level/VMM-level as does Linux, we have a much  
> deeper problem. A given type of checkpointer virtualizes the system  
> at a given level of abstraction, meaning that in order to work with  
> such a checkpointer MPI_Prepare must move all MPI state above that  
> level. Even if we take the most conservative approach of forcing  
> MPI_Prepare to move MPI state to the highest level possible, this  
> still has the basic language problem of specifying what this might  
> mean on platforms that don't even exist today. I just don't know how  
> to do that.
>
> As such, the only plausible way that I can see for us to provide  
> checkpointing support within the MPI standard is to move this  
> virtualization all the way to the level of the MPI specification and  
> keep MPI internals as a black box. If the application wants to  
> checkpoint MPI, it must use the MPI interface to checkpoint the  
> application-visible state. If it wants to checkpoint MPI datatypes,  
> it can use PMPI to wrap all datatype management calls and keep track  
> of them on its own. If it wants to checkpoint message state, it can  
> use piggybacking and the standard MPI communication calls to do the  
> appropriate logging and coordination. We may want to add extra MPI  
> calls to facilitate this (piggybacking is one such example) but the  
> main point of this approach is that it works at a level that is  
> natural for the MPI specification and doesn't force us to define  
> internal details that have so far been left unspecified by the  
> standard.
>
>
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov
>
> At 03:09 PM 10/24/2008, Supalov, Alexander wrote:
>> Thanks. I think the word "how" below is decisive.
>>
>> The definitions of MPI_Init and MPI_Finalize do not say "how"
>> processes are created, and still, they work. Likewise, as soon as we
>> can define the expected outcome of the proposed calls, we can offload
>> the "how" to the system - in this case, the CR system.
>>
>> Now we come to the expected outcome. Imagine we guarantee that there's
>> no MPI communication between the PREPARE and RESTORE calls, and no
>> messages stuck on the wire or in the buffers. What can be stored in
>> the system memory covered by CR will be stored there. The rest will be
>> restored by the RESTORE call once it gets control over this memory
>> image back. This may include reinitialization of the networking
>> hardware, reestablishment of connections, reopening of the files, etc.
>>
>> What other guarantees do CR people want?
>>
>> -----Original Message-----
>> From: mpi3-ft-bounces at lists.mpi-forum.org
>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg
>> Bronevetsky
>> Sent: Friday, October 24, 2008 11:38 PM
>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working  
>> Group;
>> MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> Subject: Re: [Mpi3-ft] system-level C/R requirements
>>
>> The imprecision comes from the MPI library's interactions with the
>> C/R tool. It seems to me that each tool/MPI library combo will have
>> to define their own semantics for the two calls. The only standard
>> here would be that these functions would in fact need to be called.
>> But at this point it's not really an API but a mild constraint on how
>> the real API would be used. As such, it doesn't carry much
>> information about how checkpointing is to be done. What the
>> system-level C/R people need is a set of guarantees about where MPI
>> will put its state between the two calls. I don't see a way to give
>> them such guarantees in a standardized way. This is where we get
>> stuck: we either provide a couple of calls that have little
>> informational content but rather serve as placeholders for a real API, or
>> go fully detailed on a platform-by-platform basis. Neither approach
>> is likely to pass by the wider forum, which is why I don't know how
>> to satisfy this need within the MPI 3.0 effort.
>>
>> It looks to me like we're standardizing something that is already
>> non-standard.
>>
>> Greg Bronevetsky
>> Post-Doctoral Researcher
>> 1028 Building 451
>> Lawrence Livermore National Lab
>> (925) 424-5756
>> bronevetsky1 at llnl.gov
>>
>> At 02:27 PM 10/24/2008, Supalov, Alexander wrote:
>> >Thanks. Can (or should) one define semantics better than those of
>> >MPI_INIT and MPI_FINALIZE? The MPI job starts after MPI_Init. The job
>> >ends after MPI_Finalize. What happens before and after is almost
>> >undefined. This is about all the standard specifies, and it's rather
>> >clear why: it cannot prescribe the way in which processes are started,
>> >because it's very system specific. CR is possibly even more system
>> >specific.
>> >
>> >Let's get back to the proposal:
>> >
>> >MPI_PREPARE_FOR_CHECKPOINT(MPI_COMM)    ~ MPI_FINALIZE
>> >MPI_RESTORE_AFTER_CHECKPOINT(MPI_COMM)  ~ MPI_INIT
>> >
>> >Use MPI_COMM_WORLD for global CR. Use MPI_COMM_SELF for local CR.
>> >
>> >Call the first function immediately before the checkpoint, do the
>> >checkpoint the way you like, and call the second immediately after to
>> >re-enter the MPI session where you left it.
>> >
>> >What else can be added to make this more clear and more precise than
>> >MPI_INIT and MPI_FINALIZE definitions?
>> >
>> >-----Original Message-----
>> >From: mpi3-ft-bounces at lists.mpi-forum.org
>> >[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg
>> >Bronevetsky
>> >Sent: Friday, October 24, 2008 11:14 PM
>> >To: MPI 3.0 Fault Tolerance and Dynamic Process Control working  
>> Group;
>> >MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> >Subject: Re: [Mpi3-ft] system-level C/R requirements
>> >
>> >I think that the problem for the forum will be the unclear semantics
>> >of the new calls. MPI_Init is not a good example because it has clear
>> >semantics for all users of MPI but not system-level services. The
>> >difference with the quiescence calls is that we're trying to provide a
>> >way to bypass the regular MPI semantics and plug into the middle of
>> >MPI without precisely defining how the bypass works. Precise
>> >semantics didn't matter for MPI_Init exactly because there has never
>> >been a way to look into the MPI implementation until now. The
>> >solution to this is to provide very loose semantics to the new calls
>> >but this just means that there will actually be no standard way to
>> >use the new calls, which is why I'm afraid the forum will not like it.
>> >
>> >I can think of only two things that we can compare these calls to.
>> >The first is the proposed performance hint API. However, this API is
>> >just about hints and may not be a good enough analogy for the rest of
>> >the forum. The other analogy is the performance profiling APIs that
>> >some MPI implementations support. These APIs allow tools to determine
>> >some statistics about internal MPI state. If that is the analogy that
>> >is drawn, then it is bad for this proposal because I don't think that
>> >the performance profiling API ever got much support because of the
>> >issues that we're discussing here.
>> >
>> >Greg Bronevetsky
>> >Post-Doctoral Researcher
>> >1028 Building 451
>> >Lawrence Livermore National Lab
>> >(925) 424-5756
>> >bronevetsky1 at llnl.gov
>> >
>> >At 02:03 PM 10/24/2008, Supalov, Alexander wrote:
>> > >Thanks. I can't speak for the whole Forum, but my impression is that
>> > >if the choice is between solving the problem of MPI and CR on one
>> > >hand, and not solving it on the other hand, a reasonable proposal
>> > >will go a long way toward convincing the majority, or at least moving
>> > >the discussion to a still better proposal.
>> > >
>> > >As for the number of calls, this is a question of ROI. We're going to
>> > >add 200 or so fancy calls by the latest guess, while here we have
>> > >just 2 that offer basic functionality of undeniable value. This
>> > >should be acceptable.
>> > >
>> > >Finally, I don't know a more implementation specific call than
>> > >MPI_Init. The proposed calls live close by.
>> >
>> >
>> >
>> >_______________________________________________
>> >mpi3-ft mailing list
>> >mpi3-ft at lists.mpi-forum.org
>> >http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>> >---------------------------------------------------------------------
>> >Intel GmbH
>> >Dornacher Strasse 1
>> >85622 Feldkirchen/Muenchen Germany
>> >Sitz der Gesellschaft: Feldkirchen bei Muenchen
>> >Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
>> >Registergericht: Muenchen HRB 47456 Ust.-IdNr.
>> >VAT Registration No.: DE129385895
>> >Citibank Frankfurt (BLZ 502 109 00) 600119052
>> >
>> >This e-mail and any attachments may contain confidential material for
>> >the sole use of the intended recipient(s). Any review or distribution
>> >by others is strictly prohibited. If you are not the intended
>> >recipient, please contact the sender and delete all copies.



