[Mpi3-ft] system-level C/R requirements
alexander.supalov at intel.com
Mon Oct 27 09:10:12 CDT 2008
Thanks. This would probably be comparable to supporting threads at
MPI_THREAD_SINGLE and MPI_THREAD_MULTIPLE, i.e., we would have to
introduce several CR support levels: I can imagine that some MPIs may
not be prepared to deal with asynchronous CR.
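To make the analogy concrete, here is a minimal sketch of what such levels might look like, modeled on the required/provided handshake of MPI_Init_thread. All names (cr_level, CR_LEVEL_*, cr_negotiate) are hypothetical and not part of MPI or of any proposal in this thread. As a simplification, the sketch never grants more than was asked for.

```c
#include <assert.h>

/* Hypothetical C/R support levels, ordered the way the MPI thread
 * levels are (MPI_THREAD_SINGLE < ... < MPI_THREAD_MULTIPLE).
 * None of these names exist in MPI; they are for illustration only. */
typedef enum {
    CR_LEVEL_NONE  = 0,  /* no checkpoint/restart support */
    CR_LEVEL_SYNC  = 1,  /* PREPARE/RESTORE only at points the application chooses */
    CR_LEVEL_ASYNC = 2   /* PREPARE/RESTORE may arrive during MPI communication */
} cr_level;

/* Mirrors the required/provided idea of MPI_Init_thread: the caller
 * asks for a level, and the result is the lower of that request and
 * the highest level the library supports. */
cr_level cr_negotiate(cr_level required, cr_level library_max)
{
    assert(required >= CR_LEVEL_NONE && required <= CR_LEVEL_ASYNC);
    return required <= library_max ? required : library_max;
}
```

An MPI that cannot handle asynchronous CR would report CR_LEVEL_SYNC here even when CR_LEVEL_ASYNC was requested, and the CR system would have to respect that.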
From: mpi3-ft-bounces at lists.mpi-forum.org
[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Mike Heffner
Sent: Monday, October 27, 2008 3:00 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] system-level C/R requirements
I was referring to the MPI stack supporting the PREPARE and RESTORE
calls. Ideally we would like to take existing MPI applications and
add C/R support without any modifications to their source.
Similarly, if we can separate the requirements of C/R from the
application design, the developer can focus purely on the problem
their application is trying to solve and not on C/R.
However, if you intend to support C/R in a way that is completely
separate from the application's use of MPI, then you will have
scenarios in which a checkpoint is initiated while the application is
blocked in MPI_Recv(), MPI_Barrier(), or any other blocking
operation. Therefore, to be most valuable, the PREPARE and RESTORE
operations should be invocable asynchronously during the
application's MPI communication.
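The scenario can be sketched roughly as follows. This is an illustration under stated assumptions, not how any MPI implements it: cr_prepare() and cr_restore() are hypothetical stand-ins for the PREPARE and RESTORE calls, the loop stands in for the progress engine inside a blocking call such as MPI_Recv(), and SIGUSR1 (POSIX) is assumed to be the signal the C/R system uses to request a checkpoint.

```c
#include <signal.h>

/* Set by the signal handler, polled by the progress loop. */
static volatile sig_atomic_t cr_requested = 0;

static void on_checkpoint_signal(int signo)
{
    (void)signo;
    cr_requested = 1;  /* do only async-signal-safe work in the handler */
}

/* Hypothetical stand-ins for the PREPARE and RESTORE calls. */
static void cr_prepare(void) { /* quiesce communication, drain messages */ }
static void cr_restore(void) { /* re-establish connections, reopen files */ }

/* Stand-in for the progress loop inside a blocking MPI call.
 * It polls `polls` times; at iteration `signal_at_poll` the sketch
 * raises the signal itself to simulate an asynchronous request from
 * the C/R system. Returns how many checkpoint requests were serviced. */
int blocking_wait_sketch(int polls, int signal_at_poll)
{
    int serviced = 0;

    signal(SIGUSR1, on_checkpoint_signal);
    for (int i = 0; i < polls; i++) {
        if (i == signal_at_poll)
            raise(SIGUSR1);  /* simulated external checkpoint request */
        if (cr_requested) {
            cr_requested = 0;
            cr_prepare();
            /* the system-level checkpoint would be taken here */
            cr_restore();
            serviced++;
        }
    }
    return serviced;
}
```

The application code that called the blocking operation never sees the checkpoint. The request is handled inside the library, which is what makes C/R without source modifications possible.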
Supalov, Alexander wrote:
> Thanks. What stack do you mean here?
> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org
> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Mike Heffner
> Sent: Saturday, October 25, 2008 2:48 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] system-level C/R requirements
> Supalov, Alexander wrote:
>> Thanks. I think the word "how" below is decisive.
>> The definitions of MPI_Init and MPI_Finalize do not say "how"
>> are created, and still, they work. Likewise, as soon as we can define
>> the expected outcome of the proposed calls, we can offload the "how" to
>> the system - in this case, the CR system.
>> Now we come to the expected outcome. Imagine we guarantee that there is
>> no MPI communication between the PREPARE and RESTORE calls, and no
>> messages stuck on the wire or in the buffers. What can be stored in
>> system memory covered by CR will be stored there. The rest will be
>> restored by the RESTORE call once it gets control over this memory
>> back. This may include reinitialization of the networking hardware,
>> reestablishment of connections, reopening of the files, etc.
>> What other guarantees do CR people want?
> If the stack supported these calls asynchronously during MPI
> communication -- either from a signal handler or from a second thread --
> then I think that definition would go a fair way towards what would be