[Mpi3-ft] system-level C/R requirements
Mike Heffner
mike.heffner at librato.com
Mon Oct 27 08:59:33 CDT 2008
I was referring to the MPI stack supporting the PREPARE and RESTORE
calls. Ideally we would like to take existing MPI applications and
add C/R support without any modifications to their source.
Similarly, if we can separate the C/R requirements from the
application design, the developer can focus purely on the problem
their application is trying to solve and not on C/R.
However, if you intend to support C/R in a way that is completely
separate from the application's use of MPI, then you will have
scenarios where a checkpoint is initiated while the application is
blocked in MPI_Recv(), MPI_Barrier(), or any other blocking
operation. Therefore, to be most valuable, the PREPARE and RESTORE
operations should be invocable asynchronously during the
application's MPI communication.
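
To make the asynchronous case concrete, here is a toy sketch in Python.
It is not MPI code and not a proposed interface: ToyEndpoint, prepare(),
restore(), checkpoint_image() and run_demo() are hypothetical names made
up for illustration. It only shows one thread blocked in a receive while
a second thread brackets a checkpoint, with delivery deferred until the
restore step completes.

```python
import threading


class ToyEndpoint:
    """Hypothetical stand-in for an MPI stack (assumption, not a real API)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._inbox = []             # messages that arrived but are undelivered
        self._checkpointing = False  # True between prepare() and restore()
        self.events = []             # ordered trace of what happened

    def send(self, msg):
        with self._cond:
            self._inbox.append(msg)
            self._cond.notify_all()

    def recv(self):
        """Blocking receive; never completes inside a checkpoint window."""
        with self._cond:
            # wait() releases the lock, so prepare() can run while we block.
            while self._checkpointing or not self._inbox:
                self._cond.wait()
            msg = self._inbox.pop(0)
            self.events.append(("recv", msg))
            return msg

    def prepare(self):
        """Called asynchronously, e.g. from a second thread."""
        with self._cond:
            self._checkpointing = True
            self.events.append(("prepare", None))

    def checkpoint_image(self):
        """State the C/R system would save: the undelivered messages."""
        with self._cond:
            return list(self._inbox)

    def restore(self):
        with self._cond:
            self._checkpointing = False
            self.events.append(("restore", None))
            self._cond.notify_all()  # let blocked receivers make progress


def run_demo():
    ep = ToyEndpoint()
    received = []
    receiver = threading.Thread(target=lambda: received.append(ep.recv()))
    receiver.start()                 # blocks: nothing has been sent yet
    ep.prepare()                     # checkpoint begins while recv is blocked
    ep.send("hello")                 # arrives during the checkpoint window
    image = ep.checkpoint_image()
    ep.restore()                     # receiver may now complete
    receiver.join(timeout=5)
    return received, ep.events, image


if __name__ == "__main__":
    print(run_demo())
```

In this toy the message that arrives during the checkpoint window is
held in the saved image and delivered only after restore(). A real
stack would also have to drain or account for in-flight network
traffic, which the sketch does not attempt to model.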
Mike
Supalov, Alexander wrote:
> Thanks. What stack do you mean here?
>
> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org
> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Mike Heffner
> Sent: Saturday, October 25, 2008 2:48 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] system-level C/R requirements
>
> Supalov, Alexander wrote:
>> Thanks. I think the word "how" below is decisive.
>>
>> The definitions of MPI_Init and MPI_Finalize do not say "how" processes
>> are created, and still, they work. Likewise, as soon as we can define
>> the expected outcome of the proposed calls, we can offload the "how" to
>> the system - in this case, the CR system.
>>
>> Now we come to the expected outcome. Imagine we guarantee that there's
>> no MPI communication between the PREPARE and RESTORE calls, and no
>> messages stuck in the wire or in the buffers. What can be stored in the
>> system memory covered by CR will be stored there. The rest will be
>> restored by the RESTORE call once it gets control over this memory image
>> back. This may include reinitialization of the networking hardware,
>> reestablishment of connections, reopening of the files, etc.
>>
>> What other guarantees do CR people want?
>>
>
> If the stack supported these calls asynchronously during MPI
> communication -- either from a signal handler or from a second thread --
> then I think that definition would go a fair way towards what would be
> required.
>
>
> Mike
>