[mpi3-ft] Notes from 15 Jan. 2008 Meeting

Tue Jan 22 12:32:54 CST 2008

>* Replication techniques should be added to the (presented) list of
>possible recovery techniques to consider supporting.
>    - Question: Does considering replication techniques undermine the
>HPC target community of the MPI standard?
>    - MPI interface should allow for this, maybe as an opt-in
>functionality.
Replication of MPI tasks? Is this really necessary? It seems to me 
that if you're going to go for something this expensive, it would be 
simple to add replication as a layer above MPI.
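
To make the "layer above MPI" idea concrete, here is a rough sketch of 
the bookkeeping such a layer would need. Everything in it is my own 
illustration: the names replicas_of, logical_of and ReplicatedReceiver 
are hypothetical, the rank layout is just one possible choice, and no 
actual MPI calls are made.

```python
def replicas_of(logical_rank, n_logical, degree):
    """Physical ranks hosting copies of one logical rank.

    Assumed layout: replica k of logical rank r runs on physical rank
    r + k * n_logical.
    """
    return [logical_rank + k * n_logical for k in range(degree)]


def logical_of(physical_rank, n_logical):
    """Map a physical rank back to the logical rank it replicates."""
    return physical_rank % n_logical


class ReplicatedReceiver:
    """Delivers each logical message once and drops the duplicate copies
    that arrive from the sender's other replicas."""

    def __init__(self):
        self._seen = set()  # (source logical rank, sequence number)

    def deliver(self, src_logical, seq, payload):
        key = (src_logical, seq)
        if key in self._seen:
            return None  # duplicate from another replica
        self._seen.add(key)
        return payload
```

A send from the application would go to every rank in replicas_of(dest, 
...), and each receive would pass through deliver(). The cost is at 
least a factor of "degree" in processes and messages, which is the 
expense I was referring to.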

>* An interface to easily piggyback data on point-to-point (and maybe
>collective) messages would help support some high level checkpoint/
>restart techniques.

We're working on putting together a piggybacking proposal. 
Checkpointing protocols are one of the use-cases.
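
This is not the proposal itself, only a minimal sketch of what a 
checkpointing protocol has to do by hand today without a piggybacking 
interface: prepend a small header to every outgoing payload and strip 
it on receipt. The header layout (a single 32-bit checkpoint epoch) and 
all function names are assumptions made for illustration.

```python
import struct

# Hypothetical header: one unsigned 32-bit checkpoint epoch, network byte order.
_HEADER = struct.Struct("!I")


def pack_with_piggyback(epoch, payload):
    """Prepend the sender's checkpoint epoch to the application payload."""
    return _HEADER.pack(epoch) + payload


def unpack_piggyback(message):
    """Split a received message into (sender_epoch, payload)."""
    (epoch,) = _HEADER.unpack_from(message)
    return epoch, message[_HEADER.size:]


def is_late_message(sender_epoch, receiver_epoch):
    """True if the message was sent before the sender's checkpoint but
    received after the receiver's, so the protocol would need to log it."""
    return sender_epoch < receiver_epoch
```

The extra copy in pack_with_piggyback is exactly the kind of overhead 
that a native piggybacking interface could avoid.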

>* Should we expose to the application the fault recovery mechanism
>used? (i.e., Network path failover)
>    - As a notification once complete?
>    - Or as a request to authorize the recovery mechanism before it
>executes?
We can only do that if we have a list of well-defined recovery 
mechanisms that we can present in the spec. I doubt that anybody is 
going to commit to such a list, so the most we may be able to get 
away with is classes of recovery mechanisms: network recovery, node 
recovery, etc. I'm not sure what the user would be able to do with 
such information. They probably mostly want to know how long recovery 
will take and which MPI-level objects will not be recovered (e.g., 
the state of the dead process).

>* Some users have expressed an interest in having an explicit
>checkpoint function to identify 'good' times in the code execution to
>checkpoint. Ideally when the checkpoint state is minimal.
I don't think that the MPI forum should be in the business of 
defining a checkpointing API. Checkpointing is a whole-application 
operation, whereas MPI is just a library. Can you imagine the chaos 
that would ensue if every library decided on its own checkpointing 
API? The user would need to call each library's own relevant calls 
any time they wanted to do anything checkpoint-related. To keep 
things sane, the whole-application checkpointer must provide the user 
with a single API for all of their checkpointing needs. There is no 
reason to make them issue 
MPI-specific checkpointing calls unless we can come up with some 
MPI-specific optimizations that users must be directly involved in. 
I'm skeptical about this possibility.

We could of course define a couple of functions with the explicit 
intent that they should be used by the whole-application 
checkpointer, rather than the user. However, most prior work on MPI 
checkpointing has either modified MPI itself or has worked without 
any modifications to the MPI spec, meaning that there is little 
motivation to create such an API.

>* Can we ask the MPI to save some internal state around a user level
>checkpoint operation such that on recovery MPI objects such as
>datatypes or communicators are automatically recreated for the user
>before the application resumes execution?
This can be efficiently done by a layer above MPI, so I'm not sure 
that such functionality needs to be in the spec.
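
As a sketch of what I have in mind, the layer can log every call that 
creates an opaque object and replay the log on recovery. Handle and 
CreationLog are hypothetical names, and the constructors passed in 
stand for thin wrappers around calls such as MPI_Type_contiguous or 
MPI_Comm_split. Freeing objects is ignored to keep it short.

```python
class Handle:
    """Stable, checkpointable stand-in for an opaque library object."""

    def __init__(self, index):
        self.index = index


class CreationLog:
    def __init__(self):
        self.calls = []  # (constructor, args) in creation order; checkpointed
        self.live = []   # live library objects; rebuilt on recovery

    def _resolve(self, arg):
        # Arguments that are Handles refer to earlier objects in the log.
        return self.live[arg.index] if isinstance(arg, Handle) else arg

    def create(self, constructor, *args):
        """Create an object through the layer and remember how it was made."""
        self.calls.append((constructor, args))
        self.live.append(constructor(*[self._resolve(a) for a in args]))
        return Handle(len(self.live) - 1)

    def get(self, handle):
        """Return the live object currently behind a handle."""
        return self.live[handle.index]

    def replay(self):
        """Recreate every object, in the original order, after a restart."""
        self.live = []
        for constructor, args in self.calls:
            self.live.append(constructor(*[self._resolve(a) for a in args]))
```

The application holds only Handle values, so its own checkpoint stays 
valid across a restart, and the library underneath needs no changes.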

>* Collective communication:
>    - For some collectives a process can be in the collective or on
>either side of it, how do the semantics of the collective operation
>change in response to a failure/recovery operation?
>    - Bcast, for example, may allow some processes to exit early
>providing some performance benefits. A fault tolerance semantic for
>Bcast may require a global synchronization at the end of the Bcast to
>ensure a two-phase commit of the affected buffers.
>    - We could consider loosening the synchronization constraints such
>that the user could choose, via an MPI API, the level of confidence
>they require for a process to exit a collective call.
>    - Maybe introduce the notion of epochs that can be started and
>ended to help in fault isolation.
Here's another option: if a collective fails, each receiver of the 
collective either receives the data or aborts (i.e., no hangs and no 
invalid data). If the user is employing other fault tolerance 
techniques such as checkpointing or message logging, this 
functionality should be sufficient.
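
To illustrate the "receive or abort" semantic from the receiver's side, 
here is a toy sketch. It assumes a timeout as a stand-in for a real 
failure detector and a CRC as a stand-in for whatever integrity check 
the implementation uses. The queue plays the role of the network, and 
all names are hypothetical.

```python
import queue
import zlib


class CollectiveAborted(Exception):
    """Raised instead of hanging or returning invalid data."""


def send_with_checksum(channel, data):
    """Sender side: attach a CRC so the receiver can detect corruption."""
    channel.put((zlib.crc32(data), data))


def receive_or_abort(channel, timeout_s):
    """Receiver side: return valid data or raise CollectiveAborted."""
    try:
        checksum, data = channel.get(timeout=timeout_s)
    except queue.Empty:
        # Nothing arrived in time: treat the sender as failed, do not hang.
        raise CollectiveAborted("no data before timeout") from None
    if zlib.crc32(data) != checksum:
        raise CollectiveAborted("corrupted data")
    return data
```

A checkpointing or message-logging layer would catch CollectiveAborted 
and start its own recovery, which is why I think nothing stronger is 
needed from the collective itself.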

>* Could fault tolerance only apply to a subset of the MPI API for
>simplicity? For example, exclude collective communications or MPI I/O.
Most applications use at least point-to-point operations or 
collectives, so if you disable fault tolerance for collectives, you 
disable it for most applications.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov