[mpi3-ft] Notes from 15 Jan. 2008 Meeting
Greg Bronevetsky
bronevetsky1 at llnl.gov
Tue Jan 22 12:32:54 CST 2008
>* Replication techniques should be added to the (presented) list of
>possible recovery techniques to consider supporting.
> - Question: Does considering replication techniques undermine the
>HPC target community of the MPI standard?
> - MPI interface should allow for this, maybe as an opt-in
>functionality.
Replication of MPI tasks? Is this really necessary? It seems to me
that if you're going to go for something this expensive, it would be
simple to add replication as a layer above MPI.
>* An interface to easily piggyback data on point-to-point (and maybe
>collective) messages would help support some high level checkpoint/
>restart techniques.
We're working on putting together a piggybacking proposal.
Checkpointing protocols are one of the use-cases.
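For the curious, here is a rough sketch of what piggybacking above MPI
can look like (this is not the proposal; pb_send and pb_header are
made-up names): the layer packs a small protocol header in front of
the user's payload and sends the combined buffer.

    #include <mpi.h>
    #include <stdlib.h>

    /* Example piggyback metadata; the real contents would depend on
       the checkpointing protocol. */
    typedef struct { int epoch; int flags; } pb_header;

    int pb_send(void *buf, int count, MPI_Datatype dt,
                int dest, int tag, MPI_Comm comm, pb_header hdr)
    {
        int hdr_bytes, payload_bytes, total, pos = 0, rc;
        MPI_Pack_size((int)sizeof(pb_header), MPI_BYTE, comm, &hdr_bytes);
        MPI_Pack_size(count, dt, comm, &payload_bytes);
        total = hdr_bytes + payload_bytes;
        char *tmp = malloc(total);
        /* piggybacked header first, then the user's payload */
        MPI_Pack(&hdr, (int)sizeof(pb_header), MPI_BYTE, tmp, total, &pos, comm);
        MPI_Pack(buf, count, dt, tmp, total, &pos, comm);
        rc = MPI_Send(tmp, pos, MPI_PACKED, dest, tag, comm);
        free(tmp);
        return rc;
    }

The receiver would MPI_Recv into a temporary buffer and MPI_Unpack
the header and then the payload. One motivation for a standardized
interface is to avoid this extra pack/copy and to cover collectives
and derived datatypes cleanly.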
>* Should we expose to the application the fault recovery mechanism
>used? (i.e., Network path failover)
> - As a notification once complete?
> - Or as a request to authorize the recovery mechanism before it
>executes?
We can only do that if we have a list of well-defined recovery
mechanisms that we can present in the spec. I doubt that anybody is
going to commit to such a list, so the only thing that we may be able
to get away with is classes of recovery mechanisms: network
recovery, node recovery, etc. I'm not sure what the user would be
able to do with such information. They probably mostly want to know
how long recovery will take and which MPI-level objects will not be
recovered (e.g., the state of the dead process).
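Purely to illustrate what "classes of recovery mechanisms" could look
like (nothing here exists or is proposed anywhere; all names are made
up), a notification might carry little more than the class, a time
estimate, and the MPI objects that will not come back:

    #include <mpi.h>

    typedef enum {
        RECOVERY_NETWORK,   /* e.g., network path failover */
        RECOVERY_NODE,      /* e.g., process restarted elsewhere */
        RECOVERY_OTHER
    } recovery_class;

    typedef struct {
        recovery_class cls;
        double         est_seconds;  /* expected time to complete recovery */
        MPI_Group      lost_procs;   /* state that will not be recovered */
    } recovery_info;

    /* The application registers a handler and is told, after the fact,
       which class of recovery ran and what it could not restore. */
    typedef void (*recovery_cb)(MPI_Comm comm, const recovery_info *info);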
>* Some users have expressed an interest in having an explicit
>checkpoint function to identify 'good' times in the code execution to
>checkpoint. Ideally when the checkpoint state is minimal.
I don't think that the MPI forum should be in the business of
defining a checkpointing API. Checkpointing is a whole-application
operation, whereas MPI is just a library. Can you imagine the chaos
that would ensue if every library decided on its own checkpointing
API? The user would need to call each library's own relevant calls
any time they wanted to do anything checkpoint-related. To keep
things sane, the whole-application checkpointer must provide the user
with a single API for all of their checkpointing needs. There is no
reason to make them issue MPI-specific checkpointing calls unless we
can come up with some MPI-specific optimizations that users must be
directly involved in. I'm skeptical about that possibility.
We could, of course, define a couple of functions with the explicit
intent that they be used by the whole-application checkpointer rather
than by the user. However, most prior work on MPI checkpointing has
either modified the MPI implementation itself or worked without any
changes to the MPI spec, so there is little motivation to create such
an API.
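For what it's worth, here is what the single-API picture looks like
from the user's side (app_checkpoint and do_timestep are made-up
names; app_checkpoint would be owned by a hypothetical
whole-application checkpointer, not by MPI):

    /* Provided by the hypothetical whole-application checkpointing
       layer, which coordinates MPI and every other library it knows
       how to save. */
    extern int app_checkpoint(const char *dir);

    extern void do_timestep(int step);   /* the application's own work */

    void main_loop(int nsteps, int interval)
    {
        for (int step = 0; step < nsteps; step++) {
            do_timestep(step);
            if (step % interval == 0)
                app_checkpoint("/scratch/ckpt");  /* the only checkpoint
                                                     call the user issues */
        }
    }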
>* Can we ask the MPI library to save some internal state around a
>user-level checkpoint operation such that, on recovery, MPI objects
>such as datatypes or communicators are automatically recreated for
>the user before the application resumes execution?
This can be efficiently done by a layer above MPI, so I'm not sure
that such functionality needs to be in the spec.
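As a sketch of what I mean by a layer above MPI (ckpt_comm_split and
ckpt_replay_comms are made-up names, and for brevity only splits of a
single parent communicator are handled), the layer can log
communicator-creating calls as they happen and replay the log after
restart:

    #include <mpi.h>

    #define MAX_SPLITS 64
    static struct { int color, key; MPI_Comm *result; } split_log[MAX_SPLITS];
    static int nsplits = 0;

    /* Called by the application instead of MPI_Comm_split. */
    int ckpt_comm_split(MPI_Comm parent, int color, int key, MPI_Comm *newcomm)
    {
        split_log[nsplits].color  = color;
        split_log[nsplits].key    = key;
        split_log[nsplits].result = newcomm;
        nsplits++;
        return MPI_Comm_split(parent, color, key, newcomm);
    }

    /* Called by the layer after MPI is reinitialized on restart. */
    void ckpt_replay_comms(MPI_Comm parent)
    {
        for (int i = 0; i < nsplits; i++)
            MPI_Comm_split(parent, split_log[i].color, split_log[i].key,
                           split_log[i].result);
    }

The same idea extends to datatypes, attributes, and so on, all without
touching the spec.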
>* Collective communication:
> - For some collectives a process can be in the collective or on
>either side of it; how do the semantics of the collective operation
>change in response to a failure/recovery operation?
> - Bcast, for example, may allow some processes to exit early,
>providing some performance benefits. A fault tolerance semantic for
>Bcast may require a global synchronization at the end of the Bcast to
>ensure a two-phase commit of the affected buffers.
> - We could consider loosening the synchronization constraints such
>that the user could choose, via an MPI API, the level of confidence
>they require for a process to exit a collective call.
> - Maybe introduce the notion of epochs that can be started and
>ended to help in fault isolation.
Here's another option: if a collective fails, each receiver of the
collective either receives the data or aborts (i.e. no hangs or
invalid data). If the user is employing other fault tolerance
techniques such as checkpointing or message logging, this
functionality should be sufficient.
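A minimal sketch of that behavior, assuming an implementation that
returns an error code from the failed collective instead of aborting
on its own (bcast_or_abort is a made-up wrapper name):

    #include <mpi.h>

    void bcast_or_abort(void *buf, int count, MPI_Datatype dt,
                        int root, MPI_Comm comm)
    {
        /* Ask for error codes instead of the default abort-on-error. */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        if (MPI_Bcast(buf, count, dt, root, comm) != MPI_SUCCESS) {
            /* Never hand invalid data back to the caller: give up here,
               and let a checkpointing or message-logging layer restart. */
            MPI_Abort(comm, 1);
        }
    }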
>* Could fault tolerance only apply to a subset of the MPI API for
>simplicity? For example, exclude collective communications or MPI I/O.
Most applications use collectives as well as point-to-point
communication, so if you disable fault tolerance for collectives,
you disable it for most applications.
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov