[mpi3-ft] Notes from 15 Jan. 2008 Meeting
Josh Hursey
jjhursey at open-mpi.org
Tue Jan 22 08:46:30 CST 2008
Last week Rich asked me to take some notes from the MPI Forum
discussion on dynamic process/fault tolerance. Those notes are enclosed.
--Josh
-------------------
MPI Forum 3.0 Dynamic Process/Fault Tolerance Working Group
Brief Notes from Jan. 15, 2008 Meeting - Chicago, IL
Chapter Coordinator: Rich Graham
Notes taken by Josh Hursey
* May have to describe the state of a function across a failure (and
possible recovery) for all MPI functions and failure scenarios
* May want to consider proposing a suggestion for implementation:
- i.e.: An MPI implementation does not need to do X, but if it does
decide to support X then we recommend the following as a way of
implementing it.
* Provide support for some class (or subset) of recovery mechanisms
that could be built on top of an implementation of the MPI 3.0
standard.
* Replication techniques should be added to the (presented) list of
possible recovery techniques to consider supporting.
- Question: Does considering replication techniques undermine the
HPC target community of the MPI standard?
- The MPI interface should allow for this, maybe as an opt-in
functionality (a sketch follows this item).
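
A minimal sketch of what opt-in replication might look like; the info
key "mpix_replicas" and the worker program name are assumptions made
here for illustration, not standardized names. Only the
dynamic-process calls (MPI_Comm_spawn, MPI_Info_*) are real MPI-2
interfaces:

#include <mpi.h>

void spawn_replicated_workers(void)
{
    MPI_Comm children;
    MPI_Info info;

    MPI_Info_create(&info);
    /* Assumed key, not standard: ask for 2 synchronized copies of
       each worker so one copy can fail without losing the job. */
    MPI_Info_set(info, "mpix_replicas", "2");

    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, info, 0,
                   MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
}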
* An interface to easily piggyback data on point-to-point (and maybe
collective) messages would help support some high-level checkpoint/
restart techniques.
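
A minimal sketch of the piggyback idea; MPIX_Piggyback_attach and
MPIX_Piggyback_set_recv_cb are hypothetical names invented here. The
point is that a checkpoint protocol could stamp every outgoing
point-to-point message with the sender's checkpoint epoch without
changing the application's send/recv calls:

#include <mpi.h>

/* Assumed extensions: append 'size' bytes at 'buf' to each send on
 * 'comm', and invoke 'cb' with the piggybacked bytes on receive. */
int MPIX_Piggyback_attach(MPI_Comm comm, const void *buf, int size);
int MPIX_Piggyback_set_recv_cb(MPI_Comm comm,
                               void (*cb)(const void *buf, int size));

static int my_epoch = 0;

static void on_piggyback(const void *buf, int size)
{
    int sender_epoch = *(const int *)buf;
    if (sender_epoch > my_epoch) {
        /* Sender already checkpointed: checkpoint locally before
           delivering, so the recovery line stays consistent. */
    }
    (void)size;
}

void enable_piggyback_protocol(MPI_Comm comm)
{
    MPIX_Piggyback_attach(comm, &my_epoch, sizeof my_epoch);
    MPIX_Piggyback_set_recv_cb(comm, on_piggyback);
}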
* Should we expose to the application the fault recovery mechanism
used? (e.g., network path failover)
- As a notification once complete?
- Or as a request to authorize the recovery mechanism before it
executes?
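
A minimal sketch making the two options concrete; the registration
call and the event kinds below are hypothetical names, not proposed
text:

#include <mpi.h>

typedef enum {
    MPIX_RECOVERY_PATH_FAILOVER,   /* e.g., network path failover   */
    MPIX_RECOVERY_PROC_RESTART     /* e.g., restart a failed process */
} MPIX_Recovery_kind;

/* ask_first = 0: notify the callback once recovery is complete.
 * ask_first = 1: call it beforehand; a zero return vetoes it. */
int MPIX_Recovery_cb_register(MPI_Comm comm, int ask_first,
                              int (*cb)(MPIX_Recovery_kind kind));

static int on_recovery(MPIX_Recovery_kind kind)
{
    /* Authorize cheap, transparent mechanisms; veto the rest. */
    return kind == MPIX_RECOVERY_PATH_FAILOVER;
}

void install_recovery_policy(MPI_Comm comm)
{
    MPIX_Recovery_cb_register(comm, /* ask_first = */ 1, on_recovery);
}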
* Some users have expressed an interest in having an explicit
checkpoint function to identify 'good' times in the code execution to
checkpoint, ideally when the checkpoint state is minimal.
* Can we ask the MPI implementation to save some internal state
around a user-level checkpoint operation so that, on recovery, MPI
objects such as datatypes and communicators are automatically
recreated for the user before the application resumes execution? (A
sketch follows below.)
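
A minimal sketch combining the two checkpoint items above;
MPIX_Checkpoint is a hypothetical call, assumed here to be collective
and to save the library's internal objects (communicators, datatypes)
alongside the user checkpoint so both are rebuilt automatically on
recovery:

#include <mpi.h>
#include <stdio.h>

int MPIX_Checkpoint(MPI_Comm comm);   /* hypothetical, assumed collective */

void solver_loop(MPI_Comm comm, int nsteps)
{
    for (int step = 0; step < nsteps; ++step) {
        /* ... compute, exchange halos, etc. ... */

        if (step % 100 == 0) {
            /* Between iterations the live state is minimal: a 'good'
               time for the user to name as a checkpoint location. */
            if (MPIX_Checkpoint(comm) != MPI_SUCCESS)
                fprintf(stderr, "checkpoint at step %d failed\n", step);
        }
    }
}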
* Areas of MPI possibly affected by fault tolerance:
- Communicators such as MPI_COMM_WORLD may change size.
- MPI_Init/MPI_Finalize may need to be called multiple times to
support some recovery techniques (see the restart-loop sketch after
this list).
- MPI I/O needs to be looked at.
- MPI topologies may (or will) change when a process fails and is
possibly recovered.
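
A minimal sketch of why multiple MPI_Init/MPI_Finalize calls matter:
the standard currently forbids calling MPI_Init more than once per
process, so the restart-in-place loop below is illegal today and
would require the relaxed rule discussed above. run_application() is
a hypothetical driver, not a real routine:

#include <mpi.h>

int run_application(void);   /* returns nonzero when the job is done */

int main(int argc, char **argv)
{
    int done = 0;
    while (!done) {
        MPI_Init(&argc, &argv);  /* re-init after recovery: not legal today */
        done = run_application();
        MPI_Finalize();
    }
    return 0;
}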
* Collective communication:
- For some collectives a process can be inside the collective or on
either side of it; how do the semantics of the collective operation
change in response to a failure/recovery operation?
- Bcast, for example, may allow some processes to exit early,
providing some performance benefit. A fault tolerance semantic for
Bcast may require a global synchronization at the end of the Bcast to
ensure a two-phase commit of the affected buffers.
- We could consider loosening the synchronization constraints so
that the user could choose, via an MPI API, the level of confidence
they require before a process may exit a collective call (see the
sketch after this list).
- Maybe introduce the notion of epochs that can be started and
ended to help in fault isolation.
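
A minimal sketch of user-selectable exit confidence for collectives;
MPIX_Coll_set_exit_confidence and both level names are hypothetical
placeholders:

#include <mpi.h>

#define MPIX_EXIT_LOCAL  0  /* return once the local part completes */
#define MPIX_EXIT_GLOBAL 1  /* return only after a global sync that
                               two-phase commits the affected buffers */

int MPIX_Coll_set_exit_confidence(MPI_Comm comm, int level); /* assumed */

void broadcast_critical_config(MPI_Comm comm, double *cfg, int n)
{
    /* Pay for the stronger semantic only where a failure mid-Bcast
       would leave ranks with inconsistent copies of cfg. */
    MPIX_Coll_set_exit_confidence(comm, MPIX_EXIT_GLOBAL);
    MPI_Bcast(cfg, n, MPI_DOUBLE, 0, comm);
    MPIX_Coll_set_exit_confidence(comm, MPIX_EXIT_LOCAL);
}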
* Use case: How can a user choose to enable or disable fault tolerance
features?
- Mostly implementation specific. (link time, compile time, ...)
* Could fault tolerance only apply to a subset of the MPI API for
simplicity? For example, exclude collective communications or MPI I/O.
* It was stated that we should make clear that the fault tolerance
MPI specification may not be able to guarantee the stability of some
data segments or operations affected by a fault, but that an MPI
implementation will provide an environment stable enough to allow for
post-failure recovery.
- This restriction should be explicitly stated in the standard text
presented.
* In a failure scenario, is the fault mode/error reported locally or
globally with respect to the communicator?
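
For reference, standard MPI-2 error handlers can already surface a
failure locally; whether every rank on the communicator would observe
the same error (global reporting) or only the ranks that touched the
failed peer (local reporting) is exactly the open question:

#include <mpi.h>
#include <stdio.h>

void try_send(MPI_Comm comm, int dest, int value)
{
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN); /* don't abort */

    int rc = MPI_Send(&value, 1, MPI_INT, dest, 0, comm);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "locally reported error: %s\n", msg);
        /* Under purely local reporting, other ranks may never see it. */
    }
}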
* Most people agree that this working group should move forward with
API changes that help support, but perhaps do not themselves provide,
fault tolerance.
* A mailing list will be created for this group, and will be posted to
the mpi-21 listserv.