[mpi3-ft] Notes from 15 Jan. 2008 Meeting
Josh Hursey
jjhursey at open-mpi.org
Tue Jan 22 08:46:30 CST 2008
Last week Rich asked me to take some notes from the MPI Forum
discussion on dynamic process/fault tolerance. Those notes are enclosed.
--Josh
-------------------
MPI Forum 3.0 Dynamic Process/Fault Tolerance Working Group
Brief Notes from Jan. 15, 2008 Meeting - Chicago, IL
Chapter Coordinator: Rich Graham
Notes taken by Josh Hursey
* May have to describe the state of a function across a failure (and
possible recovery) for all MPI functions and failure scenarios
* May want to consider proposing a suggestion for implementation:
- i.e.: An MPI implementation does not need to do X, but if it does
decide to support X then we recommend the following as a way of
implementing it.
* Provide support for some class (or subset) of recovery mechanisms
that could be built on top of an implementation of the MPI 3.0
standard.
* Replication techniques should be added to the (presented) list of
possible recovery techniques to consider supporting.
- Question: Does considering replication techniques undermine the
HPC target community of the MPI standard?
- The MPI interface should allow for this, maybe as an opt-in
functionality (a sketch follows this item).
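
A minimal sketch of what opt-in replication might look like; the info
key "mpix_replicas" and the worker program name are assumptions made
here for illustration, not standardized names. Only the
dynamic-process calls (MPI_Comm_spawn, MPI_Info_*) are real MPI-2
interfaces:

#include <mpi.h>

void spawn_replicated_workers(void)
{
    MPI_Comm children;
    MPI_Info info;

    MPI_Info_create(&info);
    /* Assumed key, not standard: ask for 2 synchronized copies of
       each worker so one copy can fail without losing the job. */
    MPI_Info_set(info, "mpix_replicas", "2");

    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, info, 0,
                   MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
}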
* An interface to easily piggyback data on point-to-point (and maybe
collective) messages would help support some high-level checkpoint/
restart techniques.
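
A minimal sketch of the piggyback idea; MPIX_Piggyback_attach and
MPIX_Piggyback_set_recv_cb are hypothetical names invented here. The
point is that a checkpoint protocol could stamp every outgoing
point-to-point message with the sender's checkpoint epoch without
changing the application's send/recv calls:

#include <mpi.h>

/* Assumed extensions: append 'size' bytes at 'buf' to each send on
 * 'comm', and invoke 'cb' with the piggybacked bytes on receive. */
int MPIX_Piggyback_attach(MPI_Comm comm, const void *buf, int size);
int MPIX_Piggyback_set_recv_cb(MPI_Comm comm,
                               void (*cb)(const void *buf, int size));

static int my_epoch = 0;

static void on_piggyback(const void *buf, int size)
{
    int sender_epoch = *(const int *)buf;
    if (sender_epoch > my_epoch) {
        /* Sender already checkpointed: checkpoint locally before
           delivering, so the recovery line stays consistent. */
    }
    (void)size;
}

void enable_piggyback_protocol(MPI_Comm comm)
{
    MPIX_Piggyback_attach(comm, &my_epoch, sizeof my_epoch);
    MPIX_Piggyback_set_recv_cb(comm, on_piggyback);
}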
* Should we expose to the application the fault recovery mechanism
used? (e.g., network path failover)
- As a notification once complete?
- Or as a request to authorize the recovery mechanism before it
executes?
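
A minimal sketch making the two options concrete; the registration
call and the event kinds below are hypothetical names, not proposed
text:

#include <mpi.h>

typedef enum {
    MPIX_RECOVERY_PATH_FAILOVER,   /* e.g., network path failover   */
    MPIX_RECOVERY_PROC_RESTART     /* e.g., restart a failed process */
} MPIX_Recovery_kind;

/* ask_first = 0: notify the callback once recovery is complete.
 * ask_first = 1: call it beforehand; a zero return vetoes it. */
int MPIX_Recovery_cb_register(MPI_Comm comm, int ask_first,
                              int (*cb)(MPIX_Recovery_kind kind));

static int on_recovery(MPIX_Recovery_kind kind)
{
    /* Authorize cheap, transparent mechanisms; veto the rest. */
    return kind == MPIX_RECOVERY_PATH_FAILOVER;
}

void install_recovery_policy(MPI_Comm comm)
{
    MPIX_Recovery_cb_register(comm, /* ask_first = */ 1, on_recovery);
}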
* Some users have expressed an interest in having an explicit
checkpoint function to identify 'good' times in the code execution to
checkpoint, ideally when the checkpoint state is minimal.
* Can we ask the MPI implementation to save some internal state
around a user-level checkpoint operation so that, on recovery, MPI
objects such as datatypes and communicators are automatically
recreated for the user before the application resumes execution? (A
sketch follows below.)
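
A minimal sketch combining the two checkpoint items above;
MPIX_Checkpoint is a hypothetical call, assumed here to be collective
and to save the library's internal objects (communicators, datatypes)
alongside the user checkpoint so both are rebuilt automatically on
recovery:

#include <mpi.h>
#include <stdio.h>

int MPIX_Checkpoint(MPI_Comm comm);   /* hypothetical, assumed collective */

void solver_loop(MPI_Comm comm, int nsteps)
{
    for (int step = 0; step < nsteps; ++step) {
        /* ... compute, exchange halos, etc. ... */

        if (step % 100 == 0) {
            /* Between iterations the live state is minimal: a 'good'
               time for the user to name as a checkpoint location. */
            if (MPIX_Checkpoint(comm) != MPI_SUCCESS)
                fprintf(stderr, "checkpoint at step %d failed\n", step);
        }
    }
}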
* Areas of MPI possibly affected by fault tolerance:
- Communicators such as MPI_COMM_WORLD may change size.
- MPI_Init/MPI_Finalize may need to be called multiple times to
support some recovery techniques (see the restart-loop sketch after
this list).
- MPI I/O needs to be looked at.
- MPI topologies may (or will) change when a process fails and is
possibly recovered.
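
A minimal sketch of why multiple MPI_Init/MPI_Finalize calls matter:
the standard currently forbids calling MPI_Init more than once per
process, so the restart-in-place loop below is illegal today and
would require the relaxed rule discussed above. run_application() is
a hypothetical driver, not a real routine:

#include <mpi.h>

int run_application(void);   /* returns nonzero when the job is done */

int main(int argc, char **argv)
{
    int done = 0;
    while (!done) {
        MPI_Init(&argc, &argv);  /* re-init after recovery: not legal today */
        done = run_application();
        MPI_Finalize();
    }
    return 0;
}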
* Collective communication:
- For some collectives a process can be inside the collective or on
either side of it; how do the semantics of the collective operation
change in response to a failure/recovery operation?
- Bcast, for example, may allow some processes to exit early,
providing some performance benefit. A fault tolerance semantic for
Bcast may require a global synchronization at the end of the Bcast to
ensure a two-phase commit of the affected buffers.
- We could consider loosening the synchronization constraints so
that the user could choose, via an MPI API, the level of confidence
they require before a process may exit a collective call (see the
sketch after this list).
- Maybe introduce the notion of epochs that can be started and
ended to help in fault isolation.
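
A minimal sketch of user-selectable exit confidence for collectives;
MPIX_Coll_set_exit_confidence and both level names are hypothetical
placeholders:

#include <mpi.h>

#define MPIX_EXIT_LOCAL  0  /* return once the local part completes */
#define MPIX_EXIT_GLOBAL 1  /* return only after a global sync that
                               two-phase commits the affected buffers */

int MPIX_Coll_set_exit_confidence(MPI_Comm comm, int level); /* assumed */

void broadcast_critical_config(MPI_Comm comm, double *cfg, int n)
{
    /* Pay for the stronger semantic only where a failure mid-Bcast
       would leave ranks with inconsistent copies of cfg. */
    MPIX_Coll_set_exit_confidence(comm, MPIX_EXIT_GLOBAL);
    MPI_Bcast(cfg, n, MPI_DOUBLE, 0, comm);
    MPIX_Coll_set_exit_confidence(comm, MPIX_EXIT_LOCAL);
}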
* Use case: How can a user choose to enable or disable fault tolerance
features?
- Mostly implementation specific. (link time, compile time, ...)
* Could fault tolerance only apply to a subset of the MPI API for
simplicity? For example, exclude collective communications or MPI I/O.
* It was stated that we should make clear that the fault tolerance
MPI specification may not be able to guarantee the stability of some
data segments or operations affected by a fault, but that an MPI
implementation will provide an environment stable enough to allow for
post-failure recovery.
- This restriction should be explicitly stated in the standard text
presented.
* In a failure scenario, is the fault mode/error reported locally or
globally with respect to the communicator?
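
For reference, standard MPI-2 error handlers can already surface a
failure locally; whether every rank on the communicator would observe
the same error (global reporting) or only the ranks that touched the
failed peer (local reporting) is exactly the open question:

#include <mpi.h>
#include <stdio.h>

void try_send(MPI_Comm comm, int dest, int value)
{
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN); /* don't abort */

    int rc = MPI_Send(&value, 1, MPI_INT, dest, 0, comm);
    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "locally reported error: %s\n", msg);
        /* Under purely local reporting, other ranks may never see it. */
    }
}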
* Most people agree that this working group should move forward with
API changes that help support, but perhaps do not themselves provide,
fault tolerance.
* A mailing list will be created for this group, and will be posted to
the mpi-21 listserv.