[mpi3-ft] Notes from 15 Jan. 2008 Meeting

Josh Hursey jjhursey at open-mpi.org
Wed Jan 30 10:22:21 CST 2008


Can you send around a PDF of the slides you presented during the MPI
Forum FT working group meeting? They might be helpful when looking
through these notes.


On Jan 22, 2008, at 1:32 PM, Greg Bronevetsky wrote:

>> * Replication techniques should be added to the (presented) list of
>> possible recovery techniques to consider supporting.
>>   - Question: Does considering replication techniques undermine the
>> HPC target community of the MPI standard?
>>   - The MPI interface should allow for this, maybe as opt-in
>> functionality.
> Replication of MPI tasks? Is this really necessary? It seems to me
> that if you're going to go for something this expensive, it would be
> simple to add replication as a layer above MPI.
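
To make the layering argument concrete, here is a minimal sketch of
replicated sends built above MPI with the PMPI profiling interface.
The rank+1 pairing and the missing receive-side/failure-detection
logic are simplifications for illustration only:

    /* Sketch: mirror every send to a replica rank via the profiling
     * interface. Assumes ranks are paired (even = primary, odd =
     * replica); all real bookkeeping is omitted. */
    #include <mpi.h>

    static int replica_of(int rank)
    {
        return rank + 1;  /* illustrative pairing assumption */
    }

    int MPI_Send(void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        if (rc == MPI_SUCCESS)
            rc = PMPI_Send(buf, count, type, replica_of(dest),
                           tag, comm);
        return rc;
    }
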
>> * An interface to easily piggyback data on point-to-point (and maybe
>> collective) messages would help support some high level checkpoint/
>> restart techniques.
> We're working on putting together a piggybacking proposal.
> Checkpointing protocols are one of the use-cases.
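
For reference, the usual way to piggyback above the current spec is to
pack the extra data in front of the user's payload and send one
message. A rough sender-side sketch, where the integer 'clock' stands
in for whatever the protocol actually piggybacks:

    /* Sketch: piggyback one int ahead of the payload with MPI_Pack;
     * the receiver unpacks in the same order. Errors not handled. */
    #include <mpi.h>
    #include <stdlib.h>

    int pb_send(void *buf, int count, MPI_Datatype type,
                int dest, int tag, MPI_Comm comm, int clock)
    {
        int pos = 0, hsize, dsize, rc;
        MPI_Pack_size(1, MPI_INT, comm, &hsize);
        MPI_Pack_size(count, type, comm, &dsize);
        char *tmp = malloc(hsize + dsize);
        MPI_Pack(&clock, 1, MPI_INT, tmp, hsize + dsize, &pos, comm);
        MPI_Pack(buf, count, type, tmp, hsize + dsize, &pos, comm);
        rc = MPI_Send(tmp, pos, MPI_PACKED, dest, tag, comm);
        free(tmp);
        return rc;
    }
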
>> * Should we expose to the application the fault recovery mechanism
>> used? (e.g., network path failover)
>>   - As a notification once complete?
>>   - Or as a request to authorize the recovery mechanism before it
>> executes?
> We can only do that if we have a list of well-defined recovery
> mechanisms that we can present in the spec. I doubt that anybody is
> going to commit to such a list, so the only thing that we may be able
> to get away with are classes of recovery mechanisms: network
> recovery, node recovery, etc. I'm not sure what the user would be
> able to do with such information. They probably mostly want to know
> how long it will take and which MPI-level objects will not be
> recovered (e.g., the state of the dead process).
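
The closest hook we have today is the MPI-2 error handler; here is a
sketch of how class-level notification could ride on it. Note the
premise that an implementation would invoke the handler with a
non-fatal "recovered" class is an assumption, not current behavior:

    /* Sketch: receive per-communicator event notifications through
     * the existing error handler mechanism, reduced to error classes. */
    #include <mpi.h>
    #include <stdio.h>

    static void on_comm_event(MPI_Comm *comm, int *code, ...)
    {
        int eclass;
        MPI_Error_class(*code, &eclass);
        fprintf(stderr, "notification: class %d on communicator\n",
                eclass);
    }

    int main(int argc, char **argv)
    {
        MPI_Errhandler eh;
        MPI_Init(&argc, &argv);
        MPI_Comm_create_errhandler(on_comm_event, &eh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
        /* ... application ... */
        MPI_Errhandler_free(&eh);
        MPI_Finalize();
        return 0;
    }
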
>> * Some users have expressed an interest in having an explicit
>> checkpoint function to identify 'good' points in the code execution
>> to checkpoint, ideally where the checkpoint state is minimal.
> I don't think that the MPI forum should be in the business of
> defining a checkpointing API. Checkpointing is a whole-application
> operation, whereas MPI is just a library. Can you imagine the chaos
> that would ensue if every library decided on its own checkpointing
> API? The user would need to invoke each library's own relevant
> calls any time they wanted to do anything checkpoint-related. To
> keep things sane, the whole-application checkpointer must provide
> the user with a single API that covers all their checkpointing
> needs. There is no reason to make them issue
> MPI-specific checkpointing calls unless we can come up with some
> MPI-specific optimizations that users must be directly involved in.
> I'm skeptical about this possibility.
> We could of course define a couple of functions with the explicit
> intent that they should be used by the whole-application
> checkpointer, rather than the user. However, most prior work on MPI
> checkpointing has either modified MPI itself or has worked without
> any modifications to the MPI spec, meaning that there is little
> motivation to create such an API.
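
For what it's worth, the single-API model described here is easy to
picture: libraries (MPI included) register quiesce/resume hooks with
the whole-application checkpointer, and the user sees exactly one
call. Everything below is hypothetical, sketched only to illustrate
the argument:

    /* Sketch: one checkpoint call owned by the whole-application
     * checkpointer; libraries register hooks rather than exporting
     * their own checkpoint APIs. All names are made up. */
    #include <stdio.h>

    #define MAX_HOOKS 16
    typedef void (*ckpt_hook)(void);
    static ckpt_hook pre_hooks[MAX_HOOKS], post_hooks[MAX_HOOKS];
    static int nhooks;

    void ckpt_register(ckpt_hook pre, ckpt_hook post)
    {
        pre_hooks[nhooks]  = pre;
        post_hooks[nhooks] = post;
        nhooks++;
    }

    void ckpt_checkpoint(const char *dir)
    {
        int i;
        for (i = 0; i < nhooks; i++) pre_hooks[i]();  /* quiesce   */
        printf("writing checkpoint to %s\n", dir);    /* save data */
        for (i = 0; i < nhooks; i++) post_hooks[i](); /* resume    */
    }
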
>> * Can we ask the MPI implementation to save some internal state
>> around a user-level checkpoint operation such that, on recovery,
>> MPI objects such as datatypes and communicators are automatically
>> recreated for the user before the application resumes execution?
> This can be efficiently done by a layer above MPI, so I'm not sure
> that such functionality needs to be in the spec.
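
A sketch of what such a layer looks like: intercept object-creation
calls through PMPI, log their arguments, and replay the log after
restart. Only MPI_Comm_split is shown, the log format is made up, and
identifying the parent communicator is part of the bookkeeping
omitted here:

    /* Sketch: record communicator-creation arguments so a replay
     * pass can reissue the same calls, in order, after restart. */
    #include <mpi.h>
    #include <stdio.h>

    static FILE *obj_log;

    int MPI_Comm_split(MPI_Comm comm, int color, int key,
                       MPI_Comm *newcomm)
    {
        if (obj_log == NULL)
            obj_log = fopen("mpi_objects.log", "a");
        fprintf(obj_log, "split %d %d\n", color, key);
        return PMPI_Comm_split(comm, color, key, newcomm);
    }
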
>> * Collective communication:
>>   - For some collectives, a process can be inside the collective or
>> on either side of it; how do the semantics of the collective
>> operation change in response to a failure/recovery operation?
>>   - Bcast, for example, may allow some processes to exit early,
>> providing some performance benefit. A fault-tolerant semantic for
>> Bcast may require a global synchronization at the end of the Bcast
>> to ensure a two-phase commit of the affected buffers.
>>   - We could consider loosening the synchronization constraints such
>> that the user could choose, via an MPI API, the level of confidence
>> they require for a process to exit a collective call.
>>   - Maybe introduce the notion of epochs that can be started and
>> ended to help in fault isolation.
> Here's another option: if a collective fails, each receiver of the
> collective either receives the data or aborts (i.e. no hangs or
> invalid data). If the user is employing other fault tolerance
> techniques such as checkpointing or message logging, this
> functionality should be sufficient.
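
A "receive the data or abort" Bcast is cheap to prototype above MPI
today. A rough sketch, assuming the communicator's error handler is
set to MPI_ERRORS_RETURN so failures surface as return codes:

    /* Sketch: two-phase broadcast. The Allreduce is the second
     * phase: no rank exits until every rank reports success, and
     * any failure becomes an abort rather than a hang or bad data. */
    #include <mpi.h>

    int bcast_confirmed(void *buf, int count, MPI_Datatype type,
                        int root, MPI_Comm comm)
    {
        int ok, all_ok;
        ok = (MPI_Bcast(buf, count, type, root, comm) == MPI_SUCCESS);
        MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_LAND, comm);
        if (!all_ok)
            MPI_Abort(comm, 1);  /* receive the data or abort */
        return MPI_SUCCESS;
    }
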
>> * Could fault tolerance apply to only a subset of the MPI API, for
>> simplicity? For example, exclude collective communications or
>> MPI I/O.
> Most applications use at least point-to-points or collectives, so if
> you disable fault tolerance for collectives, you disable it for most
> applications.
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov
