[mpi3-ft] Notes from 15 Jan. 2008 Meeting

Richard Graham rlgraham at ornl.gov
Thu Jan 31 10:01:32 CST 2008


Josh,
  Can't get this through.  Jeff will post this on the web as soon as we have
things set up, which should be next week.  Does not help now ...

Rich


On 1/30/08 11:22 AM, "Josh Hursey" <jjhursey at open-mpi.org> wrote:

> Rich,
> 
> Can you send around a pdf of the slides you presented during the MPI
> forum FT working group meeting? These slides might be helpful when
> looking through these notes.
> 
> Thanks,
> Josh
> 
> On Jan 22, 2008, at 1:32 PM, Greg Bronevetsky wrote:
> 
>> 
>>> * Replication techniques should be added to the (presented) list of
>>> possible recovery techniques to consider supporting.
>>>   - Question: Does considering replication techniques undermine the
>>> HPC target community of the MPI standard.
>>>   - MPI interface should allow for this, maybe as an opt'in
>>> functionality.
>> Replication of MPI tasks? Is this really necessary? It seems to me
>> that if you're going to go for something this expensive, it would be
>> simple to add replication as a layer above MPI.
>> 
>>> * An interface to easily piggyback data on point-to-point (and maybe
>>> collective) messages would help support some high level checkpoint/
>>> restart techniques.
>> 
>> We're working on putting together a piggybacking proposal.
>> Checkpointing protocols are one of the use-cases.
>> 
>>> * Should we expose to the application the fault recovery mechanism
>>> used? (i.e., Network path failover)
>>>   - As a notification once complete?
>>>   - Or as a request to authorize the recovery mechanism before it
>>> executes?
>> We can only do that if we have a list of well-defined recovery
>> mechanisms that we can present in the spec. I doubt that anybody is
>> going to commit to such a list, so the only thing that we may be able
>> to get away with are classes of recovery mechanisms: network
>> recovery, node recovery, etc. I'm not sure what that user would be
>> able to do with such information. They probably mostly want to know
>> how long it will take and what MPI level objects will not be
>> recovered (ex: the state of the dead process).
>> 
>>> * Some users have expressed an interest in having an explicit
>>> checkpoint function to identify 'good' times in the code execution to
>>> checkpoint. Ideally when the checkpoint state is minimal.
>> I don't think that the MPI forum should be in the business of
>> defining a checkpointing API. Checkpointing is a whole-application
>> operation, whereas MPI is just a library. Can you imagine the chaos
>> that would ensue if every library decided on its own checkpointing
>> API? The user would need to call each library's own relevant calls
>> any time they wanted to do anything checkpoint-related. To keep
>> things sane the user must be provided by the whole-application
>> checkpointer with a single API that they will use for their
>> checkpointing needs. There is no reason to make them issue
>> MPI-specific checkpointing calls unless we can come up with some
>> MPI-specific optimizations that users must be directly involved in.
>> I'm skeptical about this possibility.
>> 
>> We could of course define a couple of functions with the explicit
>> intent that they should be used by the whole-application
>> checkpointer, rather than the user. However, most prior work on MPI
>> checkpointing has either modified MPI itself or has worked without
>> any modifications to the MPI spec, meaning that there is little
>> motivation to create such an API.
>> 
>>> * Can we ask the MPI to save some internal state around a user level
>>> checkpoint operation such that on recovery MPI objects such as
>>> datatypes or communicators are automatically recreated for the user
>>> before the application resumes execution.
>> This can be efficiently done by a layer above MPI, so I'm not sure
>> that such functionality needs to be in the spec.
>> 
>>> * Collective communication:
>>>   - For some collectives a process can be in the collective or on
>>> either side of it, how do the semantics of the collective operation
>>> change in response to a failure/recovery operation.
>>>   - Bcast, for example, may allow some processes to exit early
>>> providing some performance benefits. A fault tolerance semantic for
>>> Bcast may require a global synchronization at the end of the Bcast to
>>> ensure a two-phase commit of the effected buffers.
>>>   - We could consider loosening the synchronization constraints such
>>> that the user could choose, via an MPI API, the level of confidence
>>> they require for a process to exit a collective call.
>>>   - Maybe introduce the notion of epochs that can be started and
>>> ended to help in fault isolation.
>> Here's another option: if a collective fails, each receiver of the
>> collective either receives the data or aborts (i.e. no hangs or
>> invalid data). If the user is employing other fault tolerance
>> techniques such as checkpointing or message logging, this
>> functionality should be sufficient.
>> 
>>> * Could fault tolerance only apply to a subset of the MPI API for
>>> simplicity? For example, exclude collective communications or MPI I/
>>> O.
>> Most applications use at least point-to-points or collectives, so if
>> you disable fault tolerance for collectives, you disable it for most
>> applications.
>> 
>> Greg Bronevetsky
>> Post-Doctoral Researcher
>> 1028 Building 451
>> Lawrence Livermore National Lab
>> (925) 424-5756
>> bronevetsky1 at llnl.gov
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft




More information about the mpiwg-ft mailing list