[mpi3-ft] Notes from 15 Jan. 2008 Meeting

Richard Graham rlgraham at ornl.gov
Thu Jan 31 09:54:02 CST 2008


On 1/30/08 11:22 AM, "Josh Hursey" <jjhursey at open-mpi.org> wrote:

> Rich,
> Can you send around a pdf of the slides you presented during the MPI
> forum FT working group meeting? These slides might be helpful when
> looking through these notes.
> Thanks,
> Josh
> On Jan 22, 2008, at 1:32 PM, Greg Bronevetsky wrote:
>>> * Replication techniques should be added to the (presented) list of
>>> possible recovery techniques to consider supporting.
>>>   - Question: Does considering replication techniques undermine the
>>> HPC target community of the MPI standard?
>>>   - The MPI interface should allow for this, maybe as opt-in
>>> functionality.
>> Replication of MPI tasks? Is this really necessary? It seems to me
>> that if you're going to go for something this expensive, it would be
>> simple to add replication as a layer above MPI.
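To make the "layer above MPI" suggestion concrete, here is a toy Python sketch (not MPI code; all names are hypothetical) of the idea: each logical rank is backed by k physical replicas, and a send to a logical rank fans out to every live replica, so one replica failure is masked.

```python
# Toy sketch of replication layered above a message-passing layer.
# Names (ReplicatedLayer, physical_ranks) are illustrative only and
# are not part of any MPI API.

REPLICATION = 2  # physical replicas per logical rank


def physical_ranks(logical_rank, replication=REPLICATION):
    """Map one logical rank to its set of physical replica ranks."""
    return [logical_rank * replication + r for r in range(replication)]


class ReplicatedLayer:
    def __init__(self, num_logical, replication=REPLICATION):
        self.replication = replication
        self.failed = set()  # physical ranks known to have failed
        self.mailboxes = {p: [] for lr in range(num_logical)
                          for p in physical_ranks(lr, replication)}

    def send(self, logical_dest, msg):
        """Fan a message out to every live replica of the destination."""
        delivered = 0
        for p in physical_ranks(logical_dest, self.replication):
            if p not in self.failed:
                self.mailboxes[p].append(msg)
                delivered += 1
        if delivered == 0:
            raise RuntimeError("all replicas of logical rank %d failed"
                               % logical_dest)
        return delivered


layer = ReplicatedLayer(num_logical=2)
layer.failed.add(2)                # one replica of logical rank 1 dies
assert layer.send(1, "data") == 1  # the surviving replica still gets it
```

The point of the sketch is that nothing here needs changes to the MPI specification: the mapping from logical to physical ranks and the fan-out can live entirely in a library above MPI, which is the expense/placement trade-off being questioned above.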
>>> * An interface to easily piggyback data on point-to-point (and maybe
>>> collective) messages would help support some high level checkpoint/
>>> restart techniques.
>> We're working on putting together a piggybacking proposal.
>> Checkpointing protocols are one of the use-cases.
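As a rough illustration of what such a piggybacking mechanism would do (this is a standalone sketch, not the proposal itself, and the function names are made up), the idea is to prepend a small header carrying the piggyback bytes, e.g. a checkpoint epoch number, to each user payload at send time and strip it off at receive time:

```python
# Minimal sketch of message piggybacking: a fixed-size header carrying
# piggyback data is packed in front of the user payload on the wire.
# A real proposal would hook this into the MPI send/receive path;
# attach_piggyback/detach_piggyback are hypothetical names.
import struct

HEADER = struct.Struct("!I")  # length of the piggyback section


def attach_piggyback(payload: bytes, piggyback: bytes) -> bytes:
    """Pack piggyback data in front of the user payload."""
    return HEADER.pack(len(piggyback)) + piggyback + payload


def detach_piggyback(wire: bytes) -> tuple:
    """Split a received buffer back into (payload, piggyback)."""
    (n,) = HEADER.unpack_from(wire)
    start = HEADER.size
    return wire[start + n:], wire[start:start + n]


# A checkpointing protocol might piggyback its current epoch number:
wire = attach_piggyback(b"user data", struct.pack("!Q", 7))
payload, pb = detach_piggyback(wire)
assert payload == b"user data"
assert struct.unpack("!Q", pb)[0] == 7
```

An interface like this matters because, without library support, every checkpointing protocol has to reimplement this wrapping itself, including for collectives, where the packing point is much harder to reach from above MPI.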
>>> * Should we expose to the application the fault recovery mechanism
>>> used (e.g., network path failover)?
>>>   - As a notification once complete?
>>>   - Or as a request to authorize the recovery mechanism before it
>>> executes?
>> We can only do that if we have a list of well-defined recovery
>> mechanisms that we can present in the spec. I doubt that anybody is
>> going to commit to such a list, so the only thing that we may be able
>> to get away with are classes of recovery mechanisms: network
>> recovery, node recovery, etc. I'm not sure what the user would be
>> able to do with such information. They probably mostly want to know
>> how long recovery will take and which MPI-level objects will not be
>> recovered (e.g., the state of the dead process).
>>> * Some users have expressed an interest in having an explicit
>>> checkpoint function to identify 'good' times in the code execution to
>>> checkpoint. Ideally when the checkpoint state is minimal.
>> I don't think that the MPI forum should be in the business of
>> defining a checkpointing API. Checkpointing is a whole-application
>> operation, whereas MPI is just a library. Can you imagine the chaos
>> that would ensue if every library decided on its own checkpointing
>> API? The user would need to call each library's own relevant calls
>> any time they wanted to do anything checkpoint-related. To keep
>> things sane, the whole-application checkpointer must provide the
>> user with a single API for all of their checkpointing needs. There
>> is no reason to make them issue
>> MPI-specific checkpointing calls unless we can come up with some
>> MPI-specific optimizations that users must be directly involved in.
>> I'm skeptical about this possibility.
>> We could of course define a couple of functions with the explicit
>> intent that they should be used by the whole-application
>> checkpointer, rather than the user. However, most prior work on MPI
>> checkpointing has either modified MPI itself or has worked without
>> any modifications to the MPI spec, meaning that there is little
>> motivation to create such an API.
>>> * Can we ask the MPI implementation to save some internal state
>>> around a user-level checkpoint operation such that, on recovery,
>>> MPI objects such as datatypes or communicators are automatically
>>> recreated for the user before the application resumes execution?
>> This can be efficiently done by a layer above MPI, so I'm not sure
>> that such functionality needs to be in the spec.
>>> * Collective communication:
>>>   - For some collectives a process can be in the collective or on
>>> either side of it; how do the semantics of the collective operation
>>> change in response to a failure/recovery operation?
>>>   - Bcast, for example, may allow some processes to exit early,
>>> providing some performance benefit. A fault tolerance semantic for
>>> Bcast may require a global synchronization at the end of the Bcast to
>>> ensure a two-phase commit of the affected buffers.
>>>   - We could consider loosening the synchronization constraints such
>>> that the user could choose, via an MPI API, the level of confidence
>>> they require for a process to exit a collective call.
>>>   - Maybe introduce the notion of epochs that can be started and
>>> ended to help in fault isolation.
>> Here's another option: if a collective fails, each receiver of the
>> collective either receives the data or aborts (i.e. no hangs or
>> invalid data). If the user is employing other fault tolerance
>> techniques such as checkpointing or message logging, this
>> functionality should be sufficient.
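The "receive or abort" semantic and the two-phase-commit idea above can be sketched in a few lines. This is a toy in-process simulation (no MPI calls; the function name is made up): data is staged in phase one, and only committed to the user-visible result after every live process has acknowledged, so no process exits early with a partially delivered buffer.

```python
# Toy simulation of receive-or-abort collective semantics via a
# two-phase broadcast. Illustration only; not an MPI interface.

def two_phase_bcast(data, ranks, failed):
    """Return {rank: data} for all ranks, or raise if the broadcast
    must abort (no rank ever sees partially delivered state)."""
    staged = {}
    # Phase 1: deliver into a staging area and collect acks.
    for r in ranks:
        if r in failed:
            # A missing ack means some process may lack the data,
            # so everyone aborts rather than exiting early.
            raise RuntimeError("broadcast aborted: rank %d failed" % r)
        staged[r] = data
    # Phase 2: all acks arrived; commit atomically.
    return staged


ranks = [0, 1, 2, 3]
assert two_phase_bcast("x", ranks, failed=set()) == {r: "x" for r in ranks}
try:
    two_phase_bcast("x", ranks, failed={2})
    aborted = False
except RuntimeError:
    aborted = True
assert aborted
```

The final synchronization is exactly the cost mentioned earlier for Bcast: the early-exit optimization is traded away so that every receiver ends in one of only two states, committed or aborted.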
>>> * Could fault tolerance only apply to a subset of the MPI API for
>>> simplicity? For example, exclude collective communications or
>>> MPI I/O.
>> Most applications use at least point-to-points or collectives, so if
>> you disable fault tolerance for collectives, you disable it for most
>> applications.
>> Greg Bronevetsky
>> Post-Doctoral Researcher
>> 1028 Building 451
>> Lawrence Livermore National Lab
>> (925) 424-5756
>> bronevetsky1 at llnl.gov
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ft_full_intro.ppt
Type: application/x-mspowerpoint
Size: 257536 bytes
Desc: not available
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-ft/attachments/20080131/2311ba95/attachment-0001.bin>
