[Mpi3-ft] Run-Through Stabilization

Joshua Hursey jjhursey at open-mpi.org
Tue Aug 24 16:01:28 CDT 2010

I recently started drafting a proposal to support applications that wish to run through fail-stop process failures. This is an attempt to pull apart the current FT proposals into two camps: stabilization and recovery. The run-through stabilization proposal is meant to provide the foundation for, and a complement to, an eventual recovery proposal.

The run-through stabilization proposal aggregates much of the discussion in the working group so far regarding error management and validation of communicators. I modified some of the interfaces based on recent experimentation with them. I have also been reading through various parts of the standard to find semantics, interfaces, and discussions that will need to be modified in order to support run-through stabilization.

The current draft on the wiki is rough and still in development. I have a few more notes and changes to make, but I wanted to circulate the current state before the teleconf tomorrow. This way I can introduce the proposal and we can start discussing some of the concepts. My hope is that this proposal helps spark more discussion in the working group.

The proposal is linked off of the main FT working group wiki, and is at the link below:

Concurrently with the development of the proposal text, I am working on a prototype implementation in Open MPI that should help guide the process of refining the proposal.

-- Josh

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
