[Mpi3-ft] Fault Tolerance (sub)Chapter or Tighter Integration

Joshua Hursey jjhursey at open-mpi.org
Tue Mar 1 08:38:49 CST 2011

We are edging toward a final draft of the run-through stabilization proposal and are about to embark on process recovery (TBA). As we do so, I wanted to start thinking about how we might integrate this language into the current MPI standard. A PDF version of the working proposal will make it easier for someone new to pick up and read exactly what we are going to add, in contrast to the mixture of notes and standard text that is currently on the wiki.

In particular, should we:
 A) Create an entirely new chapter on Fault Tolerance and Error Management. Pull in all existing sections to a central location.
 B) Add a section to the Environmental Management chapter on Fault Tolerance. Pull in relevant existing sections on error handling into this section.
 C) Tightly integrate the semantics throughout the MPI standard (e.g., P2P semantics in the P2P chapter, Collective semantics in the Collectives chapter).
 D) Something else...

There are pros and cons to each. In essence the question is, should we move all the error management logic to a central location or keep it close to the actual functionality?

What do folks think about this?

-- Josh

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
