[Mpi3-ft] Fault Tolerance (sub)Chapter or Tighter Integration
jjhursey at open-mpi.org
Thu Mar 3 11:16:58 CST 2011
Yeah, that is along the lines I was thinking. It might be useful to do a plenary session to introduce the forum to the general concepts that we are playing with at the moment, and get some initial feedback. Maybe introduce the run-through stabilization proposal after we start to get a feel for the recovery proposal. But not necessarily a plenary session as a means to a first reading. I think we need to have a good feel for the recovery semantics (implementation, application uses, ...) before we start thinking about preparing for a reading/vote.
My intention with this email was just to get a feeling for whether folks have started thinking about what a standard PDF would look like with these additions. A clean PDF version of the developing proposal might be useful for someone new to see just the changes to the standard we are presenting, without all of our notes, as on the wiki. I don't know if any of us have the cycles to start putting together the PDF, but if we did, I was just curious if there was a preference on how to integrate the text.
I kind of like the idea of adding a section to the Environmental Management chapter, then putting pointers to this chapter throughout the document as necessary. But I don't have a strong feeling one way or the other.
On Mar 2, 2011, at 12:39 PM, Graham, Richard L. wrote:
> Just to raise one point - I don't plan to bring this to the full forum
> until we are further ahead on the implementation with respect to the
> fault-recovery. This seemed to be the preference last time we brought the
> fault-detection proposal to the full group. There is also merit in
> getting further along with the fault recovery to see if there are any
> items missed.
> On 3/2/11 12:31 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
>> I think a section in the environmental management chapter would make
>> sense. Then we wouldn't need additional text in the Point to Point
>> chapter for things like MPI_Send and MPI_Recv, but places where
>> additional explanation is needed (perhaps collectives?) we would add in
>> those chapters.
>> Though I would be OK with making it a chapter too. We should then move
>> Error Handling from the Environmental Management there too. (I think
>> that's what you said)
>> On Mar 1, 2011, at 8:38 AM, Joshua Hursey wrote:
>>> We start edging toward a final draft of the run-through stabilization
>>> proposal and embark on process recovery (TBA). As we do so, I wanted to
>>> start thinking about how we might integrate this language into the
>>> current MPI standard. A PDF version of the working proposal will make it
>>> easier for someone new to pick up and read exactly what we are going to
>>> add. This is in contrast to the mixture of notes and standard text that
>>> is currently on the wiki.
>>> In particular, should we:
>>> A) Create an entirely new chapter on Fault Tolerance and Error
>>> Management. Pull in all existing sections to a central location.
>>> B) Add a section to the Environmental Management chapter on Fault
>>> Tolerance. Pull in relevant existing sections on error handling into
>>> this section.
>>> C) Tightly integrate the semantics throughout the MPI standard (e.g.,
>>> P2P semantics in the P2P chapter, Collective semantics in the
>>> Collectives chapter).
>>> D) Something else...
>>> There are pros and cons to each. In essence the question is, should we
>>> move all the error management logic to a central location or keep it
>>> close to the actual functionality?
>>> What do folks think about this?
>>> -- Josh
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org