[Mpi3-ft] Fault Tolerance (sub)Chapter or Tighter Integration

Graham, Richard L. rlgraham at ornl.gov
Wed Mar 2 11:39:36 CST 2011

Just to raise one point - I don't plan to bring this to the full forum
until we are further ahead on the implementation with respect to fault
recovery.  This seemed to be the preference last time we brought the
fault-detection proposal to the full group.  There is also merit in
getting further along with fault recovery to see if there are any
items we missed.


On 3/2/11 12:31 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:

>I think a section in the environmental management chapter would make
>sense.  Then we wouldn't need additional text in the Point to Point
>chapter for things like MPI_Send and MPI_Recv, but where additional
>explanation is needed (perhaps collectives?) we would add it in those
>chapters.
>Though I would be OK with making it a chapter too.  We should then move
>Error Handling from the Environmental Management chapter there too. (I
>think that's what you said)
>On Mar 1, 2011, at 8:38 AM, Joshua Hursey wrote:
>> We are edging toward a final draft of the run-through stabilization
>>proposal and embarking on process recovery (TBA). As we do so, I wanted to
>>start thinking about how we might integrate this language into the
>>current MPI standard. A PDF version of the working proposal will make it
>>easier for someone new to pick up and read exactly what we are going to
>>add. This is in contrast to the mixture of notes and standard text that
>>is currently on the wiki.
>> In particular, should we:
>> A) Create an entirely new chapter on Fault Tolerance and Error
>>Management. Pull all existing sections into a central location.
>> B) Add a section to the Environmental Management chapter on Fault
>>Tolerance. Pull in relevant existing sections on error handling into
>>this section.
>> C) Tightly integrate the semantics throughout the MPI standard (e.g.,
>>P2P semantics in the P2P chapter, Collective semantics in the
>>Collectives chapter).
>> D) Something else...
>> There are pros and cons to each. In essence the question is, should we
>>move all the error management logic to a central location or keep it
>>close to the actual functionality?
>> What do folks think about this?
>> -- Josh
>> ------------------------------------
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
