[Mpi3-ft] FTWG conference call today

Sur, Sayantan sayantan.sur at intel.com
Wed Dec 12 12:31:54 CST 2012


Hello WG members,

Josh, Darius and I were on the call. We discussed our assignment to define what happens to objects upon failure. Specifically, we looked at objects that are created locally (i.e. creating them does not require any remote process to call MPI) but that the MPI implementation may nevertheless store in a distributed fashion.

We had a short brainstorming session. The following ideas came up:

- We could require that, when such an object is accessed after a failure, the implementation reports either SUCCESS or FAILURE, i.e. there are no corrupted or partially available objects.
- It could be that some alive ranks can read their objects, whereas others cannot.
- The app could use MPI_Comm_agree to reach consensus on whether all required objects can be read on the ranks that are still alive.
- For some objects, such as Datatype, there are no accessor functions; the object is only touched when it is used (e.g. Send/Recv). An MPI implementation could return an error when the app uses a datatype whose internal representation is no longer available to the implementation. However, this is not very useful on its own, as the app then needs a way to discern why the send failed.
- Would it make sense to add *_Check functions to objects to see if they are still available (after failure)?
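To make the check-then-agree idea concrete, here is a rough sketch in plain C. It does not call MPI at all; it only simulates the outcome we would expect if every alive rank ran a local check on its object and then contributed the result to an agreement that ANDs the values together. The names rank_state, object_ok and agree_objects_readable are made up for illustration, and the AND semantics are my assumption about how the app would combine the per-rank results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-rank view after a failure; not part of any MPI API. */
typedef struct {
    bool alive;      /* the rank survived the failure */
    bool object_ok;  /* outcome of a hypothetical local *_Check on the object */
} rank_state;

/*
 * Simulated agreement: the result is true only if every alive rank
 * reports that its object is readable. Failed ranks contribute nothing,
 * so their object_ok value is ignored.
 */
bool agree_objects_readable(const rank_state *ranks, size_t n)
{
    assert(ranks != NULL || n == 0);

    bool all_ok = true;
    for (size_t i = 0; i < n; i++) {
        if (ranks[i].alive && !ranks[i].object_ok) {
            all_ok = false;  /* one alive rank cannot read its object */
        }
    }
    return all_ok;
}
```

In a real application each rank would only know its own object_ok value, and the loop would be replaced by the collective agreement call. The point of the sketch is that every alive rank ends up with the same yes/no answer, even though some ranks can read their objects and others cannot.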

Please let me know if I missed something in the notes.

Sayantan


> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Aurélien Bouteiller
> Sent: Wednesday, December 12, 2012 6:33 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: [Mpi3-ft] FTWG conference call today
> 
> Dear working group members,
> 
> We have our usual biweekly conference call planned for today.
> Unfortunately, nobody from UT is available to attend, but the conference call
> will be set up and available to the group anyway.
> 
> We would appreciate it if somebody could keep a summary of the discussions.
> 
> 
> Agenda:
> - Followup items from the Meeting
> 
> 
> Date: Dec 12, 2012
> Time: Noon EDT/New York
> Dial-in information: 218-339-4600
> Code: 623998#
> 
> 
> Next Meeting:
> * Jan. 9, 2013
> 
> 
> Please note: Dec. 26 date has been cancelled.
> 
> 
> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 309b
> * Knoxville, TN 37996
> * 865 974 9375
> 
