[Mpi3-ft] Conference Call, Jun 11

Aurélien Bouteiller bouteill at icl.utk.edu
Mon Jun 10 10:00:00 CDT 2013


Dear WG members: 

According to our new schedule, we will have a confcall on TUESDAY, JUNE 11, at 15H00 EDT.

Agenda:
* Debriefing from Forum
* Consider relaxed semantics for I/O
* Consider an additional fn to obtain the globally agreed on group of failed processes


I will not be able to attend, but I think Wesley will.



In any case, here are my personal notes of the meeting minutes. 

I/O:
 * Mohamed said he would find it useful to be able to continue operating on a FILE after an error has been reported, at least to perform local, non-synchronizing operations. The issue is that we have defined the file pointer as "undefined" after a failure is reported, so continuing makes no sense: it would result in incorrect behavior.
 * Yet it is difficult to offer anything better. With async operations, what should the status of the file pointer be when ops A, B, and C are pending, and op C completes while op B fails, all during the same completion call? We need to think about it and check with the I/O people (and the I/O text, to see what they define on backend errors).

RMA: 
 * No strong issues were found by the Forum; it seems that we are on track for this one.

Coll: 
 * Thorsten has been very vocal, which spurred a lot of discussion.
  * One of the topics was "why do you need shrink, comm_create can do it too". I explained that shrink is FT and operates on Revoked comms, whereas comm_create is not FT (at least by spec; a typical implementation would at least operate correctly on comms with errors, w/o new failures). We do not want to overload non-FT functions with FT semantics. This is bad practice and will come back to bite us later if we do that.
  * Yet Jim commented that there is no clear way for a user to know the set of globally known failures without revoking (or writing their own P2P FT all-reduce on groups). We may want to consider this, at least to illustrate how one can code such a fn as a helper in the current context.
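
To make the "all-reduce on groups" idea concrete, here is a minimal sketch of the reduction step only, written as a toy model with no MPI calls. The names rank_mask and agreed_failed_set are hypothetical and belong to no MPI API. The sketch assumes that each rank's view of failed processes fits in a 32-bit mask, and it takes the set of survivors as an input, which a real fault-tolerant exchange would have to discover as it goes.

```c
#include <assert.h>
#include <stdint.h>

/* Bit r of a mask stands for rank r (at most 32 ranks in this toy model). */
typedef uint32_t rank_mask;

/*
 * Hypothetical helper, not part of any MPI API.
 *
 * local_view[r] is the set of ranks that rank r believes have failed.
 * alive is the set of surviving ranks; only these ranks could take part
 * in a real exchange, so the views of the other ranks are ignored.
 *
 * Returns the union of the survivors' views, i.e. the result a
 * bitwise-OR all-reduce over the surviving ranks would produce.
 */
rank_mask agreed_failed_set(const rank_mask local_view[], int nprocs,
                            rank_mask alive)
{
    assert(nprocs > 0 && nprocs <= 32);

    rank_mask agreed = 0;
    for (int r = 0; r < nprocs; r++) {
        if (alive & ((rank_mask)1 << r)) {
            agreed |= local_view[r];  /* merge this survivor's view */
        }
    }
    return agreed;
}
```

For example, with four ranks where rank 3 has failed and only ranks 0 and 2 have noticed, the survivors' views are 0x8, 0x0 and 0x8, and the call with alive = 0x7 returns 0x8, so rank 1 also learns about the failure. The hard part, which this sketch leaves out, is making the exchange itself tolerate new failures while it runs.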

Tickets:
 * No objections to proposed closure of tickets. 




Other progress: 
 * Text is now rebased on the MPI-3 document
 * Text repo is now read-public


That's all folks! 

Aurelien

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375










