[Mpi3-ft] Some remarks from the presentation at the MPI forum

Aurélien Bouteiller bouteill at eecs.utk.edu
Tue Jul 19 13:47:46 CDT 2011


Here are some informal comments I'd like to make regarding the ticket 274 proposal presented at the MPI Forum.


== Why do you propose duplicate functions for group and communicator operations? Some proposed removing the communicator versions but, as Josh correctly stated, the fact that it is attached on a per-communicator basis is a key feature that enables software layer composition. On the other hand, I don't see a clearly identified need for group operations. The only thing group operations enable is validating a group with which you cannot communicate (no communicator with it). Also, group operations are not homogeneous with the rest: they don't have a set (because that would be harmful). I think they should be removed from the proposal unless some clear need is identified.

== Section 17.6.2::19 should be linked from the Finalize definition, where the example of a code doing I/O on rank 0 after finalize is presented. That would keep all FT-related considerations in the FT chapter, with the exception of a warning to users that could read like "beware that in a fault-tolerant application, rank 0 might be dead, in which case the result of finalize is defined in 17.6.2."

== Why order on the lowest rank's return code if rank 0 is dead? The purpose is unclear, but the cost to the implementation of Finalize is real. We could just specify that the return value from any of the surviving processes (if any) can be taken. The text should also define what happens when nobody calls abort, rank 0 is dead, and finalize does not return on ranks > 0 (which is valid according to the standard).

== MPI_Kill: this might go in another ticket about soft errors. The current ticket is about fail-stop failures, and this is the only item concerning soft errors, hence I see it as out of scope. It is a polarizing issue that might need significant rework in the broader context of Byzantine errors. I advocate postponing it.

== The ordering of reduce operations may change across ranks after Validate_all has recognized a failure and, more generally, whenever collective trees are modified.
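
To illustrate why a changed reduction order matters, here is a minimal sketch in plain C (no MPI calls). It assumes a floating-point sum, where addition is not associative; the function names reduce_in_rank_order and reduce_pairwise are hypothetical and only mimic two possible tree shapes, they are not part of the proposal or of any MPI implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: combines contributions strictly in rank order
 * (0, 1, 2, ...), as a chain-shaped reduction would. */
double reduce_in_rank_order(const double *contrib, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += contrib[i];
    }
    return acc;
}

/* Hypothetical helper: combines contributions over [lo, hi) pairwise,
 * as a binary-tree-shaped reduction would. */
double reduce_pairwise(const double *contrib, size_t lo, size_t hi)
{
    assert(lo <= hi);
    if (hi - lo == 0) {
        return 0.0;
    }
    if (hi - lo == 1) {
        return contrib[lo];
    }
    size_t mid = lo + (hi - lo) / 2;
    /* Each half is summed on its own before the two halves are combined. */
    return reduce_pairwise(contrib, lo, mid) + reduce_pairwise(contrib, mid, hi);
}
```

With the contributions {1e30, 1.0, -1e30, 1.0}, the rank-order version returns 1.0 while the pairwise version returns 0.0, because each 1.0 is absorbed when it is added to a value of magnitude 1e30. The same effect can appear if the collective tree is rebuilt after a failure, even when the set of contributing ranks is unchanged.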

That's all folks!

aurelien
--
* Dr. Aurélien Bouteiller
* Research Scientist at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321







