[Mpi3-ft] fault-tolerant collectives

Darius Buntinas buntinas at mcs.anl.gov
Fri Sep 9 15:17:13 CDT 2011


We discussed the option of allowing an implementation to provide fault-tolerant (not just fault-aware) collectives.  The idea is that even when a process fails, collectives will continue to operate correctly among the surviving processes (that is, modulo the failed process).

Does this imply that the communicator will never become collectively inactive?

If no, then what's the point of fault-tolerant collectives?

If yes, then the application may never be notified that a process has failed and that collectives are now running one process short.  Is this what we really want?

-d


