[Mpi3-ft] fault-tolerant collectives
jjhursey at open-mpi.org
Fri Sep 9 15:53:04 CDT 2011
Thanks. I think I was the one that introduced the bad wording into the
text. :/ Thanks for catching it.
On Fri, Sep 9, 2011 at 4:39 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> OK, that makes sense. I'll fix up that text.
> On Sep 9, 2011, at 3:36 PM, Josh Hursey wrote:
>> I think that we want to say that an implementation may provide uniform
>> return codes from collectives, but is not required to do so. So this
>> makes them fault tolerant-ish - in the sense that they have to work
>> around failure to return error codes consistently, but not that they
>> finish the collective successfully even if new process failures emerge
>> during the collectives (that would undermine the semantic protections
>> we are putting in place).
>> We should probably not say 'fault tolerant collectives' in the current
>> proposal so we don't confuse things. Maybe 'collectives that provide
>> uniform return codes'?
>> If we want truly fault tolerant collectives (like those described
>> below), then I think we should introduce a different set of functions.
>> The functions should probably return a group of processes that either
>> did or did not participate in creating the final result. Something
>> like:
>>   MPI_Reduce_ft(..., &group);
>> I think the truly fault tolerant collectives should be left to a
>> follow-on ticket since there is a need, but they can be easily added as a second
>> -- Josh
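To make the MPI_Reduce_ft(..., &group) idea above concrete, here is a minimal sketch in plain C that only simulates the behavior; it makes no MPI calls. Everything in it is an assumption for illustration: `sim_group`, `sim_reduce_ft`, the `alive` array, and the 64-rank cap are hypothetical names and choices, not part of the proposal or of any MPI implementation. It shows a sum reduction that skips failed ranks and reports which ranks actually contributed to the result.

```c
#include <stdbool.h>
#include <stddef.h>

#define SIM_MAX_RANKS 64  /* arbitrary cap chosen for this sketch */

/* Hypothetical stand-in for an MPI group: marks which ranks contributed. */
typedef struct {
    bool member[SIM_MAX_RANKS];
    int  count;
} sim_group;

/*
 * Simulated fault-tolerant sum reduction (NOT an MPI function).
 * values[r] is rank r's contribution; alive[r] is false if rank r failed.
 * On success, stores the sum over surviving ranks in *result, records the
 * contributing ranks in *participants, and returns 0.
 * Returns -1 on invalid arguments.
 */
int sim_reduce_ft(const int *values, const bool *alive, int nranks,
                  int *result, sim_group *participants)
{
    if (values == NULL || alive == NULL || result == NULL ||
        participants == NULL || nranks < 0 || nranks > SIM_MAX_RANKS) {
        return -1;
    }

    /* Start with an empty participant set. */
    participants->count = 0;
    for (int r = 0; r < SIM_MAX_RANKS; r++) {
        participants->member[r] = false;
    }

    /* Combine contributions from surviving ranks only. */
    int sum = 0;
    for (int r = 0; r < nranks; r++) {
        if (alive[r]) {
            sum += values[r];
            participants->member[r] = true;
            participants->count++;
        }
    }

    *result = sum;
    return 0;
}
```

With four ranks contributing 1, 2, 3, 4 and rank 1 marked as failed, the call yields a result of 8 and a participant set of {0, 2, 3}. The caller can then see that the collective ran one short, which is the notification concern raised in the original question.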
>> On Fri, Sep 9, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>> We discussed the option of allowing an implementation to provide fault tolerant (not just fault-aware) collectives. The idea is that even when a process fails, collectives will continue to operate correctly (modulo the failed process).
>>> Does this imply that the communicator will never become collectively inactive?
>>> If no, then what's the point of ft collectives?
>>> If yes, then the application may never get notification that a process has failed and collectives are now running one short. Is this what we really want?
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory