[Mpi3-ft] fault-tolerant collectives

Josh Hursey jjhursey at open-mpi.org
Fri Sep 9 15:53:04 CDT 2011


Thanks. I think I was the one who introduced the bad wording into the
text. :/ Thanks for catching it.

On Fri, Sep 9, 2011 at 4:39 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>
> OK, that makes sense.  I'll fix up that text.
>
> Thanks,
> -d
>
> On Sep 9, 2011, at 3:36 PM, Josh Hursey wrote:
>
>> I think that we want to say that an implementation may provide uniform
>> return codes from collectives, but is not required to do so. So this
>> makes them fault tolerant-ish, in the sense that they have to work
>> around failures to return error codes consistently, but not in the
>> sense that they finish the collective successfully even if new process
>> failures emerge during the collective (that would undermine the
>> semantic protections we are putting in place).
>>
>> We should probably not say 'fault tolerant collectives' in the current
>> proposal so we don't confuse things. Maybe 'collectives that provide
>> uniform return codes'?
>>
>>
>> If we want truly fault tolerant collectives (like those described
>> below), then I think we should introduce a different set of functions.
>> The functions should probably return a group of processes that either
>> did or did not participate in creating the final result. Something
>> like:
>>  MPI_Reduce_ft(..., &group);
>>
>> I think the truly fault tolerant collectives should be left to a
>> follow-on ticket: there is a need for them, but they can easily be
>> added as a second step.
>>
>> -- Josh
>>
>> On Fri, Sep 9, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>>
>>> We discussed the option of allowing an implementation to provide fault tolerant (not just fault-aware) collectives.  The idea is that even when a process fails, collectives will continue to operate correctly (modulo the failed process).
>>>
>>> Does this imply that the communicator will never become collectively inactive?
>>>
>>> If no, then what's the point of ft collectives?
>>>
>>> If yes, then the application may never get notification that a process has failed and collectives are now running one short.  Is this what we really want?
>>>
>>> -d
>>> _______________________________________________
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>
>>>
>>
>>
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
>
>
>
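
To make the distinction in the quoted thread concrete, here is a rough
sketch in plain C. It does not use MPI at all; it only simulates the two
behaviors on an array of ranks. Everything in it is a made-up name for
illustration (sim_rank, sim_reduce_uniform, sim_reduce_ft, the SIM_*
codes), and the bool array is only a stand-in for the "group" argument
suggested for MPI_Reduce_ft. Nothing here is taken from the proposal text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; not from the MPI standard or the proposal. */
enum { SIM_SUCCESS = 0, SIM_ERR_PROC_FAILED = 1 };

/* Hypothetical stand-in for one rank in a simulated communicator. */
typedef struct {
    int  value;  /* this rank's contribution to the reduction */
    bool alive;  /* false once the rank has failed */
} sim_rank;

/*
 * "Uniform return codes": if any rank has failed, the reduction is
 * abandoned and the same error code is reported for the whole call.
 * *result is left untouched in that case.
 */
int sim_reduce_uniform(const sim_rank *ranks, size_t n, int *result)
{
    assert(ranks != NULL && result != NULL);
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (!ranks[i].alive)
            return SIM_ERR_PROC_FAILED;
        sum += ranks[i].value;
    }
    *result = sum;
    return SIM_SUCCESS;
}

/*
 * "Truly fault tolerant": the reduction completes over the surviving
 * ranks, and participated[i] records whether rank i contributed to the
 * final result (the role the returned group would play).
 */
int sim_reduce_ft(const sim_rank *ranks, size_t n, int *result,
                  bool *participated)
{
    assert(ranks != NULL && result != NULL && participated != NULL);
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        participated[i] = ranks[i].alive;
        if (ranks[i].alive)
            sum += ranks[i].value;
    }
    *result = sum;
    return SIM_SUCCESS;
}
```

With four ranks holding 1, 2, 3, 4 and the third one failed,
sim_reduce_uniform reports SIM_ERR_PROC_FAILED and produces no result,
while sim_reduce_ft returns 7 and marks the third rank as a
non-participant. Returning the participant set is what lets the
application notice that the collective ran one short.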



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



