[Mpi3-ft] fault-tolerant collectives
Darius Buntinas
buntinas at mcs.anl.gov
Fri Sep 9 15:39:09 CDT 2011
OK, that makes sense. I'll fix up that text.
Thanks,
-d
On Sep 9, 2011, at 3:36 PM, Josh Hursey wrote:
> I think that we want to say that an implementation may provide uniform
> return codes from collectives, but is not required to do so. So this
> makes them fault tolerant-ish - in the sense that they have to work
> around failure to return error codes consistently, but not that they
> finish the collective successfully even if new process failures emerge
> during the collectives (that would undermine the semantic protections
> we are putting in place).
>
> We should probably not say 'fault tolerant collectives' in the current
> proposal so we don't confuse things. Maybe 'collectives that provide
> uniform return codes'?
>
>
> If we want truly fault tolerant collectives (like those described
> below), then I think we should introduce a different set of functions.
> The functions should probably return a group of processes that either
> did or did not participate in creating the final result. Something
> like:
> MPI_Reduce_ft(..., &group);
>
> I think the true fault tolerant collectives should be left to a
> follow-on ticket: there is a need for them, but they can easily be
> added as a second step.
>
> -- Josh
>
> On Fri, Sep 9, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>
>> We discussed the option of allowing an implementation to provide fault tolerant (not just fault-aware) collectives. The idea is that even when a process fails, collectives will continue to operate correctly (modulo the failed process).
>>
>> Does this imply that the communicator will never become collectively inactive?
>>
>> If no, then what's the point of ft collectives?
>>
>> If yes, then the application may never get notification that a process has failed and collectives are now running one short. Is this what we really want?
>>
>> -d
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>
>>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
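
As a rough illustration of the MPI_Reduce_ft(..., &group) idea quoted above, here is a minimal sketch in plain C that does not use MPI at all. It only simulates the semantics: the reduction proceeds without the failed ranks and reports which ranks contributed to the result. The function name reduce_sum_ft_sim, its parameters, and the assumption that failures are known up front are all hypothetical; nothing here is a real or proposed MPI interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, MPI-free simulation of a fault tolerant sum reduction.
 *
 *   values[i]       contribution of rank i
 *   alive[i]        false if rank i failed before contributing
 *                   (assumed to be known in advance in this sketch)
 *   sum_out         receives the sum over the ranks that contributed
 *   participated[i] output flag per rank; stands in for the group that
 *                   MPI_Reduce_ft(..., &group) would return
 *
 * Returns the number of ranks that contributed to the result.
 */
size_t reduce_sum_ft_sim(const double *values, const bool *alive,
                         size_t n_ranks, double *sum_out,
                         bool *participated)
{
    assert(values != NULL && alive != NULL);
    assert(sum_out != NULL && participated != NULL);

    double sum = 0.0;
    size_t count = 0;

    for (size_t i = 0; i < n_ranks; ++i) {
        /* A failed rank is skipped instead of aborting the reduction. */
        participated[i] = alive[i];
        if (alive[i]) {
            sum += values[i];
            ++count;
        }
    }

    *sum_out = sum;
    return count;
}
```

Because the caller gets the participant set back together with the result, it can tell that the collective ran short, which speaks to the notification concern raised earlier in the thread.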