[Mpi3-ft] fault-tolerant collectives

Darius Buntinas buntinas at mcs.anl.gov
Fri Sep 9 15:39:09 CDT 2011


OK, that makes sense.  I'll fix up that text.

Thanks,
-d

On Sep 9, 2011, at 3:36 PM, Josh Hursey wrote:

> I think that we want to say that an implementation may provide uniform
> return codes from collectives, but is not required to do so. So this
> makes them fault tolerant-ish, in the sense that they have to work
> around failures to return error codes consistently, but not in the
> sense that they finish the collective successfully even if new process
> failures emerge during the collective (that would undermine the
> semantic protections we are putting in place).
> 
> We should probably not say 'fault tolerant collectives' in the current
> proposal so we don't confuse things. Maybe 'collectives that provide
> uniform return codes'?
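
To make "uniform return codes" concrete, here is a minimal sketch in plain C. It is a model of the semantics only and makes no MPI calls. The function name uniformize_return_codes and the MODEL_* constants are hypothetical and are not part of the proposal or of any MPI implementation. It assumes that a dead rank, or any local error on a surviving rank, makes every surviving rank report the same error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes for this model; these are not MPI constants. */
enum { MODEL_SUCCESS = 0, MODEL_ERR_PROC_FAILED = 1 };

/*
 * Model of "uniform return codes": if any rank is dead, or any surviving
 * rank observed an error locally, every surviving rank reports the same
 * error code.
 *   local_rc[i] : what rank i observed on its own
 *   alive[i]    : whether rank i survived the collective
 *   out_rc[i]   : what rank i finally returns (left untouched if dead)
 */
void uniformize_return_codes(const int *local_rc, const bool *alive,
                             size_t n, int *out_rc)
{
    assert(local_rc != NULL && alive != NULL && out_rc != NULL);

    /* First pass: decide on one code for the whole collective. */
    int agreed = MODEL_SUCCESS;
    for (size_t i = 0; i < n; i++) {
        if (!alive[i] || local_rc[i] != MODEL_SUCCESS)
            agreed = MODEL_ERR_PROC_FAILED;
    }

    /* Second pass: every survivor returns the agreed code. */
    for (size_t i = 0; i < n; i++) {
        if (alive[i])
            out_rc[i] = agreed;
    }
}
```

Note that the model only makes the error codes agree. It does not try to complete the collective, which matches the distinction drawn above.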
> 
> 
> If we want truly fault tolerant collectives (like those described
> below), then I think we should introduce a different set of functions.
> The functions should probably return a group of processes that either
> did or did not participate in creating the final result. Something
> like:
>  MPI_Reduce_ft(..., &group);
> 
> I think the true fault tolerant collectives should be left to a
> follow-on ticket: there is a need for them, but they can easily be
> added as a second step.
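
The idea of a reduction that also reports who took part can be sketched in plain C as follows. This is only a model of the semantics, not an MPI program. MPI_Reduce_ft is just a suggested name in this thread, and the function reduce_sum_ft_model and the participated[] array (standing in for the group output argument) are hypothetical. The sketch assumes a sum reduction over integer contributions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of a fault tolerant sum reduction: combine the contributions of
 * the surviving ranks and report which ranks took part.
 *   values[i]       : contribution of rank i
 *   alive[i]        : whether rank i survived
 *   result          : receives the sum over surviving ranks
 *   participated[i] : set to true if rank i contributed to the result
 *                     (a stand-in for a group argument, not an MPI_Group)
 * Returns the number of participating ranks.
 */
size_t reduce_sum_ft_model(const int *values, const bool *alive, size_t n,
                           long *result, bool *participated)
{
    assert(values != NULL && alive != NULL);
    assert(result != NULL && participated != NULL);

    long sum = 0;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        participated[i] = alive[i];
        if (alive[i]) {
            sum += values[i];
            count++;
        }
    }
    *result = sum;
    return count;
}
```

A caller can compare the returned count against the communicator size to notice that the result is "one short", which is the notification concern raised in the original question.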
> 
> -- Josh
> 
> On Fri, Sep 9, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>> 
>> We discussed the option of allowing an implementation to provide fault tolerant (not just fault-aware) collectives.  The idea is that even when a process fails, collectives will continue to operate correctly (modulo the failed process).
>> 
>> Does this imply that the communicator will never become collectively inactive?
>> 
>> If no, then what's the point of ft collectives?
>> 
>> If yes, then the application may never get notification that a process has failed and collectives are now running one short.  Is this what we really want?
>> 
>> -d
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>> 
>> 
> 
> 
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 




