[Mpi3-ft] fault-tolerant collectives

Josh Hursey jjhursey at open-mpi.org
Fri Sep 9 15:36:21 CDT 2011


I think that we want to say that an implementation may provide uniform
return codes from collectives, but is not required to do so. This
makes them fault tolerant-ish: they have to work around failures in
order to return error codes consistently, but they do not have to
finish the collective successfully if new process failures emerge
during the collective (that would undermine the semantic protections
we are putting in place).
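
To make 'uniform return codes' concrete, here is a minimal sketch of the
decision rule only, written in plain C with no MPI calls. The names
SKETCH_SUCCESS, SKETCH_ERR_PROC_FAILED and uniform_return_code are
hypothetical and not part of any proposal text. A real implementation
would also need a failure-resilient agreement step among the surviving
processes, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status values for this sketch; these are not MPI constants. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_PROC_FAILED = 1 };

/*
 * local_codes[i] is the code that surviving process i would have
 * returned on its own. The function computes the single code that
 * every surviving process returns instead: if any process observed
 * a failure, all of them report the failure.
 */
int uniform_return_code(const int *local_codes, size_t n)
{
    assert(local_codes != NULL && n > 0);

    int agreed = SKETCH_SUCCESS;
    for (size_t i = 0; i < n; i++) {
        if (local_codes[i] != SKETCH_SUCCESS)
            agreed = SKETCH_ERR_PROC_FAILED;
    }
    return agreed;
}
```

With this rule, no process can leave the collective believing it
succeeded while another process saw an error.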

We should probably not say 'fault tolerant collectives' in the current
proposal so we don't confuse things. Maybe 'collectives that provide
uniform return codes'?


If we want truly fault tolerant collectives (like those described
below), then I think we should introduce a different set of functions.
The functions should probably return a group of processes that either
did or did not participate in creating the final result. Something
like:
  MPI_Reduce_ft(..., &group);
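
As a rough illustration of what such a call might hand back, the sketch
below simulates a sum reduction over ranks in plain C, without MPI. The
function reduce_sum_ft_sketch and its parameters are hypothetical; the
participated array stands in for the returned group, and the alive
array is an assumed input that a real implementation would have to
discover through failure detection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the idea behind MPI_Reduce_ft(..., &group).
 *   values[r]       contribution of rank r
 *   alive[r]        whether rank r survived the collective (assumed known)
 *   participated[r] output: true if rank r contributed to the result,
 *                   playing the role of the returned group
 * Returns the sum over the ranks that participated.
 */
long reduce_sum_ft_sketch(const long *values, const bool *alive,
                          bool *participated, size_t nranks)
{
    assert(values != NULL && alive != NULL && participated != NULL);

    long sum = 0;
    for (size_t r = 0; r < nranks; r++) {
        participated[r] = alive[r];
        if (alive[r])
            sum += values[r];
    }
    return sum;
}
```

The point of returning the group is that the caller can tell the result
is 'one short' and which ranks are missing, rather than silently
receiving a partial reduction.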

I think truly fault tolerant collectives should be left to a follow-on
ticket: there is a need for them, but they can easily be added as a
second step.

-- Josh

On Fri, Sep 9, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>
> We discussed the option of allowing an implementation to provide fault tolerant (not just fault-aware) collectives.  The idea is that even when a process fails, collectives will continue to operate correctly (modulo the failed process).
>
> Does this imply that the communicator will never become collectively inactive?
>
> If no, then what's the point of ft collectives?
>
> If yes, then the application may never get notification that a process has failed and collectives are now running one short.  Is this what we really want?
>
> -d



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



