[Mpi3-ft] fault-tolerant collectives

Josh Hursey jjhursey at open-mpi.org
Wed Sep 14 13:32:55 CDT 2011

Note: I started a new thread on the multiple MPI_Comm_validate
operation, since it diverges from the topic of this thread.

For fault tolerant collectives:

So how about a function signature like:
  MPI_*_ft(..., fail_grp)

So postfix the function with '_ft' to differentiate them from the
traditional collectives.

Semantically the fault tolerant collectives will (to the greatest
possible extent) work around existing and emerging process failures to
complete the operation successfully. Upon successful completion, all
processes are returned a consistent version of 'fail_grp' containing
those failed processes that did not contribute to the completion of
the collective. If the collective is not successful, then all
processes will be returned some error code.

So function signatures like:
MPI_Barrier_ft(..., fail_grp);
MPI_Bcast_ft(..., fail_grp);
MPI_Gather{v}_ft(..., fail_grp);
MPI_Scatter{v}_ft(..., fail_grp);
MPI_Allgather{v}_ft(..., fail_grp);
MPI_Alltoall{v|w}_ft(..., fail_grp);
MPI_Reduce_ft(..., fail_grp);
MPI_Allreduce_ft(..., fail_grp);
MPI_Reduce_scatter_ft(..., fail_grp);
MPI_Reduce_scatter_block_ft(..., fail_grp);
MPI_Scan_ft(..., fail_grp);
MPI_Exscan_ft(..., fail_grp);
+ nonblocking versions of each.

Note that MPI_Comm_validate() and MPI_Barrier_ft() would
algorithmically behave exactly the same, except that the former would
're-enable traditional collectives'.

Some questions:
 * Is MPI_Barrier_ft() useful to provide?

 * Should we require that the communicator be collectively active
before calling these collectives (like we do for traditional
collectives)? Since they can work around process failure, we might
want to consider lifting this restriction.

 * Do we want to introduce an object like MPI_GROUP_IGNORE (similar to
MPI_STATUS_IGNORE) that allows a user to ignore the fail_grp
parameter? For example a user calling allreduce may not care which
processes failed as long as a result was reached uniformly.

 * Are there any constraints on the output buffers for these
collectives? For example, a reduce that must work around failures
might need some clarification about the numerical stability of the
output. And MPI_Op functions might be called multiple times for the
same reduction, depending on whether the collective needs to be
re-executed to provide the necessary semantics.

 * The traditional collectives do not return uniformly, and are only
required to be fault aware. These new collectives must return
uniformly, and be fault tolerant. Do we want the ability to have
traditional collectives return uniformly, but only be required to be
fault aware? Something like a half-step between traditional
fault-aware collectives and fully fault tolerant collectives. We allow
for this in an 'advice to implementors', but do we need it to be an
explicit capability?


-- Josh

On Mon, Sep 12, 2011 at 1:40 PM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
> It would be useful from a scalability standpoint.
> I have not thought through how much trouble it
> might cause from a programmability standpoint.
> It just seems the inherited mechanism is useful
> in terms of cost. It might be useful to have an attribute
> that says to return an error (or some other notification)
> if it has been validated through inheritance. Then
> the library fixes could potentially just be local
> instead of repeating the interprocess work...
> On Mon, 12 Sep 2011, Darius Buntinas wrote:
>> Are you saying that the app might want to validate a library's
>> (internal) communicator for it?
>> In general, how safe would it be to do that if the library isn't expecting
>> it?  I suppose we could do something like an inherited attribute that makes
>> all subcommunicators automatically validate when its parent does.
>> -d
>> On Sep 12, 2011, at 12:12 PM, Bronis R. de Supinski wrote:
>>> Suppose some communicators are used in libraries
>>> (duplication of MPI_COMM_WORLD is very common).
>>> How would any part of the code know about all of
>>> them? Sure, you could use the profiling interface
>>> to intercept calls to track them but usual application
>>> code would not have any record of the handles...
>>> On Mon, 12 Sep 2011, Darius Buntinas wrote:
>>>> On Sep 11, 2011, at 4:45 PM, Josh Hursey wrote:
>>>>> On Fri, Sep 9, 2011 at 7:57 PM, Graham, Richard L. <rlgraham at ornl.gov>
>>>>> wrote:
>>>>>> I have been talking a reasonable amount with apps folks lately about
>>>>>> this proposal, and their first response is often one of shock, as it is not
>>>>>> quite what folks initially expect.  However, once one explains the
>>>>>> background for the proposal, people tend to accept the notions.
>>>>> Can you explain a bit more about what they were shocked by? Was it the
>>>>> general notion of application involved FT, or the interface not being
>>>>> what they expected/needed?
>>>>>> I agree that we need to define a mechanism for specifying return codes
>>>>>> - uniform among surviving ranks, or locally determined types.  However, I do
>>>>>> believe that we need to add the second set of collectives into 3.0.  We have
>>>>>> mentioned this as an option for several years (actually since the inception
>>>>>> of the group almost 4 years ago), but as a working group never did something
>>>>>> explicit about this.  There is a reasonable number of apps folks that expect
>>>>>> this type of collective communications.
>>>>> It shouldn't be too difficult to specify/add, and a prototype
>>>>> implementation would be trivial though maybe inefficient at first. Is
>>>>> this something that we should put in the Stabilization proposal or
>>>>> bring in as a separate ticket directly afterward?
>>>>> Separating the two keeps the initial proposal simpler and users can
>>>>> get this functionality by wrapping existing collectives in
>>>>> comm_validate calls (not efficient, but functional). Keeping them
>>>>> together allows us to address a known interface optimization that
>>>>> applications want in the first pass.
>>>> It seems that this can be done in the app itself (or a stand-alone
>>>> utility library):
>>>> int MYMPI_Bcast_ft(...) {
>>>>  int ret;
>>>>  while (1) {
>>>>      ret = MPI_Bcast(...);
>>>>      if (ret == MPI_ERR_RANK_FAILSTOP) {
>>>>          /* re-enable collectives, then retry the broadcast */
>>>>          ret = MPI_Comm_validate(...);
>>>>          if (ret) return ret;
>>>>      } else {
>>>>          return ret;
>>>>      }
>>>>  }
>>>> }
>>>> ...or something like that.  If an implementation implements its
>>>> collectives in a FT manner, then the validate call can be close to a no-op
>>>> (maybe a simple allreduce to make sure the failed_groups are the same).
>>>>>> One other thing that came up yesterday (I have given 2 talks about the
>>>>>> FT stuff in Kobe this week) is that it would be good to be able to specify
>>>>>> multiple communicators to mpi_comm_validate(), especially, since a common
>>>>>> motif is to dup an existing communicator to isolate communication.  This is
>>>>>> really the only way that I can think of to avoid un-needed global
>>>>>> communication, if more than one communicator is of interest to the app.
>>>>> I've had a couple applications ask about this as well.
>>>>> The group has talked about such an interface a few times now, and keeps
>>>>> getting stuck on specifying the interface and semantics of such an
>>>>> operation. Did they want a function that would take an array of
>>>>> communicators to validate, or have the validation of one communicator
>>>>> be inherited by all of the derived communicators?
>>>>> I think the array of communicators interface seems like the easiest to
>>>>> use, and makes it easier to protect libraries. But that leads us to
>>>>> the question, do all processes (union of processes from all
>>>>> communicators specified?) have to supply the same set of
>>>>> communicators? If not, do we run the risk of a circular dependency
>>>>> causing the call to deadlock? We might be able to pass the
>>>>> responsibility to avoid such problems off to the user.
>>>> Yeah, I think it would be easiest to require that the same communicators
>>>> are specified.  And I don't think that would be too difficult for the users
>>>> to handle.
>>>> We talked about "linking" communicators with attributes or
>>>> something, such that when one communicator is validated, all of the linked
>>>> communicators are also validated.  In that case we said that all of the
>>>> linked communicators must be subsets (or dups) of the communicator they're
>>>> linked to.
>>>> We could have a similar "subset" restriction for the
>>>> array-of-communicators-to-comm_validate option.
>>>>> I'm game for trying to specify this again. The group decided to push
>>>>> this off to a follow on ticket because it can be achieved (though
>>>>> inefficiently) by making a call to comm_validate for each of the
>>>>> communicators, and we had trouble specifying it correctly. So the
>>>>> question again is should we keep it as a separate ticket or add it to
>>>>> the stabilization proposal?
>>>> I'm kinda leaning toward adding it to the current proposal, but I don't
>>>> want to delay the proposal.
>>>> -d
>>>>> Thoughts?
>>>>> -- Josh
>>>>>> Rich
>>>>>> -----Original Message-----
>>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org
>>>>>> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
>>>>>> Sent: Friday, September 09, 2011 4:39 PM
>>>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>>> Subject: Re: [Mpi3-ft] fault-tolerant collectives
>>>>>> OK, that makes sense.  I'll fix up that text.
>>>>>> Thanks,
>>>>>> -d
>>>>>> On Sep 9, 2011, at 3:36 PM, Josh Hursey wrote:
>>>>>>> I think that we want to say that an implementation may provide uniform
>>>>>>> return codes from collectives, but is not required to do so. So this
>>>>>>> makes them fault tolerant-ish - in the sense that they have to work
>>>>>>> around failure to return error codes consistently, but not that they
>>>>>>> finish the collective successfully even if new process failures emerge
>>>>>>> during the collectives (that would undermine the semantic protections
>>>>>>> we are putting in place).
>>>>>>> We should probably not say 'fault tolerant collectives' in the current
>>>>>>> proposal so we don't confuse things. Maybe 'collectives that provide
>>>>>>> uniform return codes'?
>>>>>>> If we want truly fault tolerant collectives (like those described
>>>>>>> below), then I think we should introduce a different set of functions.
>>>>>>> The functions should probably return a group of processes that either
>>>>>>> did or did not participate in creating the final result. Something
>>>>>>> like:
>>>>>>> MPI_Reduce_ft(..., &group);
>>>>>>> I think the true fault tolerant collectives should be left to a follow
>>>>>>> on ticket since there is a need, but can be easily added as a second
>>>>>>> step.
>>>>>>> -- Josh
>>>>>>> On Fri, Sep 9, 2011 at 4:17 PM, Darius Buntinas
>>>>>>> <buntinas at mcs.anl.gov> wrote:
>>>>>>>> We discussed the option of allowing an implementation to provide
>>>>>>>> fault tolerant (not just fault-aware) collectives.  The idea is that even
>>>>>>>> when a process fails, collectives will continue to operate correctly (modulo
>>>>>>>> the failed process).
>>>>>>>> Does this imply that the communicator will never become collectively
>>>>>>>> inactive?
>>>>>>>> If no, then what's the point of ft collectives?
>>>>>>>> If yes, then the application may never get notification that a
>>>>>>>> process has failed and collectives are now running one short.  Is this what
>>>>>>>> we really want?
>>>>>>>> -d
>>>>>>>> _______________________________________________
>>>>>>>> mpi3-ft mailing list
>>>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>>>> --
>>>>>>> Joshua Hursey
>>>>>>> Postdoctoral Research Associate
>>>>>>> Oak Ridge National Laboratory
>>>>>>> http://users.nccs.gov/~jjhursey

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
