[Mpi3-ft] fault-tolerant collectives

Mon Sep 12 12:12:02 CDT 2011

Suppose some communicators are used in libraries
(duplication of MPI_COMM_WORLD is very common).
How would any part of the code know about all of
them? Sure, you could use the profiling interface
to intercept the calls and track them, but ordinary
application code would not have any record of the handles...
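
To make the profiling-interface idea concrete, here is a rough sketch of
what such an interception layer might look like. The registry
(track_comm, tracked_count, tracked_at), the comm_t typedef and the
HAVE_MPI macro are names made up for illustration; only MPI_Comm_dup and
PMPI_Comm_dup are real MPI calls. Without HAVE_MPI the registry compiles
on its own, with a plain int standing in for the handle.

```c
#include <assert.h>
#include <stddef.h>

#ifdef HAVE_MPI                 /* hypothetical macro: define when building with MPI */
#include <mpi.h>
typedef MPI_Comm comm_t;
#else
typedef int comm_t;             /* stand-in handle so the sketch compiles without MPI */
#endif

#define MAX_TRACKED 64          /* arbitrary capacity for the sketch */

static comm_t tracked[MAX_TRACKED];
static size_t n_tracked = 0;

/* Remember a newly created communicator; returns 0 on success, -1 if full. */
static int track_comm(comm_t c)
{
    if (n_tracked == MAX_TRACKED)
        return -1;
    tracked[n_tracked++] = c;
    return 0;
}

static size_t tracked_count(void) { return n_tracked; }

/* Look up the i-th recorded handle. */
static comm_t tracked_at(size_t i)
{
    assert(i < n_tracked);
    return tracked[i];
}

#ifdef HAVE_MPI
/* Profiling-interface wrapper: forward to the real routine, then record the result. */
int MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
{
    int rc = PMPI_Comm_dup(comm, newcomm);
    if (rc == MPI_SUCCESS)
        (void)track_comm(*newcomm);
    return rc;
}
#endif
```

A real layer would also have to wrap the other communicator constructors
and MPI_Comm_free. Note that the registry lives inside the interception
layer, so application code still cannot reach these handles unless the
layer exposes them, which is exactly the problem.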

On Mon, 12 Sep 2011, Darius Buntinas wrote:

>
> On Sep 11, 2011, at 4:45 PM, Josh Hursey wrote:
>
>> On Fri, Sep 9, 2011 at 7:57 PM, Graham, Richard L. <rlgraham at ornl.gov> wrote:
>>>  I have been talking a reasonable amount with apps folks lately about this proposal, and their first response is often one of shock, as it is not quite what folks initially expect.  However, once one explains the background for the proposal, people tend to accept the notions.
>>
>> Can you explain a bit more about what they were shocked by? Was it the
>> general notion of application-involved FT, or the interface not being
>> what they expected/needed?
>>
>>
>>>  I agree that we need to define a mechanism for specifying return codes - uniform among surviving ranks, or locally determined types.  However, I do believe that we need to add the second set of collectives into 3.0.  We have mentioned this as an option for several years (actually since the inception of the group almost 4 years ago), but as a working group never did something explicit about this.  There is a reasonable number of apps folks that expect this type of collective communications.
>>
>>
>> It shouldn't be too difficult to specify/add, and a prototype
>> implementation would be trivial though maybe inefficient at first. Is
>> this something that we should put in the Stabilization proposal or
>> bring in as a separate ticket directly afterward?
>>
>> Separating the two keeps the initial proposal simpler and users can
>> get this functionality by wrapping existing collectives in
>> comm_validate calls (not efficient, but functional). Keeping them
>> together allows us to address a known interface optimization that
>> applications want in the first pass.
>
> It seems that this can be done in the app itself (or a stand-alone utility library):
>
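> int MYMPI_Bcast_ft(...) {
>    int ret;
>    while (1) {
>        /* Attempt the collective. */
>        ret = MPI_Bcast(...);
>        if (ret == MPI_ERR_RANK_FAILSTOP) {
>            /* A rank failed: re-validate the communicator, then retry. */
>            ret = MPI_Comm_validate(...);
>            if (ret) return ret;
>        } else {
>            /* Success, or an error that validation cannot repair. */
>            return ret;
>        }
>    }
> }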
>
> ...or something like that.  If an implementation implements its collectives in a FT manner, then the validate call can be close to a no-op (maybe a simple allreduce to make sure the failed_groups are the same).
>
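
A minimal, MPI-free sketch of the retry loop above, just to show the
control flow. Everything here is a hypothetical stand-in: sim_comm,
sim_bcast, sim_validate and the SIM_* codes model the proposed
MPI_Bcast / MPI_Comm_validate behavior in a single process, since the
real argument lists are left as "..." in the thread. The mock validate
acknowledges one failure per call, which is an assumption made only to
exercise the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; not the values any MPI implementation uses. */
enum { SIM_SUCCESS = 0, SIM_ERR_RANK_FAILSTOP = 1 };

/* Hypothetical single-process model of a communicator. */
typedef struct {
    int pending_failures;   /* failures not yet acknowledged by validate */
    int bcast_calls;        /* how many times the collective was attempted */
    int validate_calls;     /* how many times validate was called */
} sim_comm;

/* Mock collective: reports failstop while any failure is unacknowledged. */
static int sim_bcast(sim_comm *c)
{
    c->bcast_calls++;
    return c->pending_failures > 0 ? SIM_ERR_RANK_FAILSTOP : SIM_SUCCESS;
}

/* Mock validate: acknowledges one failure per call and always succeeds. */
static int sim_validate(sim_comm *c)
{
    c->validate_calls++;
    if (c->pending_failures > 0)
        c->pending_failures--;
    return SIM_SUCCESS;
}

/* Retry the collective until it stops reporting a failed rank. */
static int bcast_ft(sim_comm *c)
{
    assert(c != NULL);
    for (;;) {
        int ret = sim_bcast(c);
        if (ret != SIM_ERR_RANK_FAILSTOP)
            return ret;             /* success, or an unrelated error */
        ret = sim_validate(c);
        if (ret != SIM_SUCCESS)
            return ret;             /* validation itself failed: give up */
    }
}
```

With two unacknowledged failures the loop issues three broadcasts and
two validates before succeeding. Any code other than the failstop one is
passed straight back to the caller.
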
>>>  One other thing that came up yesterday (I have given 2 talks about the FT stuff in Kobe this week) is that it would be good to be able to specify multiple communicators to mpi_comm_validate(), especially since a common motif is to dup an existing communicator to isolate communication.  This is really the only way that I can think of to avoid unneeded global communication, if more than one communicator is of interest to the app.
>>
>>
>> I've had a couple applications ask about this as well.
>>
>> The group has talked about such an interface a few times now, and keeps
>> getting stuck on specifying the interface and semantics of such an
>> operation. Did they want a function that would take an array of
>> communicators to validate, or have the validation of one communicator
>> be inherited by all of the derived communicators?
>>
>> I think the array of communicators interface seems like the easiest to
>> use, and makes it easier to protect libraries. But that leads us to
>> the question, do all processes (union of processes from all
>> communicators specified?) have to supply the same set of
>> communicators? If not, do we run the risk of a circular dependency
>> causing the call to deadlock? We might be able to pass the
>> responsibility to avoid such problems off to the user.
>
> Yeah, I think it would be easiest to require that the same communicators are specified.  And I don't think that would be too difficult for the users to handle.
>
> We talked about "linking" communicators with attributes or something, such that when one communicator is validated, all of the linked communicators are also validated.  In that case we said that all of the linked communicators must be subsets (or dups) of the communicator they're linked to.
>
> We could have a similar "subset" restriction for the array-of-communicators-to-comm_validate option.
>
>> I'm game for trying to specify this again. The group decided to push
>> this off to a follow-on ticket because it can be achieved (though
>> inefficiently) by making a call to comm_validate for each of the
>> communicators, and we had trouble specifying it correctly. So the
>> question again is should we keep it as a separate ticket or add it to
>> the stabilization proposal?
>
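
For reference, the per-communicator fallback mentioned above might look
roughly like this. comm_handle and validate_fn are hypothetical
stand-ins for MPI_Comm and the proposed comm_validate call, whose
argument list is not settled in this thread; demo_validate exists only
to exercise the loop. The sketch assumes every process passes the same
communicators in the same order.

```c
#include <assert.h>
#include <stddef.h>

typedef int comm_handle;                        /* stand-in for MPI_Comm */
typedef int (*validate_fn)(comm_handle comm);   /* stand-in for comm_validate */

/*
 * Validate every communicator in turn.  Returns 0 if all calls succeed,
 * otherwise the first nonzero code seen.  The loop does not stop early.
 */
static int validate_each(const comm_handle *comms, size_t n, validate_fn validate)
{
    int first_err = 0;

    assert(validate != NULL);
    for (size_t i = 0; i < n; i++) {
        int ret = validate(comms[i]);
        if (ret != 0 && first_err == 0)
            first_err = ret;
    }
    return first_err;
}

/* Demo validator: a negative handle simulates a validation error (code 7). */
static int demo_calls = 0;

static int demo_validate(comm_handle comm)
{
    demo_calls++;
    return comm < 0 ? 7 : 0;
}
```

It deliberately keeps going after an error so that every process issues
the same sequence of calls; whether that is the right policy is an open
question. Each call is still a separate round of communication, which is
the inefficiency an array interface would remove.
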
> I'm kinda leaning toward adding it to the current proposal, but I don't want to delay the proposal.
> -d
>
>
>> Thoughts?
>>
>> -- Josh
>>
>>
>>>
>>> Rich
>>>
>>> -----Original Message-----
>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
>>> Sent: Friday, September 09, 2011 4:39 PM
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>> Subject: Re: [Mpi3-ft] fault-tolerant collectives
>>>
>>>
>>> OK, that makes sense.  I'll fix up that text.
>>>
>>> Thanks,
>>> -d
>>>
>>> On Sep 9, 2011, at 3:36 PM, Josh Hursey wrote:
>>>
>>>> I think that we want to say that an implementation may provide uniform
>>>> return codes from collectives, but is not required to do so. So this
>>>> makes them fault tolerant-ish - in the sense that they have to work
>>>> around failure to return error codes consistently, but not that they
>>>> finish the collective successfully even if new process failures emerge
>>>> during the collectives (that would undermine the semantic protections
>>>> we are putting in place).
>>>>
>>>> We should probably not say 'fault tolerant collectives' in the current
>>>> proposal so we don't confuse things. Maybe 'collectives that provide
>>>> uniform return codes'?
>>>>
>>>>
>>>> If we want truly fault tolerant collectives (like those described
>>>> below), then I think we should introduce a different set of functions.
>>>> The functions should probably return a group of processes that either
>>>> did or did not participate in creating the final result. Something
>>>> like:
>>>>  MPI_Reduce_ft(..., &group);
>>>>
>>>> I think the true fault tolerant collectives should be left to a follow-on
>>>> ticket since there is a need, but they can be easily added as a second
>>>> step.
>>>>
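
A single-process model of what such an interface could report, assuming
a sum reduction. reduce_sum_ft and its boolean participated array are
hypothetical; the array stands in for the output group of
MPI_Reduce_ft(..., &group), and the alive flags stand in for whatever
failure detection the implementation has.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sum the contributions of the ranks that are still alive.
 * participated[r] is set to true exactly for the ranks whose value went
 * into the result; *nparticipants receives how many there were.
 */
static long reduce_sum_ft(const long *values, const bool *alive, size_t nranks,
                          bool *participated, size_t *nparticipants)
{
    long sum = 0;

    assert(values != NULL && alive != NULL);
    assert(participated != NULL && nparticipants != NULL);

    *nparticipants = 0;
    for (size_t r = 0; r < nranks; r++) {
        participated[r] = alive[r];
        if (alive[r]) {
            sum += values[r];
            (*nparticipants)++;
        }
    }
    return sum;
}
```

The caller gets a result even though a rank is missing, and can inspect
the participant set to decide whether that result is acceptable.
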
>>>> -- Josh
>>>>
>>>> On Fri, Sep 9, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>>>>
>>>>> We discussed the option of allowing an implementation to provide fault tolerant (not just fault-aware) collectives.  The idea is that even when a process fails, collectives will continue to operate correctly (modulo the failed process).
>>>>>
>>>>> Does this imply that the communicator will never become collectively inactive?
>>>>>
>>>>> If no, then what's the point of ft collectives?
>>>>>
>>>>> If yes, then the application may never get notification that a process has failed and collectives are now running one short.  Is this what we really want?
>>>>>
>>>>> -d
>>>>> _______________________________________________
>>>>> mpi3-ft mailing list
>>>>> mpi3-ft at lists.mpi-forum.org
>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Joshua Hursey
>>>> Postdoctoral Research Associate
>>>> Oak Ridge National Laboratory
>>>> http://users.nccs.gov/~jjhursey
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>