[Mpi3-ft] fault-tolerant collectives

Mon Sep 19 01:54:10 CDT 2011

Just catching up here ....
On Sep 12, 2011, at 1:40 PM, Bronis R. de Supinski wrote:

> 
> It would be useful from a scalability standpoint.
> I have not thought through how much trouble it
> might cause from a programmability standpoint.
> It just seems the inherited mechanism is useful
> in costs. It might be useful to have an attribute
> that says to return an error (or some other notification)
> if it has been validated through inheritance. Then
> the library fixes could potentially just be local
> instead of repeating the interprocess work...

I believe that inheritance could be dangerous, which is why I suggested a list of communicators.  The validate call originated from apps wanting a place to check the state of the binary at the point where the call is made, so if an app has the inheritance properties wrong, it is out of luck.  Listing the communicators of interest is a bit less risky.  On the other hand, I suppose we can just make it the user's responsibility to "get it right".

> 
> On Mon, 12 Sep 2011, Darius Buntinas wrote:
> 
>> Are you saying that the app might want to validate a library's (internal) communicator for it?
>> 
>> In general, how safe would it be to do that if the library isn't expecting it?  I suppose we could do something like an inherited attribute that makes all subcommunicators automatically validate when their parent does.
>> 
>> -d
>> 
>> On Sep 12, 2011, at 12:12 PM, Bronis R. de Supinski wrote:
>> 
>>> 
>>> Suppose some communicators are used in libraries
>>> (duplication of MPI_COMM_WORLD is very common).
>>> How would any part of the code know about all of
>>> them? Sure, you could use the profiling interface
>>> to intercept calls to track them but usual application
>>> code would not have any record of the handles...
>>> 
>>> On Mon, 12 Sep 2011, Darius Buntinas wrote:
>>> 
>>>> 
>>>> On Sep 11, 2011, at 4:45 PM, Josh Hursey wrote:
>>>> 
>>>>> On Fri, Sep 9, 2011 at 7:57 PM, Graham, Richard L. <rlgraham at ornl.gov> wrote:
>>>>>> I have been talking a reasonable amount with apps folks lately about this proposal, and their first response is often one of shock, as it is not quite what folks initially expect.  However, once one explains the background for the proposal, people tend to accept the notions.
>>>>> 
>>>>> Can you explain a bit more about what they were shocked by? Was it the
>>>>> general notion of application involved FT, or the interface not being
>>>>> what they expected/needed?

[rich] The notion of different return codes for different processes is what really surprises them.  This is no different than how things are today, except that the errors_abort code, which is used basically 100% of the time, takes care of this.  Once the motivation is explained (minimizing the performance penalty, under the assumption of a low error rate), this makes sense.
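
A minimal sketch of the distinction, in plain C with no MPI dependency.  Every name here (rc_mode, RC_LOCAL, RC_UNIFORM, resolve_return_codes) is hypothetical and not part of the proposal, and the agreement step is simulated by a loop over an array of per-rank codes rather than by real communication.  It only shows why uniform codes cost an extra step that locally determined codes avoid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names: none of these come from MPI or from the proposal. */
enum rc_mode { RC_LOCAL, RC_UNIFORM };
enum { RC_SUCCESS = 0, RC_FAILSTOP = 1 };

/*
 * local[i] is the code surviving rank i observed on its own.
 * out[i] receives the code rank i would return under the chosen mode.
 */
void resolve_return_codes(enum rc_mode mode, const int *local,
                          int *out, size_t n)
{
    int agreed = RC_SUCCESS;
    size_t i;

    assert(local != NULL && out != NULL);

    /* Simulated agreement: if any rank saw a failure, all report it. */
    for (i = 0; i < n; i++) {
        if (local[i] != RC_SUCCESS)
            agreed = RC_FAILSTOP;
    }

    /* Local mode skips the agreement result and keeps each rank's view. */
    for (i = 0; i < n; i++)
        out[i] = (mode == RC_UNIFORM) ? agreed : local[i];
}
```

With local codes {success, failstop, success}, RC_LOCAL hands each rank its own code back, while RC_UNIFORM makes all three ranks report the failure.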

>>>>> 
>>>>> 
>>>>>> I agree that we need to define a mechanism for specifying return codes - uniform among surviving ranks, or locally determined types.  However, I do believe that we need to add the second set of collectives into 3.0.  We have mentioned this as an option for several years (actually since the inception of the group almost 4 years ago), but as a working group never did something explicit about this.  There is a reasonable number of apps folks that expect this type of collective communications.
>>>>> 
>>>>> 
>>>>> It shouldn't be too difficult to specify/add, and a prototype
>>>>> implementation would be trivial though maybe inefficient at first. Is
>>>>> this something that we should put in the Stabilization proposal or
>>>>> bring in as a separate ticket directly afterward?

[rich] The proposal needs to be sufficiently broad in scope that it satisfies a wide range of apps, so this does need to go into the first version.  However, this is not new - these sorts of collectives have been implemented before, so I view these as a small change - only adding some sort of flag indicating the type of collectives to be used at run-time.

>>>>> 
>>>>> Separating the two keeps the initial proposal simpler and users can
>>>>> get this functionality by wrapping existing collectives in
>>>>> comm_validate calls (not efficient, but functional). Keeping them
>>>>> together allows us to address a known interface optimization that
>>>>> applications want in the first pass.
>>>> 
>>>> It seems that this can be done in the app itself (or a stand-alone utility library)

[rich] This is true about all collectives.  The apps can implement these on their own.  The reason for including them in the standard is that they are used a lot, and are very expensive to optimize, so having library implementations that do this for all apps is beneficial.  A similar argument holds here for the fault-tolerant collectives.

>>>> 
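>>>> int MYMPI_Bcast_ft(...) {
>>>>  int ret;
>>>>  while (1) {
>>>>      ret = MPI_Bcast(...);
>>>>      if (ret == MPI_ERR_RANK_FAILSTOP) {
>>>>          /* A rank failed: re-validate the communicator, then retry. */
>>>>          ret = MPI_Comm_validate(...);
>>>>          if (ret) return ret;
>>>>      } else {
>>>>          /* Success or an unrelated error: hand it back to the caller. */
>>>>          return ret;
>>>>      }
>>>>  }
>>>> }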
>>>> 
>>>> ...or something like that.  If an implementation implements its collectives in a FT manner, then the validate call can be close to a no-op (maybe a simple allreduce to make sure the failed_groups are the same).
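
To make the control flow of the wrapper above concrete, here is a self-contained simulation in plain C.  It does not use MPI: sim_comm, sim_bcast, sim_comm_validate and the SIM_* codes are invented stand-ins, and the assumption that each validate call acknowledges exactly one failure is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-ins for the proposal's error codes. */
enum { SIM_SUCCESS = 0, SIM_ERR_RANK_FAILSTOP = 1 };

struct sim_comm {
    int pending_failures;  /* failures not yet acknowledged by validate */
    int validate_calls;    /* number of validate calls made so far */
};

/* Stub collective: reports a failure while unacknowledged ones remain. */
int sim_bcast(struct sim_comm *c)
{
    return c->pending_failures > 0 ? SIM_ERR_RANK_FAILSTOP : SIM_SUCCESS;
}

/* Stub validate: acknowledges one failure per call (an assumption). */
int sim_comm_validate(struct sim_comm *c)
{
    c->validate_calls++;
    if (c->pending_failures > 0)
        c->pending_failures--;
    return SIM_SUCCESS;
}

/* Same shape as the wrapper above: retry until no fail-stop error. */
int sim_bcast_ft(struct sim_comm *c)
{
    assert(c != NULL);
    for (;;) {
        int ret = sim_bcast(c);
        if (ret != SIM_ERR_RANK_FAILSTOP)
            return ret;            /* success or an unrelated error */
        ret = sim_comm_validate(c);
        if (ret != SIM_SUCCESS)
            return ret;            /* validate itself failed */
    }
}
```

Starting from two pending failures, the loop calls validate twice and then returns success.  With fault-tolerant collectives underneath, sim_bcast would rarely report a failure, which is the case where the validate call is close to a no-op.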
>>>> 
>>>>>> One other thing that came up yesterday (I have given 2 talks about the FT stuff in Kobe this week) is that it would be good to be able to specify multiple communicators to mpi_comm_validate(), especially since a common motif is to dup an existing communicator to isolate communication.  This is really the only way that I can think of to avoid unneeded global communication, if more than one communicator is of interest to the app.
>>>>> 
>>>>> 
>>>>> I've had a couple applications ask about this as well.
>>>>> 
>>>>> The group has talked about such an interface a few times now, and keeps
>>>>> getting stuck on specifying the interface and semantics of such an
>>>>> operation. Did they want a function that would take an array of
>>>>> communicators to validate, or have the validation of one communicator
>>>>> be inherited by all of the derived communicators?

[rich] Yes, I would just extend the current definition to take an array of communicators, and return an array of group handles.
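
A rough sketch of the shape such a call could take, in plain C without MPI.  Communicators and groups are modeled as bitmasks of ranks, and sim_comm_validate and sim_comm_validate_all are hypothetical names, not proposed signatures.  The array form here simply loops over the single-communicator form, which is the fallback an implementation always has; a real library could merge the communication.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Single-communicator form.  member_mask has bit r set if rank r belongs
 * to the communicator; alive_mask has bit r set if rank r is still alive.
 * The "failed group" is returned as a bitmask of failed members.
 */
int sim_comm_validate(unsigned member_mask, unsigned alive_mask,
                      unsigned *failed_mask)
{
    assert(failed_mask != NULL);
    *failed_mask = member_mask & ~alive_mask;
    return 0;
}

/* Array form: n communicators in, n failed groups out, one call. */
int sim_comm_validate_all(const unsigned *member_masks, size_t n,
                          unsigned alive_mask, unsigned *failed_masks)
{
    size_t i;

    assert(member_masks != NULL && failed_masks != NULL);

    /* Naive fallback: one independent validate per communicator. */
    for (i = 0; i < n; i++) {
        int ret = sim_comm_validate(member_masks[i], alive_mask,
                                    &failed_masks[i]);
        if (ret != 0)
            return ret;
    }
    return 0;
}
```

For example, with rank 1 failed, a world communicator of ranks 0-3 and a subcommunicator of ranks 0-1 both report rank 1 as failed, while a subcommunicator of ranks 2-3 reports an empty group.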

>>>>> 
>>>>> I think the array of communicators interface seems like the easiest to
>>>>> use, and makes it easier to protect libraries. But that leads us to
>>>>> the question, do all processes (union of processes from all
>>>>> communicators specified?) have to supply the same set of
>>>>> communicators? If not, do we run the risk of a circular dependency
>>>>> causing the call to deadlock? We might be able to pass the
>>>>> responsibility to avoid such problems off to the user.
>>>> 
>>>> Yeah, I think it would be easiest to require that the same communicators are specified.  And I don't think that would be too difficult for the users to handle.
>>>> 
>>>> We talked about "linking" communicators with attributes or something, such that when one communicator is validated, all of the linked communicators are also validated.  In that case we said that all of the linked communicators must be subsets (or dups) of the communicator they're linked to.
>>>> 
>>>> We could have a similar "subset" restriction for the array-of-communicators-to-comm_validate option.

[rich] I don't think there is an issue here with deadlock, and there is no need to restrict the list of communicators.  Just like any collective communication, all are required to call these in the same order - validate is no different.  If the library can detect some optimization opportunities, this is great, otherwise we are no worse off than having a series of independent calls.

Rich

>>>> 
>>>>> I'm game for trying to specify this again. The group decided to push
>>>>> this off to a follow on ticket because it can be achieved (though
>>>>> inefficiently) by making a call to comm_validate for each of the
>>>>> communicators, and we had trouble specifying it correctly. So the
>>>>> question again is should we keep it as a separate ticket or add it to
>>>>> the stabilization proposal?
>>>> 
>>>> I'm kinda leaning toward adding it to the current proposal, but I don't want to delay the proposal.
>>>> -d
>>>> 
>>>> 
>>>>> Thoughts?
>>>>> 
>>>>> -- Josh
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Rich
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
>>>>>> Sent: Friday, September 09, 2011 4:39 PM
>>>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>>> Subject: Re: [Mpi3-ft] fault-tolerant collectives
>>>>>> 
>>>>>> 
>>>>>> OK, that makes sense.  I'll fix up that text.
>>>>>> 
>>>>>> Thanks,
>>>>>> -d
>>>>>> 
>>>>>> On Sep 9, 2011, at 3:36 PM, Josh Hursey wrote:
>>>>>> 
>>>>>>> I think that we want to say that an implementation may provide uniform
>>>>>>> return codes from collectives, but is not required to do so. So this
>>>>>>> makes them fault tolerant-ish - in the sense that they have to work
>>>>>>> around failure to return error codes consistently, but not that they
>>>>>>> finish the collective successfully even if new process failures emerge
>>>>>>> during the collectives (that would undermine the semantic protections
>>>>>>> we are putting in place).
>>>>>>> 
>>>>>>> We should probably not say 'fault tolerant collectives' in the current
>>>>>>> proposal so we don't confuse things. Maybe 'collectives that provide
>>>>>>> uniform return codes'?
>>>>>>> 
>>>>>>> 
>>>>>>> If we want truly fault tolerant collectives (like those described
>>>>>>> below), then I think we should introduce a different set of functions.
>>>>>>> The functions should probably return a group of processes that either
>>>>>>> did or did not participate in creating the final result. Something
>>>>>>> like:
>>>>>>> MPI_Reduce_ft(..., &group);
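
One way to picture a collective that also reports who contributed, as a plain C simulation without MPI.  sim_reduce_sum_ft and its parameters are hypothetical; the per-rank contributions and liveness flags are passed in as arrays, and the returned "group" is a bitmask of the ranks whose values went into the result.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum the contributions of the ranks that are still alive.
 * contrib[r] is rank r's value, alive[r] is nonzero if rank r survived.
 * *participants gets bit r set for every rank included in *result.
 */
int sim_reduce_sum_ft(const int *contrib, const int *alive, size_t n,
                      int *result, unsigned *participants)
{
    size_t r;

    assert(contrib != NULL && alive != NULL);
    assert(result != NULL && participants != NULL);
    assert(n <= 32);  /* the bitmask holds at most 32 ranks */

    *result = 0;
    *participants = 0u;
    for (r = 0; r < n; r++) {
        if (alive[r]) {
            *result += contrib[r];
            *participants |= 1u << r;
        }
    }
    return 0;
}
```

The caller can compare the returned group against the communicator's membership to decide whether a result that is "one short" is acceptable.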
>>>>>>> 
>>>>>>> I think the true fault tolerant collectives should be left to a follow
>>>>>>> on ticket since there is a need, but can be easily added as a second
>>>>>>> step.
>>>>>>> 
>>>>>>> -- Josh
>>>>>>> 
>>>>>>> On Fri, Sep 9, 2011 at 4:17 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>>>>>>>> 
>>>>>>>> We discussed the option of allowing an implementation to provide fault tolerant (not just fault-aware) collectives.  The idea is that even when a process fails, collectives will continue to operate correctly (modulo the failed process).
>>>>>>>> 
>>>>>>>> Does this imply that the communicator will never become collectively inactive?
>>>>>>>> 
>>>>>>>> If no, then what's the point of ft collectives?
>>>>>>>> 
>>>>>>>> If yes, then the application may never get notification that a process has failed and collectives are now running one short.  Is this what we really want?
>>>>>>>> 
>>>>>>>> -d
>>>>>>>> _______________________________________________
>>>>>>>> mpi3-ft mailing list
>>>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>>>> hxxp://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Joshua Hursey
>>>>>>> Postdoctoral Research Associate
>>>>>>> Oak Ridge National Laboratory
>>>>>>> hxxp://users.nccs.gov/~jjhursey
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
>