[mpiwg-sessions] Next meeting - *Tuesday* next week - midday Eastern US time

Ralph H Castain rhc at open-mpi.org
Thu Aug 9 10:42:34 CDT 2018



> On Aug 9, 2018, at 7:53 AM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk> wrote:
> 
> Hi Ralph,
> 
> I particularly like the self-fulfilling paradox at the end there: if the group construct fails then I won’t even call group construct, which guarantees that it’ll fail. Of course it’s not really a paradox because there are multiple processes involved but I can see why your head is hurting.

Yes, it is easy to rapidly get trapped in the weeds here.

> 
> Taking a further step back, scalable group construction (without a priori knowledge of membership and with the possibility of errors) requires a distributed graph specification of the membership and handling of errors. If each process only specifies its neighbours, and only handles errors from its neighbours, then the event explosion can be contained to a neighbourhood.
> 
> For example, if all processes specified the group membership as {back_proc, forward_proc} then a failed process will cause a NOTIFY_TERMINATION event in its back_proc and its forward_proc but no others, irrespective of the total size of the intended group. Those two processes must then decide what to do, e.g. link to each other (form a ring with one fewer processes than originally intended) or invite a new process (hopefully the same one ;) which will fix the originally intended ring). A “burn the world” decision from either of these two processes must then be propagated to all other participants (not necessarily around the ring, an OOB method works too). None of the other processes can have left/completed their group construct yet because not all members have given their consent (implies: connect up local neighbourhood, wait for go signal, complete locally).
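
A rough sketch of that ring example, in Python purely for illustration. This is a toy model and not PMIx code: ring_neighbours, notified_on_failure and relink are made-up names, and the "ring" is just an ordered list of ranks. It only shows that the set of notified processes stays at two regardless of ring size, and what the "link to each other" recovery looks like.

```python
# Toy model of the ring example; not PMIx code. All names are hypothetical.

def ring_neighbours(ring, proc):
    """Return (back_proc, forward_proc) for proc in the ordered ring."""
    i = ring.index(proc)
    return ring[i - 1], ring[(i + 1) % len(ring)]


def notified_on_failure(ring, failed):
    """Only processes that named the failed process as a member hear about it."""
    back, forward = ring_neighbours(ring, failed)
    return {back, forward}


def relink(ring, failed):
    """One possible recovery: the two neighbours link to each other."""
    return [p for p in ring if p != failed]


ring = list(range(8))
print(sorted(notified_on_failure(ring, 3)))  # the two neighbours of rank 3
repaired = relink(ring, 3)
print(ring_neighbours(repaired, 2))          # rank 2 now points forward to rank 4
```

The other recovery Dan mentions (inviting a replacement) would keep the list the same length and swap one entry instead.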
> 
> If the user chooses to specify all other processes at all processes, then they are explicitly requesting to be notified about all events for which they have registered from all other processes. This may cause explosions for large groups.
> 
> Note, the user does not actually have to specify the intended size of the group when constructing it. The size and topology of the (connected portion of the) graph can be determined once the group exists, by doing collective operations using the group. Collective operations can be done on the entire (connected portion of the) graph by each process interacting only with its neighbours. Having discovered information about processes outside of one’s local neighbourhood, direct communication can be done between any pair or sub-group of processes. Thus, a fully connected MPI communicator can (if required) be built up from a MPI_DIST_GRAPH topology specification.
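
To illustrate the point about discovering size and membership after the fact: the sketch below uses simple flooding, where every process repeatedly merges what its neighbours knew in the previous round. That algorithm is my assumption for illustration (the text above does not say how the collective would be implemented), and discover_members and the dict-of-neighbours input are hypothetical names, not any MPI or PMIx interface.

```python
# Toy model: each process starts out knowing only itself and its neighbours.
# Flooding is an assumed technique here, chosen because it is easy to follow.

def discover_members(neighbours):
    """neighbours: dict mapping proc -> list of neighbour procs (symmetric).

    Returns a dict mapping each proc to the set of procs it ends up knowing.
    """
    known = {p: {p, *ns} for p, ns in neighbours.items()}
    changed = True
    while changed:
        changed = False
        # Snapshot so each process only sees what neighbours knew last round.
        snapshot = {p: set(s) for p, s in known.items()}
        for p, ns in neighbours.items():
            for n in ns:
                new = snapshot[n] - known[p]
                if new:
                    known[p] |= new
                    changed = True
    return known


# A ring of six processes, each specifying only {back_proc, forward_proc}.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
known = discover_members(ring)
print(len(known[0]))  # size of the connected portion as seen from rank 0
```

Once every process holds the full member set, direct communication between any pair becomes possible, which is the step towards a fully connected communicator.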

I’m reluctant to customize PMIx Groups for just the graph use-case, as other programming models also have interest (and might not fit that case), and it isn’t clear to me that everyone in MPI will want to base themselves on a graph-based group approach. What we could do, though, is provide attributes to support the use model you describe and let the PMIx Group implementation deal with it as a special case.

For the more general case, I think we just need to ensure that there is a clean failure path that ensures the user gets out of the operation (i.e., doesn’t hang or incorrectly think the group exists) when failures occur. We can provide failure notification and recovery methods - we just need to acknowledge that these only really work in the (expected) case where failures are relatively rare events. After all, if lots of processes are failing or refusing to join the proposed group during a construct operation, then you probably need to do some triage on your cluster and/or your application!

If we take that approach, then we can limit notifications to the “leader” and let it decide what to do about it. If the leader fails, then we could just have PMIx automatically terminate group construction, issuing “cancel” events to all other participants.
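
As a sketch of that default policy (illustrative Python only, not the PMIx implementation; route_failure and the event strings are hypothetical):

```python
# Toy routing rule for a failure during group construction.
# Names and event strings are made up for illustration.

def route_failure(failed, leader, participants):
    """Return a list of (recipient, event) pairs for one failed process."""
    if failed == leader:
        # Leader is gone: terminate construction and tell everyone else.
        return [(p, "cancel") for p in participants if p != failed]
    # Otherwise only the leader is told, and it decides what to do next.
    return [(leader, "notify_termination")]


procs = [0, 1, 2, 3]
print(route_failure(2, 0, procs))       # one event, to the leader only
print(len(route_failure(0, 0, procs)))  # one "cancel" per surviving participant
```

The point is that the event count for a non-leader failure stays at one no matter how large the group is.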

For flexibility, we can add an attribute that modifies that behavior and add a new event to notify other group participants of the leader’s failure (we know the leader already agreed to join the group!). We can then add an attribute by which a process can declare itself the new leader, thereby causing an event to the rest of the group participants to update their leader assignment (this is implemented today as a broadcast and so scales relatively well). The new leader is the one that will decide what to do about giving up on constructing the group.
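
A minimal sketch of that optional hand-off, again as a toy Python model. Participant, leader_failed, declare_leader and the event names are all hypothetical; the real attribute and event names have not been defined.

```python
# Toy model of leader hand-off; none of these names come from PMIx.

class Participant:
    def __init__(self, rank, leader):
        self.rank = rank
        self.leader = leader
        self.events = []

    def on_event(self, event, source):
        self.events.append((event, source))
        if event == "leader_changed":
            self.leader = source  # update the local leader assignment


def broadcast(participants, event, source):
    """Deliver event to every participant except the source itself."""
    for p in participants:
        if p.rank != source:
            p.on_event(event, source)


def leader_failed(participants, old_leader):
    """Notify the survivors instead of cancelling the construct."""
    survivors = [p for p in participants if p.rank != old_leader]
    broadcast(survivors, "leader_failed", old_leader)
    return survivors


def declare_leader(participants, new_leader):
    """A process declares itself leader; everyone else is told."""
    for p in participants:
        if p.rank == new_leader:
            p.leader = new_leader
    broadcast(participants, "leader_changed", new_leader)


procs = [Participant(r, leader=0) for r in range(4)]
survivors = leader_failed(procs, old_leader=0)
declare_leader(survivors, new_leader=2)
print([p.leader for p in survivors])  # every survivor now agrees on rank 2
```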

Since we cache notifications, we know that any “cancel” event received by a proc prior to registering for it will still be delivered. We then specify in the standard that procs should register for all group-related events prior to engaging in any PMIx Group operations. This ensures that the app knows about the “cancel” before calling construct, and that procs which call the blocking form of construct prior to the event arriving will still have a mechanism for getting out of the operation.
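
The caching behaviour can be sketched like this. It is a toy model under my own assumptions (EventCache and its methods are hypothetical, and the real library's caching has more to it), but it shows why registering before calling construct is enough to learn about an earlier "cancel".

```python
# Toy model of cached event delivery; not the PMIx event API.

class EventCache:
    def __init__(self):
        self._pending = []   # events that arrived before a handler existed
        self._handlers = {}  # event name -> callback

    def deliver(self, event, payload):
        handler = self._handlers.get(event)
        if handler is None:
            self._pending.append((event, payload))  # hold it for later
        else:
            handler(payload)

    def register(self, event, handler):
        self._handlers[event] = handler
        still_pending = []
        for name, payload in self._pending:
            if name == event:
                handler(payload)  # replay the cached event on registration
            else:
                still_pending.append((name, payload))
        self._pending = still_pending


cache = EventCache()
cancelled = set()

cache.deliver("cancel", "groupA")        # arrives before the app registers
cache.register("cancel", cancelled.add)  # app registers for group events first

if "groupA" in cancelled:
    print("skip construct for groupA")   # the app never enters the operation
```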

To make the registration easier, PMIx could add an ability to register for a “class” of events - e.g., register for the “group” class of events. This would provide for future compatibility should new group-related events get added. You currently have to specify the events you want to know about.
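
One way such class registration could behave, as a hedged sketch (EventRegistry, define_event and register_class are hypothetical names, and the event strings are placeholders). The design choice being illustrated is that class membership is looked up at dispatch time, so events added to the class later still reach handlers registered earlier.

```python
# Toy model of registering for a class of events; not the PMIx API.

class EventRegistry:
    def __init__(self):
        self.classes = {}         # class name -> set of event names
        self.class_handlers = {}  # class name -> list of callbacks

    def define_event(self, cls, event):
        self.classes.setdefault(cls, set()).add(event)

    def register_class(self, cls, handler):
        self.class_handlers.setdefault(cls, []).append(handler)

    def dispatch(self, event, payload=None):
        """Deliver event to every handler of every class containing it."""
        delivered = 0
        for cls, events in self.classes.items():
            if event in events:
                for handler in self.class_handlers.get(cls, []):
                    handler(event, payload)
                    delivered += 1
        return delivered


reg = EventRegistry()
reg.define_event("group", "cancel")
reg.define_event("group", "leader_failed")

seen = []
reg.register_class("group", lambda event, payload: seen.append(event))

# A group event defined after the app registered still reaches it.
reg.define_event("group", "member_left")
reg.dispatch("member_left")
print(seen)
```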

Make sense?
Ralph

> 
> Cheers,
> Dan.
> Dr Daniel Holmes PhD
> Applications Consultant in HPC Research
> d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 3415, JCMB, The King’s Buildings, Edinburgh, EH9 3FD
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
> 
>> On 9 Aug 2018, at 15:00, Ralph H Castain <rhc at open-mpi.org <mailto:rhc at open-mpi.org>> wrote:
>> 
>> So let’s step back for a moment and look at the failed construct problem in more depth. There are a couple of issues that come to mind.
>> 
>> First, do we really want to send the NOTIFY_TERMINATION event to all participants, or just to the leader? If the group under construction is large and we see a number of failures, then we could wind up in an event “storm”. If we alert only the leader, then it raises the question: what if the leader is the one who fails? Do we need a mechanism by which someone else can declare themselves to be the “leader”?
>> 
>> It isn’t too difficult for us to examine the returned results array from an event handler, though I’d want to (a) generalize it a bit so we can (b) limit how much of that we do to avoid making the event notification code explode with special cases. If we go that route, which seems the right thing to me, then we again have a couple of choices:
>> 
>> * if the NOTIFY_TERMINATION event is only going to the leader, then we would provide the ability for the leader to declare “burn the world” and send a corresponding event to all participants. It does create a bit of a race condition as a remote participating proc may get the event prior to calling Group_join and thus (a) has no idea what the event is talking about and (b) would have to retain the cancellation notice pending the call to Group_join so it could return an error. Doable - just a tad tricky, and that race condition is difficult to test.
>> 
>> * if the event goes to all participants, then they could locally decide to abandon the group. If they have already joined, they could leave. If not, then they could simply decline the invite. Again, there are race conditions that could bite us (particularly in multi-threaded apps), but maybe we resolve some of those by imposing requirements on the app.
>> 
>> Now that was all based on the async construct - but what do we do about a blocking call to PMIx_Group_construct? The only thing I can think of would be to provide an attribute in the results array that tells the PMIx library to “kick me out of the current operation” and includes some tag(s) to indicate what operation it is talking about. We actually talked about that at some length during the last in-person PMIx devel meeting and came up with a scheme to support such a request (hasn’t been implemented yet), so this could work. However, it again creates that race condition for procs that receive the TERMINATION event prior to calling “construct” as the operation hasn’t been initiated yet. I guess we could just put the burden on the app to realize that it got a group_construct termination event and should therefore not call “construct” on that group?
>> 
>> My head is beginning to hurt and I’ve probably confused folks anyway, so best to stop here and wait for input.
>> Ralph
>> 
>> 
>>> On Aug 9, 2018, at 2:26 AM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>> wrote:
>>> 
>>> Hi Ralph,
>>> 
>>> Thanks for the updates, I’ll take a look in a moment (after coffee).
>>> 
>>> Technically, a cancel API in PMIX *could* be used with a blocking group construction by calling the cancel inside an event callback. For example, an app could keep a count of how many times PMIX_GROUP_NOTIFY_TERMINATION was called for this group construction and invite a new process for the first N times but cancel the operation in response to any subsequent event(s).
>>> 
>>> MPI is trying to get rid of its cancel API - we deprecated cancel for point-to-point send because it is fundamentally broken. However, cancel for point-to-point receive is still valid and useful. The problem with cancel is always the race between the “all is OK, go ahead” and the “whoa, stop that” signals. With a receive in MPI, the choice of which will succeed (receive or cancel) is always a local decision and can therefore be made atomic and consistent. The choice between send and cancel is a distributed decision, which cannot be atomic, and always suffers from a race. One way to avoid this race in a PMIX group cancel would be to specify that it is only valid from within an event callback, e.g. by exposing it as an in-out/by-ref parameter in the callback itself (not as a separate function call). PMIX could then examine this parameter (set by the user, during the event callback) when the callback returns. It’s a binary/boolean choice between “I handled it, carry on” and “I panicked, burn the world”.
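
The count-based cancel described in the two paragraphs above could look roughly like this. It is a sketch under stated assumptions: Python has no by-ref parameters, so the in-out flag is modelled as the callback's return value, and Decision, make_termination_handler and invite are hypothetical names rather than anything in PMIx.

```python
# Toy model of "cancel only from inside the event callback".
from enum import Enum


class Decision(Enum):
    HANDLED = "I handled it, carry on"
    BURN = "I panicked, burn the world"


def make_termination_handler(max_replacements, invite):
    """Build a callback that replaces the first N failures, then gives up."""
    seen = 0

    def handler(failed_proc):
        nonlocal seen
        seen += 1
        if seen <= max_replacements:
            invite(failed_proc)   # ask for a replacement process
            return Decision.HANDLED
        return Decision.BURN      # library would cancel the construct

    return handler


invited = []
handler = make_termination_handler(2, invited.append)
for failed in (5, 6, 7):
    print(failed, handler(failed).name)
```

Because the decision is read when the callback returns, there is no separate cancel call that could race with completion of the construct.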
>>> 
>>> Is it useful? Well all systems have finite resources to use as replacements, so eventually this operation must fail. This type of cancel allows the application to choose when to give up based on how many things happened rather than just how many seconds elapsed.
>>> 
>>> Cheers,
>>> Dan.
>>> Dr Daniel Holmes PhD
>>> Applications Consultant in HPC Research
>>> d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>
>>> Phone: +44 (0) 131 651 3465
>>> Mobile: +44 (0) 7940 524 088
>>> Address: Room 3415, JCMB, The King’s Buildings, Edinburgh, EH9 3FD
>>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>>> 
>>>> On 9 Aug 2018, at 03:54, Ralph H Castain <rhc at open-mpi.org <mailto:rhc at open-mpi.org>> wrote:
>>>> 
>>>> I have updated the web page to reflect the comments. Let me know what you think and about the “cancel” API.
>>>> 
>>>>> On Aug 8, 2018, at 5:40 PM, Ralph H Castain <rhc at open-mpi.org <mailto:rhc at open-mpi.org>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Aug 8, 2018, at 4:14 PM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>> wrote:
>>>>>> 
>>>>>> Hi Ralph,
>>>>>> 
>>>>>> I was nodding whilst reading,
>>>>> 
>>>>> Great!
>>>>> 
>>>>>> until I got to PMIX_GROUP_LEAVE_REQUEST (Destruct procedure, PMIx_Groups).
>>>>>> 
>>>>>> There are situations where a process will leave without the luxury of requesting and being patient first, e.g. faults (handled as termination, I know, bear with me a moment). If this event was instead PMIX_GROUP_LEFT, then processes would be written to be able to cope with sudden exits in other processes. They have to be written like that anyway because of PMIX_GROUP_NOTIFY_TERMINATION. This event simply distinguishes "the process called PMIx_GROUP_LEAVE" from "the RM figured out a process stopped executing (normally or abnormally)". Is such a distinction useful?
>>>>> 
>>>>> It’s a good point. My thought was to provide a “clean” way of dynamically leaving a group as opposed to just pulling out. On the other hand, we do need apps to be prepared for unexpected termination - so it isn’t clear that there is any real benefit. I have no issue with making this change.
>>>>> 
>>>>>> 
>>>>>> In terms of outstanding/in-progress collective operations, just state that calling PMIX_GROUP_LEAVE is not allowed unless no such operations are in flight.
>>>>> 
>>>>> I think putting a requirement that no collective op can be in progress is unenforceable, especially if you take the position that leaving is the same as unexpected termination - i.e., programs need to be written in a way that can adapt to terminations or departures. We can provide an event indicating that departure occurred and user apps need to register for it and decide for themselves how to respond if in a user collective. The PMIx server can adjust any ongoing PMIx collective (e.g., PMIx_Fence) without user intervention. We currently error-out from such operations, but we can provide an attribute to indicate the operation should “self-heal” and proceed to completion.
>>>>> 
>>>>> 
>>>>>> The potential race between a process calling PMIX_GROUP_LEAVE and other process(es) in the group starting a collective operation should not happen in a well-defined program. Also, if PMIX_GROUP_NOTIFY_TERMINATION can state "collective operations will be adjusted appropriately" then why can’t PMIX_GROUP_LEAVE say that too?
>>>>> 
>>>>> No problem - it can certainly do so.
>>>>> 
>>>>>> 
>>>>>> For PMIX_GROUP_JOIN, can the leader process give up creation of the group and somehow tell PMIX to stop trying?
>>>>> 
>>>>> Sure - it can do so by setting the PMIX_TIMEOUT attribute. We could provide a “cancel” API as well, but that would require that you used the non-blocking form of PMIx_Group_construct as otherwise there would be no way to call it. Would a “cancel” API be of benefit?
>>>>> 
>>>>>> If so, then processes that accepted a join request should be informed that the group is never going to be constructed, i.e. they should stop waiting for the callback/return of the blocking function. Thus, “once the group has been completely constructed” could be tempered with “or the group construction fails”.
>>>>> 
>>>>> Agreed - will update.
>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> Dan.
>>>>>> Dr Daniel Holmes PhD
>>>>>> Applications Consultant in HPC Research
>>>>>> d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>
>>>>>> Phone: +44 (0) 131 651 3465
>>>>>> Mobile: +44 (0) 7940 524 088
>>>>>> Address: Room 3415, JCMB, The King’s Buildings, Edinburgh, EH9 3FD
>>>>>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>>>>>> 
>>>>>>> On 8 Aug 2018, at 23:04, Ralph H Castain <rhc at open-mpi.org <mailto:rhc at open-mpi.org>> wrote:
>>>>>>> 
>>>>>>> Hi folks
>>>>>>> 
>>>>>>> I have updated the PMIx Group web page to capture the discussion of the prior meeting plus some subsequent thoughts:
>>>>>>> 
>>>>>>> https://pmix.org/pmix-standard/pmix-groups/ <https://pmix.org/pmix-standard/pmix-groups/>
>>>>>>> 
>>>>>>> I’ll try to put some initial implementation behind it before the meeting, so please feel free to chime up with any thoughts.
>>>>>>> Ralph
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 6, 2018, at 11:38 AM, Ralph H Castain <rhc at open-mpi.org <mailto:rhc at open-mpi.org>> wrote:
>>>>>>>> 
>>>>>>>> Looks like I can free some time up this week for groups - will try to update later this week 
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>> On Aug 6, 2018, at 11:05 AM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>> wrote:
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> The next meeting for the Sessions WG will be *Tuesday 14th Aug 2018* at 12pm Eastern US time.
>>>>>>>>> 
>>>>>>>>> Note the change of day and time. This is a one-off change due to vacation time.
>>>>>>>>> 
>>>>>>>>> The connection details for the call will be sent out on this list nearer the time.
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>> Dan.
>>>>>>>>> Dr Daniel Holmes PhD
>>>>>>>>> Applications Consultant in HPC Research
>>>>>>>>> d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>
>>>>>>>>> Phone: +44 (0) 131 651 3465
>>>>>>>>> Mobile: +44 (0) 7940 524 088
>>>>>>>>> Address: Room 3415, JCMB, The King’s Buildings, Edinburgh, EH9 3FD
>>>>>>>>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> mpiwg-sessions mailing list
>>>>>>>>> mpiwg-sessions at lists.mpi-forum.org <mailto:mpiwg-sessions at lists.mpi-forum.org>
>>>>>>>>> https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions <https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions>
