[mpiwg-sessions] Sessions WG - meet 1/29/24

Holmes, Daniel John daniel.john.holmes at intel.com
Thu Feb 1 13:52:56 CST 2024


Hi Ralph,

Thanks for the question. I’m not sure I fully see the difficulty.


  1.  Many failures in one “event” means that when we go around the while loop and try to rebuild, we discover a whole bunch of failures all at once. The MPI_Session_get_proc_failed procedure gives a big group with size >> 1. We exclude all of those processes in one group-difference call and we’re back to a simple case. I’m assuming the detectors are well-behaved and all of them detect the same event and notify the same failures. The timing might not be perfect, so some detectors might give a big group, whereas others give a small group. That will lead to some processes discovering many failures during the operation – that’s the next case.
  2.  Many failures being detected during the operation definitely slows it down, but I don’t see any kind of undecidable or livelock/deadlock outcome. It looks like O(p log(p)) is a sensible worst-case asymptotic bound, even when the failures happen one at a time at just the wrong moments. One event taking out a bunch of processes at the same time seems to be an easier case than that. Eventually every node of the spanning tree will be traversed, either directly or by proxy via another live process, so all the failures will be discovered by all live processes. The agreement step catches the case where live processes fail after they’ve finished collaborating with the other processes (when they believe they have complete information but before they return to the user).
Every failed involved process will be discovered by a live involved process unless every process that believed the failed process was involved in the operation also fails before or during the operation; in that case, it no longer matters. The survivors eventually come to consensus on the membership of the survivors group by excluding any and all processes that the survivors cannot unanimously agree are alive and involved.

The key rule here is that “if any process believes that procF has failed, then all other processes must accept that decision and treat procF as failed also”, which is conservative but fits with the process fail-stop model.
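
To make that rule concrete, here is a minimal sketch (not proposed text) of how a process might fold a peer’s failure knowledge into its own view. It assumes the MPI_Session_get_proc_failed procedure proposed further down this thread, and it assumes the peer’s knowledge arrives as an MPI_Group (how that group is marshalled on the wire is left out):

// hedged sketch: conservative merge of failure knowledge
#include <mpi.h>

void merge_failure_knowledge(MPI_Session session, MPI_Group peer_failed,
                             MPI_Group world, MPI_Group *survivors)
{
     MPI_Group local_failed, all_failed;
     MPI_Session_get_proc_failed(session, &local_failed);     // local detector snapshot (proposed API)
     MPI_Group_union(local_failed, peer_failed, &all_failed); // accept the peer's opinion as well as our own
     MPI_Group_difference(world, all_failed, survivors);      // survivors = everyone not known-failed by anyone
     MPI_Group_free(&local_failed);
     MPI_Group_free(&all_failed);
}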

The only serious problem I can foresee is bifurcation: two or more closed compact graphs of processes that are disjoint, separately coming to consensus that they are the only survivors because every link between those disjoint groups has been severed by process failures. We discussed a similar situation (network bifurcation) at length in a previous meeting, but there was no consensus on the correct response from MPI. One school of thought permits all the disjoint groups to carry on, oblivious of the others. Another school of thought dictates that every group should attempt to destroy all the others to ensure that it really is the only surviving group. Neither outcome is ideal – so let’s pretend bifurcation does not happen! We can hope to get away with that unsavoury decision for the network bifurcation case because “network bifurcation” is not a process fail-stop fault and is, therefore, out of scope for ULFM. However, bifurcation caused by a set of strategically placed process failures is definitely in scope and just as tricky.

Do you see additional problematic situations?

Best wishes,
Dan.

From: mpiwg-sessions <mpiwg-sessions-bounces at lists.mpi-forum.org> On Behalf Of Ralph Castain via mpiwg-sessions
Sent: Thursday, February 1, 2024 7:13 PM
To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
Cc: Ralph Castain <rhc at pmix.org>
Subject: Re: [mpiwg-sessions] Sessions WG - meet 1/29/24

Just curious - how do you figure to handle the cascade-of-failure scenario? This is by far the most common as you either lose a node (which means the detection of individual failure for each proc on that node) or you lose a switch (and therefore get report of individual failure for all procs communicating over that device). Kind of rare for a proc to just die on its own (except for contrived tests, of course), though not impossible.

Ralph



On Feb 1, 2024, at 10:02 AM, Holmes, Daniel John via mpiwg-sessions <mpiwg-sessions at lists.mpi-forum.org> wrote:

Hi all,

WORDS to capture our discussion on Monday 29th Jan.

The basic idea in the example code I gave is to detect process fail-stop faults, shrink the active communicator to exclude the failed processes, and then carry on.

The while loop captures the "carry on" part: we only exit when we hit the break statement (or when we suffer a process fail-stop fault ourselves).
The additional code block captures the "shrink" part by creating a new communicator that excludes failed processes.

The MPI_Group_from_session_pset calls will all produce identical groups, even after processes have failed. This is a local procedure and there's no reason for it to return an error. We should guarantee that it never returns errors of class MPI_ERR_PROC_FAILED.
The MPI_Session_get_proc_failed calls give a snapshot of the knowledge contained in the local detector. There is no communication or nonlocal dependence here. This is a (new) local procedure and there's no reason for it to return an error. We should guarantee that it never returns errors of class MPI_ERR_PROC_FAILED.
The group manipulation procedures are existing MPI and will work exactly the same as they always have done. These are local procedures and there's no reason for them to return an error. We should guarantee that they never return errors of class MPI_ERR_PROC_FAILED.

The devils are always in the details. The details in this case are all in the implementation of MPI_Comm_create_from_group.

The MPI_Comm_create_from_group procedure must handle some difficult cases:
1. the potential for failed processes to exist in the group that is passed in by the calling MPI process
2. the potential for the group passed in to be different to the groups passed in by other MPI processes
3. the potential for the other involved MPI processes to fail during the operation

1. failed before the operation
If this MPI process attempts to communicate with a failed process, its detector must eventually detect that, otherwise the detector is broken!
This means that the failure of the other MPI process is discovered during the operation -- see point (3).

2. different groups at different processes
There are several sub-cases here:
a) procA's group includes procB but procB has failed (any time before procA communicates with it)
b) procA's group contains procB but procB's group does not include procA
c) procA's group contains procB and procC, but procB's group includes procA and not procC
d) procA's group and procB's group are identical but one of them later discovers an (a|b|c) problem
Current shrink implementations rely on shared knowledge of the survivors so each involved process can independently create the same spanning tree as all the other involved processes.
This simplifies the design and results in a better worst-case performance scaling, at least asymptotically [ED: pls check]
All of the above sub-cases can happen during a shrink operation, so intuitively this is no worse than that.
We have two broad categories of approach here:
- any problem results in an error, return immediately without creating the output communicator
- any FT problem is dealt with internally; some kind of best-effort communicator is created
We probably want a resilient algorithm that doesn’t have dreadful scaling.
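
As one ingredient of the resilient approach sketched below, each process needs a cheap way to tell whether a peer holds the same group view (the "compare hashes" steps). Here is one possible realisation, a sketch only: it assumes that ranks in the "mpi://world" group can serve as stable global identifiers, and it uses FNV-1a purely as an arbitrary example hash:

// hedged sketch: hash this process's view of the candidate group membership
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

uint64_t hash_group_view(MPI_Group candidate, MPI_Group world)
{
     int n;
     MPI_Group_size(candidate, &n);
     int *mine   = malloc(n * sizeof(int));
     int *wranks = malloc(n * sizeof(int));
     for (int i = 0; i < n; i++) mine[i] = i;
     MPI_Group_translate_ranks(candidate, n, mine, world, wranks); // world ranks as global IDs
     uint64_t h = 0xcbf29ce484222325ULL;                           // FNV-1a offset basis
     for (int i = 0; i < n; i++) {
           h ^= (uint64_t)wranks[i];
           h *= 0x100000001b3ULL;                                  // FNV-1a prime
     }
     free(mine);
     free(wranks);
     return h;
}

Two processes whose hashes differ must resolve the difference by exchanging the membership itself (or a compressed delta); matching hashes are taken as matching views, accepting the usual vanishingly small collision risk.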

3. failures during the operation
We can just do an agreement at the end to catch failures that happen after the critical moment, i.e. after the other processes have already finished communicating with the process that then fails.

Resilient algorithm:
Start optimistic: Create a spanning tree from the group members you have been given in the call.
Be eager to help: Always listen for incoming protocol messages related to communicator creation.
Attempt communication with your direct children;
  if new failure detected, update the local group, update the local detector, fix the spanning tree (skip that child, but add that child's children as your direct children)
  if a child reports a different group (compare hashes), figure out the difference, update the local detector, fix the spanning tree (recreate it using only uncontacted undead processes)
Attempt communication with your parent;
  if new failure detected, update the local group, update the local detector, fix the spanning tree (skip the parent, but add that parent's parent as your direct parent)
  if the parent reports a different group (compare hashes), figure out the difference, update the local detector, fix the spanning tree (recreate it using only uncontacted undead processes)
Assume it all worked beautifully but execute MPI_Comm_agree to make certain.
  if new failure detected, update the local group, update the local detector, start again
  if the agreement succeeds without discovering new failures, we're done!
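
Here is a minimal sketch of that final "make certain" step, written against the MPIX_Comm_agree interface from the Open MPI ULFM prototype (the proposal may spell it MPI_Comm_agree); everything else is illustrative, not proposed text:

// hedged sketch: unanimous confirmation before handing the new communicator back
#include <mpi.h>
#include <mpi-ext.h>   // MPIX_Comm_agree in the Open MPI ULFM prototype

int confirm_creation(MPI_Comm tentative, int saw_new_failure)
{
     int flag = saw_new_failure ? 0 : 1;          // my local verdict for this round
     int rc = MPIX_Comm_agree(tentative, &flag);  // bitwise AND of flag across live processes
     if (rc != MPI_SUCCESS || flag == 0) {
           return 0;  // someone (possibly me) saw a new failure: update, rebuild, go around again
     }
     return 1;        // unanimous: safe to return the new communicator to the user
}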

Best-case asymptotic scaling is O(log(p)): a single spanning-tree traversal of p processes.
Worst case is O(sum_{i=1}^{p-1} log(p-i)) = O(log((p-1)!)) < O(p log(p)): no more than p-1 tree traversals, over a tree that shrinks each time.

Best wishes,
Dan.

From: mpiwg-sessions <mpiwg-sessions-bounces at lists.mpi-forum.org> On Behalf Of Holmes, Daniel John via mpiwg-sessions
Sent: Monday, January 29, 2024 4:13 PM
To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>; Aurelien Bouteiller <bouteill at icl.utk.edu>
Cc: Holmes, Daniel John <daniel.john.holmes at intel.com>
Subject: Re: [mpiwg-sessions] Sessions WG - meet 1/29/24

Hi Howard/all,

Here is the simple code I was talking about in the meeting today:

// general high-level optimistic application
#include <mpi.h>

int do_stuff_with_comm(MPI_Comm comm); // application work, defined elsewhere
void panic(void);                      // application-specific bail-out, defined elsewhere

int main(void) {

     MPI_Session session;
     MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &session);
     MPI_Group group;
     MPI_Group_from_session_pset(session, "mpi://world", &group);
     MPI_Comm comm;
     // full MPI-4 signature; the string tag is an arbitrary placeholder value
     MPI_Comm_create_from_group(group, "org.mpi-forum.sessions-wg.optimistic",
                                MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &comm);
     MPI_Group_free(&group);

     int ret = do_stuff_with_comm(comm);

     if (MPI_SUCCESS == ret) {
           MPI_Comm_disconnect(&comm);
           MPI_Session_finalize(&session);

     } else {
           panic(); // errors raised by MPI itself would already have aborted (MPI_ERRORS_ARE_FATAL)

     }
     return 0;
}

// general high-level pragmatic application
#include <mpi.h>

int do_stuff_with_comm(MPI_Comm comm); // application work, defined elsewhere

int main(void) {

     // additional code
     while (1) {

     MPI_Session session;
     MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
     MPI_Group world, failed, group;
     MPI_Group_from_session_pset(session, "mpi://world", &world);

     // additional code
     MPI_Session_get_proc_failed(session, &failed); // new API, seems easy to do
     MPI_Group_difference(world, failed, &group);
     MPI_Group_free(&world);
     MPI_Group_free(&failed);

     MPI_Comm comm;
     // full MPI-4 signature; the string tag is an arbitrary placeholder value
     MPI_Comm_create_from_group(group, "org.mpi-forum.sessions-wg.pragmatic",
                                MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm); // <-- the detail-devils live here
     MPI_Group_free(&group);

     int ret = do_stuff_with_comm(comm);

     MPI_Comm_disconnect(&comm);
     MPI_Session_finalize(&session);

     if (MPI_SUCCESS == ret) {
           break; // all done!

     } else if (MPI_ERR_PROC_FAILED == ret) { // strictly: compare the class obtained via MPI_Error_class
           continue; // no more panic

     }

     // additional code
     } // end while
     return 0;
}


Best wishes,
Dan.


From: mpiwg-sessions <mpiwg-sessions-bounces at lists.mpi-forum.org> On Behalf Of Pritchard Jr., Howard via mpiwg-sessions
Sent: Thursday, January 25, 2024 6:07 PM
To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
Cc: Pritchard Jr., Howard <howardp at lanl.gov>
Subject: [mpiwg-sessions] Sessions WG - meet 1/29/24

Hi Folks,

Let’s meet on 1/29 to continue discussions related to sessions and FT.

I think what will help is to consider several use cases and implications.

Here are some I have

  *   App using sessions to init/finalize and create at least one initial communicator with MPI_Comm_create_from_group,  but also wants to use methods available in slice1 of ULFM proposal to shrink/repair communicators.  Are there any problems?
  *   App using sessions to init/finalize and create at least one initial communicator with MPI_Comm_create_from_group, and wants to use methods available in slice 1 of ULFM proposal to create new group from a pset and create a new communicator
  *   App using sessions to init/finalize, etc. and when a fail-stop error is detected destroy the session, create a new session query for process sets, etc. and start all over.

We should also consider the behavior of MPI_Comm_create_from_group if a process failure occurs while creating a new communicator.  The ULFM slice 1 discusses behavior of MPI_COMM_DUP and process failure.  We’d probably want similar behavior for MPI_Comm_create_from_group.
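
For reference, here is a rough sketch of the duplicate/agree/shrink/retry pattern that ULFM slice 1 enables for MPI_COMM_DUP, which we would presumably want MPI_Comm_create_from_group to support in the same spirit. The MPIX_ spellings are the Open MPI ULFM prototype names, and the whole function is illustrative rather than proposed text:

// hedged sketch: duplicate a communicator, shrinking away failed processes and retrying as needed
#include <mpi.h>
#include <mpi-ext.h>   // MPIX_Comm_agree and MPIX_Comm_shrink in the Open MPI ULFM prototype

int dup_with_retry(MPI_Comm comm, MPI_Comm *newcomm)
{
     MPI_Comm work = comm;                          // never free the caller's handle
     while (1) {
           int rc = MPI_Comm_dup(work, newcomm);    // assumes MPI_ERRORS_RETURN on work
           int ok = (MPI_SUCCESS == rc) ? 1 : 0;
           int arc = MPIX_Comm_agree(work, &ok);    // uniform decision among the survivors
           if (MPI_SUCCESS == arc && ok) {
                 if (work != comm) MPI_Comm_free(&work);
                 return MPI_SUCCESS;
           }
           if (MPI_SUCCESS == rc) MPI_Comm_free(newcomm); // discard a duplicate that not everyone got
           MPI_Comm shrunk;
           MPIX_Comm_shrink(work, &shrunk);         // survivors-only communicator
           if (work != comm) MPI_Comm_free(&work);
           work = shrunk;
           // a real implementation would also distinguish non-fault-tolerance errors here
     }
}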

For those with access, the ULFM slice 1 PR is at https://github.com/mpi-forum/mpi-standard/pull/947

Thanks,

Howard

-------

Howard Pritchard
Research Scientist
HPC-ENV

Los Alamos National Laboratory
howardp at lanl.gov



