[mpiwg-sessions] [EXTERNAL] Re: Sessions WG - meet 1/29/24

Pritchard Jr., Howard howardp at lanl.gov
Thu Feb 1 13:35:28 CST 2024


Thanks Dan.  I added this to the 01/29/24 webex notes

https://github.com/mpiwg-sessions/sessions-issues/wiki/2024-01-29-webex

Howard


From: mpiwg-sessions <mpiwg-sessions-bounces at lists.mpi-forum.org> on behalf of MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
Reply-To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
Date: Thursday, February 1, 2024 at 10:02 AM
To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>, Aurelien Bouteiller <bouteill at icl.utk.edu>
Cc: "Holmes, Daniel John" <daniel.john.holmes at intel.com>
Subject: [EXTERNAL] Re: [mpiwg-sessions] Sessions WG - meet 1/29/24

Hi all,

Here are some words to capture our discussion on Monday 29th Jan.

The basic idea in the example code I gave is to detect process fail-stop faults, shrink the active communicator to exclude the failed processes, and then carry on.

The while loop captures the "carry on" part: we only exit when we hit the break statement (or we suffer a process fail-stop fault).
The additional code block captures the "shrink" part by creating a new communicator that excludes failed processes.

The MPI_Group_from_session_pset calls will all produce identical groups, even after processes have failed. This is a local procedure and there's no reason for it to return an error. We should guarantee that it never returns errors of class MPI_ERR_PROC_FAILED.
The MPI_Session_get_proc_failed calls give a snapshot of the knowledge contained in the local detector. There is no communication or nonlocal dependence here. This is a (new) local procedure and there's no reason for it to return an error. We should guarantee that it never returns errors of class MPI_ERR_PROC_FAILED.
The group manipulation procedures are existing MPI and will work exactly the same as they always have done. These are local procedures and there's no reason for them to return an error. We should guarantee that they never return errors of class MPI_ERR_PROC_FAILED.
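For concreteness, here is a hedged sketch of why such a query can be guaranteed local. The helper name, the known_failed array, and the idea of selecting detector-known ranks out of the "mpi://world" group are illustrative assumptions, not proposed API or a real implementation:

#include <mpi.h>

/* Illustrative only: one way an implementation could service the proposed
 * MPI_Session_get_proc_failed entirely from local detector state, by
 * selecting the ranks its detector has already marked as failed out of the
 * "mpi://world" group. No communication happens, so no error of class
 * MPI_ERR_PROC_FAILED can arise here. */
static int sketch_get_proc_failed(MPI_Session session,
                                  const int *known_failed, int nfailed,
                                  MPI_Group *failed)
{
    MPI_Group world;
    int rc = MPI_Group_from_session_pset(session, "mpi://world", &world);
    if (MPI_SUCCESS != rc) return rc;
    rc = MPI_Group_incl(world, nfailed, known_failed, failed); /* local */
    MPI_Group_free(&world);
    return rc;
}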

The devils are always in the details. The details in this case are all in the implementation of MPI_Comm_create_from_group.

The MPI_Comm_create_from_group procedure must handle some difficult cases:
1. the potential for failed processes to exist in the group that is passed in by the calling MPI process
2. the potential for the group passed in to be different to the groups passed in by other MPI processes
3. the potential for additional MPI processes to fail during the operation

1. failed before the operation
If this MPI process attempts to communicate with a failed process, its detector must eventually detect that, otherwise it is broken!
This means that the failure of the other MPI process is discovered during the operation -- see point (3).

2. different groups at different processes
There are several sub-cases here:
a) procA's group includes procB but procB has failed (any time before procA communicates with it)
b) procA's group contains procB but procB's group does not include procA
c) procA's group contains procB and procC, but procB's group includes procA and does not include procC
d) procA's group and procB's group are identical but one of them later discovers an (a|b|c) problem
Current shrink implementations rely on shared knowledge of the survivors, so each involved process can independently create the same spanning tree as all the other involved processes.
This simplifies the design and gives better worst-case scaling, at least asymptotically: one tree traversal instead of repeated rebuilds (compare the bounds at the end of this message).
All of the above sub-cases can happen during a shrink operation, so intuitively this is no worse than that.
We have two broad categories of approach here:
- any problem results in an error, return immediately without creating the output communicator
- any FT problem is dealt with internally; some kind of best-effort communicator is created
We probably want a resilient algorithm that doesn’t have dreadful scaling.
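Detecting sub-cases (b)-(d) requires processes to compare their input groups cheaply; exchanging a small fingerprint of the membership is one option (the "compare hashes" step in the algorithm below). A hedged illustration, assuming membership is expressed as "mpi://world" ranks so the hash is comparable across processes; the helper name and the FNV-1a hash are arbitrary choices, not a proposal:

#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative only: fingerprint a group's membership so two processes can
 * compare groups by exchanging 8 bytes instead of whole membership lists.
 * All calls used here are local group-manipulation procedures. */
static uint64_t group_fingerprint(MPI_Session session, MPI_Group group)
{
    MPI_Group world;
    MPI_Group_from_session_pset(session, "mpi://world", &world);

    int n;
    MPI_Group_size(group, &n);
    int *local = malloc(n * sizeof(int));
    int *global = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) local[i] = i;
    MPI_Group_translate_ranks(group, n, local, world, global);

    uint64_t h = 14695981039346656037ULL;    /* FNV-1a offset basis */
    for (int i = 0; i < n; i++) {
        h ^= (uint64_t)global[i];
        h *= 1099511628211ULL;               /* FNV-1a prime */
    }

    free(local);
    free(global);
    MPI_Group_free(&world);
    return h;
}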

3. failures during the operation
We can just do an agreement at the end to catch failures that happen after the critical moment when the other processes last communicated with the failed process.

Resilient algorithm:
Start optimistic: Create a spanning tree from the group members you have been given in the call.
Be eager to help: Always listen for incoming protocol messages related to communicator creation.
Attempt communication with your direct children;
  if new failure detected, update the local group, update the local detector, fix the spanning tree (skip that child, but add that child's children as your direct children)
  if a child reports a different group (compare hashes), figure out the difference, update the local detector, fix the spanning tree (recreate it using only uncontacted undead processes)
Attempt communication with your parent;
  if new failure detected, update the local group, update the local detector, fix the spanning tree (skip the parent, but add that parent's parent as your direct parent)
  if the parent reports a different group (compare hashes), figure out the difference, update the local detector, fix the spanning tree (recreate it using only uncontacted undead processes)
Assume it all worked beautifully but execute MPI_Comm_agree to make certain.
  if new failure detected, update the local group, update the local detector, start again
  if the agreement succeeds without discovering new failures, we're done!

Best-case asymptotic scaling is O(log(p)): a single traversal of a spanning tree over p processes.
Worst-case is O(sum_{i=1}^{p-1} log(p-i)) < O(p log(p)): no more than p-1 tree traversals over a shrinking tree.
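To make the tree-repair step concrete, here is a hedged, self-contained sketch of "skip that child, but add that child's children as your direct children" for a binary spanning tree over group ranks 0..p-1. The tree shape, the helper names, and the fixed bound are illustrative assumptions, not the proposed implementation:

#include <stdbool.h>
#include <stdio.h>

#define MAX_CHILDREN 64 /* illustrative bound, enough for the sketch */

/* Compute rank's effective children in a binary tree over ranks 0..p-1,
 * recursively adopting the children of any child that has failed, so the
 * tree stays connected over the surviving ranks. */
static void live_children(int rank, int p, const bool *failed,
                          int *children, int *nchildren)
{
    int cand[2] = { 2 * rank + 1, 2 * rank + 2 };
    for (int i = 0; i < 2; i++) {
        int c = cand[i];
        if (c >= p) continue;
        if (!failed[c]) {
            if (*nchildren < MAX_CHILDREN) children[(*nchildren)++] = c;
        } else {
            /* child failed: adopt its children as our direct children */
            live_children(c, p, failed, children, nchildren);
        }
    }
}

int main(void)
{
    const int p = 8;
    bool failed[8] = { false, true, false, false, false, false, false, false };
    int children[MAX_CHILDREN], n = 0;

    live_children(0, p, failed, children, &n); /* root adopts 3 and 4 from failed rank 1 */
    for (int i = 0; i < n; i++) printf("child of root: %d\n", children[i]);
    return 0;
}

The same idea applied in the parent direction gives "skip the parent, but add that parent's parent as your direct parent".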

Best wishes,
Dan.

From: mpiwg-sessions <mpiwg-sessions-bounces at lists.mpi-forum.org> On Behalf Of Holmes, Daniel John via mpiwg-sessions
Sent: Monday, January 29, 2024 4:13 PM
To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>; Aurelien Bouteiller <bouteill at icl.utk.edu>
Cc: Holmes, Daniel John <daniel.john.holmes at intel.com>
Subject: Re: [mpiwg-sessions] Sessions WG - meet 1/29/24

Hi Howard/all,

Here is the simple code I was talking about in the meeting today:

// general high-level optimistic application
// (do_stuff_with_comm() and panic() are placeholders defined elsewhere)
#include <mpi.h>

int main(void) {

     int ret;
     MPI_Session session;
     MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &session);
     MPI_Group group;
     MPI_Group_from_session_pset(session, "mpi://world", &group);
     MPI_Comm comm;
     MPI_Comm_create_from_group(group, "example.tag", // illustrative string tag
                                MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, &comm);
     MPI_Group_free(&group);

     ret = do_stuff_with_comm(comm);

     if (MPI_SUCCESS == ret) {
           MPI_Comm_disconnect(&comm);
           MPI_Session_finalize(&session);
           return 0;

     } else {
           panic();

     }
}

// general high-level pragmatic application
// (do_stuff_with_comm() is a placeholder defined elsewhere)
#include <mpi.h>

int main(void) {

     int ret;

     // additional code
     while (1) {

           MPI_Session session;
           MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
           MPI_Group world, failed, group;
           MPI_Group_from_session_pset(session, "mpi://world", &world);

           // additional code
           MPI_Session_get_proc_failed(session, &failed); // new API, seems easy to do
           MPI_Group_difference(world, failed, &group);
           MPI_Group_free(&world);
           MPI_Group_free(&failed);

           MPI_Comm comm;
           MPI_Comm_create_from_group(group, "example.tag", // illustrative string tag
                                      MPI_INFO_NULL, MPI_ERRORS_RETURN,
                                      &comm); // <-- the detail-devils live here
           MPI_Group_free(&group);

           ret = do_stuff_with_comm(comm);

           MPI_Comm_disconnect(&comm);
           MPI_Session_finalize(&session);

           if (MPI_SUCCESS == ret) {
                 break; // all done!

           } else if (MPI_ERR_PROC_FAILED == ret) {
                 continue; // no more panic

           }

     // additional code
     } // end while

     return 0;
}
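For completeness, a hedged sketch (continuing the example above) of what do_stuff_with_comm might look like. The allreduce loop is a placeholder for any real computation; returning the error class rather than the raw error code is an assumption that keeps the comparison against MPI_ERR_PROC_FAILED above well-defined:

// Illustrative only: with MPI_ERRORS_RETURN in effect, a fail-stop fault
// surfaces as a return code whose class is MPI_ERR_PROC_FAILED, which the
// caller can use to decide between "shrink and retry" and "give up".
static int do_stuff_with_comm(MPI_Comm comm)
{
    for (int iter = 0; iter < 1000; iter++) {
        double local = 1.0, global = 0.0;
        int rc = MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        if (MPI_SUCCESS != rc) {
            int eclass;
            MPI_Error_class(rc, &eclass);
            return eclass;
        }
    }
    return MPI_SUCCESS;
}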


Best wishes,
Dan.


From: mpiwg-sessions <mpiwg-sessions-bounces at lists.mpi-forum.org> On Behalf Of Pritchard Jr., Howard via mpiwg-sessions
Sent: Thursday, January 25, 2024 6:07 PM
To: MPI Sessions working group <mpiwg-sessions at lists.mpi-forum.org>
Cc: Pritchard Jr., Howard <howardp at lanl.gov>
Subject: [mpiwg-sessions] Sessions WG - meet 1/29/24

Hi Folks,

Let’s meet on 1/29 to continue discussions related to sessions and FT.

I think what will help is to consider several use cases and implications.

Here are some I have:

  *   App using sessions to init/finalize and create at least one initial communicator with MPI_Comm_create_from_group, but also wanting to use the methods available in slice 1 of the ULFM proposal to shrink/repair communicators (see the sketch below). Are there any problems?
  *   App using sessions to init/finalize and create at least one initial communicator with MPI_Comm_create_from_group, and wanting to use the methods available in slice 1 of the ULFM proposal to create a new group from a pset and create a new communicator.
  *   App using sessions to init/finalize, etc., and, when a fail-stop error is detected, destroying the session, creating a new session, querying for process sets, etc., and starting all over.

We should also consider the behavior of MPI_Comm_create_from_group if a process failure occurs while creating a new communicator. ULFM slice 1 discusses the behavior of MPI_COMM_DUP under process failure; we'd probably want similar behavior for MPI_Comm_create_from_group.
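For the first use case, a hedged sketch of how the pieces might combine. It uses Open MPI's current MPIX_ spellings of the slice 1 interfaces (MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_PROC_FAILED, corresponding to the proposal's MPI_ names), and do_stuff_with_comm is the placeholder from Dan's example; whether shrink may legally be applied to a communicator produced by MPI_Comm_create_from_group is exactly the open question:

#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI's MPIX_ extensions */

extern int do_stuff_with_comm(MPI_Comm comm); /* placeholder from the example */

/* Illustrative only: keep computing on a communicator created with
 * MPI_Comm_create_from_group, shrinking away failed processes whenever a
 * fail-stop fault is reported, and giving up on any other error.
 * Takes ownership of comm and disconnects it before returning. */
static int run_with_repair(MPI_Comm comm)
{
    int rc = do_stuff_with_comm(comm);
    while (MPI_SUCCESS != rc) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (MPIX_ERR_PROC_FAILED != eclass) break; /* not repairable here */

        MPIX_Comm_revoke(comm);               /* interrupt any stragglers */
        MPI_Comm repaired;
        MPIX_Comm_shrink(comm, &repaired);    /* survivors only */
        MPI_Comm_free(&comm);                 /* free vs. disconnect here is one of the details to settle */
        comm = repaired;

        rc = do_stuff_with_comm(comm);
    }
    MPI_Comm_disconnect(&comm);
    return rc;
}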

For those with access, the ULFM slice 1 PR is at https://github.com/mpi-forum/mpi-standard/pull/947

Thanks,

Howard

-------

Howard Pritchard
Research Scientist
HPC-ENV

Los Alamos National Laboratory
howardp at lanl.gov

