<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Just curious - how do you figure to handle the cascade-of-failure scenario? This is by far the most common as you either lose a node (which means the detection of individual failure for each proc on that node) or you lose a switch (and therefore get report of individual failure for all procs communicating over that device). Kind of rare for a proc to just die on its own (except for contrived tests, of course), though not impossible.<div><br></div><div>Ralph</div><div><br id="lineBreakAtBeginningOfMessage"><div><br><blockquote type="cite"><div>On Feb 1, 2024, at 10:02 AM, Holmes, Daniel John via mpiwg-sessions <mpiwg-sessions@lists.mpi-forum.org> wrote:</div><br class="Apple-interchange-newline"><div><meta charset="UTF-8"><div class="WordSection1" style="page: WordSection1; caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Hi all,<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">WORDS to capture our discussion on Monday 29th Jan.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, 
sans-serif;"><span style="font-family: Calibri, sans-serif;">The basic idea in the example code I gave is to detect process fail-stop faults, shrink the active communicator to exclude the failed processes, and then carry on.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">The while loop captures the "carry on" part, only exit when we hit the break statement (or we suffer a process fail-stop fault).<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">The additional code block captures the "shrink" part by creating a new communicator that excludes failed processes.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">The MPI_Group_from_session_pset calls will all produce identical groups, even after processes have failed. This is a local procedure and there's no reason for it to return an error. We should guarantee that it never returns errors of class MPI_ERR_PROC_FAILED.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">The MPI_Session_get_proc_failed calls give a snapshot of the knowledge contained in the local detector. There is no communication or nonlocal dependence here. This is a (new) local procedure and there's no reason for it to return an error. 
We should guarantee that it never returns errors of class MPI_ERR_PROC_FAILED.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">The group manipulation procedures are existing MPI and will work exactly the same as they always have done. These are local procedures and there's no reason for them to return an error. We should guarantee that they never return errors of class MPI_ERR_PROC_FAILED.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">The devils are always in the details. The details in this case are all in the implementation of MPI_Comm_create_from_group.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">The MPI_Comm_create_from_group procedure must handle some difficult cases:<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">1. the potential for failed processes to exist in the group that is passed in by the calling MPI process<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">2. the potential for the group passed in to be different to the groups passed in by other MPI processes<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">3. 
the potential for additional MPI processes to fail during the operation<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">1. failed before the operation<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">If this MPI process attempts to communicate with a failed process, its detector must eventually detect that; otherwise it is broken!<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">This means that the failure of the other MPI process is discovered during the operation -- see point (3).<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">2. 
different groups at different processes<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">There are several sub-cases here:<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">a) procA's group includes procB but procB has failed (any time before procA communicates with it)<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">b) procA's group contains procB but procB's does not include procA<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">c) procA's group contains procB and procC but procB's group includes procA but does not include procC<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">d) procA's group and procB's group are identical but one of them later discovers an (a|b|c) problem<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Current shrink implementations rely on shared knowledge of the survivors so each involved process can independently create the same spanning tree as all the other involved processes.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">This simplifies the design and results in a better worst-case performance scaling, at least asymptotically [ED: pls check]<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">All of the above sub-cases can happen during a shrink operation, so intuitively this 
is no worse than that.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">We have two broad categories of approach here:<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">- any problem results in an error, return immediately without creating the output communicator<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">- any FT problem is dealt with internally; some kind of best-effort communicator is created<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">We probably want a resilient algorithm that doesn’t have dreadful scaling.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">3. 
failures during the operation<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">We can just do an agreement at the end to catch failures that happen after the critical moment when other processes communicate with it.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Resilient algorithm:<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Start optimistic: Create a spanning tree from the group members you have been given in the call.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Be eager to help: Always listen for incoming protocol messages related to communicator creation.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Attempt communication with your direct children;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"> if new failure detected, update the local group, update the local detector, fix the spanning tree (skip that child, but add that child's children as your direct children)<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"> if a child reports a different group (compare hashes), figure out the difference, update the local detector, fix the spanning tree (recreate it using only uncontacted undead processes)<o:p></o:p></span></div><div 
style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Attempt communication with your parent;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"> if new failure detected, update the local group, update the local detector, fix the spanning tree (skip the parent, but add that parent's parent as your direct parent)<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"> if the parent reports a different group (compare hashes), figure out the difference, update the local detector, fix the spanning tree (recreate it using only uncontacted undead processes)<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Assume it all worked beautifully but execute MPI_Comm_agree to make certain.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"> if new failure detected, update the local group, update the local detector, start again<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"> if the agreement succeeds without discovering new failures, we're done!<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Best-case asymptotic scaling is O(log(p)), a spanning tree traversal of p processes.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, 
sans-serif;">Worst-case asymptotic scaling is O(sum_{i=1}^{p-1} log(p-i)), which is bounded above by O(p log(p)): no more than p-1 tree traversals with a shrinking tree size.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Best wishes,<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Dan.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div><div style="border-width: 1pt medium medium; border-style: solid none none; border-color: rgb(225, 225, 225) currentcolor currentcolor; border-image: none; padding: 3pt 0cm 0cm;"><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><b><span lang="EN-US" style="font-family: Calibri, sans-serif;">From:</span></b><span lang="EN-US" style="font-family: Calibri, sans-serif;"><span class="Apple-converted-space"> </span>mpiwg-sessions <<a href="mailto:mpiwg-sessions-bounces@lists.mpi-forum.org" style="color: rgb(70, 120, 134); text-decoration: underline;">mpiwg-sessions-bounces@lists.mpi-forum.org</a>><span class="Apple-converted-space"> </span><b>On Behalf Of<span class="Apple-converted-space"> </span></b>Holmes, Daniel John via mpiwg-sessions<br><b>Sent:</b><span class="Apple-converted-space"> </span>Monday, January 29, 2024 4:13 PM<br><b>To:</b><span class="Apple-converted-space"> </span>MPI Sessions working group <<a href="mailto:mpiwg-sessions@lists.mpi-forum.org" style="color: rgb(70, 120, 134); text-decoration: underline;">mpiwg-sessions@lists.mpi-forum.org</a>>; Aurelien Bouteiller <<a href="mailto:bouteill@icl.utk.edu" style="color: rgb(70, 120, 134); text-decoration: 
underline;">bouteill@icl.utk.edu</a>><br><b>Cc:</b><span class="Apple-converted-space"> </span>Holmes, Daniel John <<a href="mailto:daniel.john.holmes@intel.com" style="color: rgb(70, 120, 134); text-decoration: underline;">daniel.john.holmes@intel.com</a>><br><b>Subject:</b><span class="Apple-converted-space"> </span>Re: [mpiwg-sessions] Sessions WG - meet 1/29/24<o:p></o:p></span></div></div></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><o:p> </o:p></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Hi Howard/all,<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Here is the simple code I was talking about in the meeting today:<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";">// general high-level optimistic application<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";">void main() {<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Session session;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Session_Init(MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, 
&session);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Group group;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Group_from_session_pset(session, "<a href="mpi://world" style="color: rgb(70, 120, 134); text-decoration: underline;">mpi://world</a>", &group);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Comm comm;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Comm_create_from_group(group, &comm);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> ret = do_stuff_with_comm(comm);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> if (MPI_SUCCESS == ret) {<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Comm_disconnect(&comm);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Session_Finalize(&session);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> break;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, 
sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> } else {<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> panic();<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> }<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";">}<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";">// general high-level pragmatic application<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";">void main() {<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> // additional code<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> while (1) {<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier 
New";"> MPI_Session session;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Session_Init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Group world, failed, group;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Group_from_session_pset(session, "<a href="mpi://world" style="color: rgb(70, 120, 134); text-decoration: underline;">mpi://world</a>", &world);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> // additional code<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Session_get_proc_failed(session, &failed); // new API, seems easy to do<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Group_difference(world, failed, &group);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Group_free(&world);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Group_free(&failed);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: 
"Courier New";"> MPI_Comm comm;<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Comm_create_from_group(group, &comm); // <-- the detail-devils live here<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Group_free(&group);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> ret = do_stuff_with_comm(comm);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Comm_disconnect(&comm);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> MPI_Session_Finalize(&session);<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> if (MPI_SUCCESS == ret) {<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> break; // all done!<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> } else if (MPI_ERR_PROC_FAILED 
== ret) {<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> continue; // no more panic<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> }<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> <o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> // additional code<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";"> } // end while<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: "Courier New";">}<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Best wishes,<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;">Dan.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div><div><div 
style="border-width: 1pt medium medium; border-style: solid none none; border-color: rgb(225, 225, 225) currentcolor currentcolor; border-image: none; padding: 3pt 0cm 0cm;"><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><b><span lang="EN-US" style="font-family: Calibri, sans-serif;">From:</span></b><span lang="EN-US" style="font-family: Calibri, sans-serif;"><span class="Apple-converted-space"> </span>mpiwg-sessions <<a href="mailto:mpiwg-sessions-bounces@lists.mpi-forum.org" style="color: rgb(70, 120, 134); text-decoration: underline;">mpiwg-sessions-bounces@lists.mpi-forum.org</a>><span class="Apple-converted-space"> </span><b>On Behalf Of<span class="Apple-converted-space"> </span></b>Pritchard Jr., Howard via mpiwg-sessions<br><b>Sent:</b><span class="Apple-converted-space"> </span>Thursday, January 25, 2024 6:07 PM<br><b>To:</b><span class="Apple-converted-space"> </span>MPI Sessions working group <<a href="mailto:mpiwg-sessions@lists.mpi-forum.org" style="color: rgb(70, 120, 134); text-decoration: underline;">mpiwg-sessions@lists.mpi-forum.org</a>><br><b>Cc:</b><span class="Apple-converted-space"> </span>Pritchard Jr., Howard <<a href="mailto:howardp@lanl.gov" style="color: rgb(70, 120, 134); text-decoration: underline;">howardp@lanl.gov</a>><br><b>Subject:</b><span class="Apple-converted-space"> </span>[mpiwg-sessions] Sessions WG - meet 1/29/24<o:p></o:p></span></div></div></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><o:p> </o:p></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">Hi Folks,<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">Let’s meet on 1/29 to continue discussions related to sessions and FT. 
<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">I think what will help is to consider several use cases and implications.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">Here are some I have<o:p></o:p></span></div><ul type="disc" style="margin-bottom: 0cm; margin-top: 0cm;"><li class="MsoListParagraph" style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">App using sessions to init/finalize and create at least one initial communicator with MPI_Comm_create_from_group, but also wants to use methods available in slice1 of ULFM proposal to shrink/repair communicators. Are there any problems?<o:p></o:p></span></li><li class="MsoListParagraph" style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">App using sessions to init/finalize and create at least one initial communicator with MPI_Comm_create_from_group, and wants to use methods available in slice 1 of ULFM proposal to create new group from a pset and create a new communicator<o:p></o:p></span></li><li class="MsoListParagraph" style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">App using sessions to init/finalize, etc. and when a fail-stop error is detected destroy the session, create a new session query for process sets, etc. 
and start all over.<span class="Apple-converted-space"> </span><o:p></o:p></span></li></ul><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">We should also consider the behavior of MPI_Comm_create_from_group if a process failure occurs while creating a new communicator. The ULFM slice 1 discusses behavior of MPI_COMM_DUP and process failure. We’d probably want similar behavior for MPI_Comm_create_from_group.<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">For those with access, the ULFM slice 1 PR is at<span class="Apple-converted-space"> </span><a href="https://github.com/mpi-forum/mpi-standard/pull/947" style="color: rgb(70, 120, 134); text-decoration: underline;">https://github.com/mpi-forum/mpi-standard/pull/947</a><o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">Thanks,<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US">Howard<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US"><o:p> </o:p></span></div><div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US" style="font-family: Calibri, sans-serif;">-------<o:p></o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US" style="font-family: 
Calibri, sans-serif;"><o:p> </o:p></span></div><table class="MsoNormalTable" border="0" cellspacing="0" cellpadding="0" width="366" style="width: 274.5pt; border-collapse: collapse;"><tbody><tr style="height: 129.6pt;"><td width="76" valign="top" style="width: 56.95pt; padding: 0cm 5.4pt; height: 129.6pt;"><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-size: 9pt; font-family: Arial, sans-serif;"><o:p></o:p></span></div></td><td width="290" valign="top" style="width: 217.55pt; padding: 0cm 5.4pt; height: 129.6pt;"><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><b><span style="font-size: 12pt; font-family: Arial, sans-serif; color: rgb(11, 26, 140);">Howard Pritchard<o:p></o:p></span></b></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><b><span style="font-size: 10pt; font-family: Arial, sans-serif; color: rgb(84, 89, 97);">Research Scientist<o:p></o:p></span></b></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><b><span style="font-size: 10pt; font-family: Arial, sans-serif; color: rgb(84, 89, 97);">HPC-ENV<o:p></o:p></span></b></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span style="font-size: 9pt; font-family: Arial, sans-serif;"><o:p> </o:p></span></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><b><span style="font-size: 9pt; font-family: Arial, sans-serif; color: rgb(11, 26, 140);">Los Alamos National Laboratory<o:p></o:p></span></b></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><b><span style="font-size: 9pt; font-family: Arial, sans-serif; color: rgb(11, 26, 140);"><a href="mailto:howardp@lanl.gov" style="color: rgb(70, 120, 134); text-decoration: underline;"><span style="color: rgb(5, 99, 
193);">howardp@lanl.gov</span></a><o:p></o:p></span></b></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><b><span style="font-size: 9pt; font-family: Arial, sans-serif; color: rgb(11, 26, 140);"><o:p> </o:p></span></b></div></td></tr></tbody></table><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span lang="EN-US" style="font-family: Calibri, sans-serif;"><o:p> </o:p></span></div></div><div style="margin: 0cm; font-size: 11pt; font-family: Aptos, sans-serif;"><span 
lang="EN-US"><o:p> </o:p></span></div></div><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;">_______________________________________________</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><span style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; float: none; display: inline !important;">mpiwg-sessions mailing list</span><br style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><a href="mailto:mpiwg-sessions@lists.mpi-forum.org" style="color: rgb(70, 120, 134); text-decoration: underline; font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">mpiwg-sessions@lists.mpi-forum.org</a><br style="caret-color: rgb(0, 0, 0); 
font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;"><a href="https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions" style="color: rgb(70, 120, 134); text-decoration: underline; font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">https://lists.mpi-forum.org/mailman/listinfo/mpiwg-sessions</a></div></blockquote></div><br></div></body></html>