<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class="">Hi all,</div><div class=""><br class=""></div><div class=""><div class="">In lieu of a meeting, here’s a long email trying to move us forward:</div><div class=""><br class=""></div><div class="">When we last talked at the December meeting, we went over the fundamentals of what the larger FT proposal needs to include: <a href="https://github.com/mpiwg-ft/ft-issues/wiki/2018-12-04#discussion-of-ft-interoperability" class="">https://github.com/mpiwg-ft/ft-issues/wiki/2018-12-04#discussion-of-ft-interoperability</a>. We decided to work on a few things as a group:</div><div class=""><br class=""></div><div class=""><ol style="box-sizing: border-box; margin-bottom: 0px; margin-top: 0px; padding-left: 2em; list-style-type: lower-alpha; caret-color: rgb(36, 41, 46); color: rgb(36, 41, 46); font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Helvetica, Arial, sans-serif, "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Symbol"; font-size: 16px;" class=""><li style="box-sizing: border-box;" class="">Error codes & Error handlers<ul style="box-sizing: border-box; margin-bottom: 0px; margin-top: 0px; padding-left: 2em;" class=""><li style="box-sizing: border-box;" class="">This includes both scoped error handlers (currently in MPI 3.1) and universal error handlers (which would alert you about an error anywhere in the set of connected processes).</li></ul></li><li style="box-sizing: border-box; margin-top: 0.25em;" class="">Function to get a group of failed processes<ul style="box-sizing: border-box; margin-bottom: 0px; margin-top: 0px; padding-left: 2em;" class=""><li style="box-sizing: border-box;" class="">This is different from <code style="box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 
13.600000381469727px; background-color: rgba(27, 31, 35, 0.0470588); border-top-left-radius: 3px; border-top-right-radius: 3px; border-bottom-right-radius: 3px; border-bottom-left-radius: 3px; margin: 0px; padding: 0.2em 0.4em;" class="">MPI_COMM_FAILURE_ACK</code> / <code style="box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.600000381469727px; background-color: rgba(27, 31, 35, 0.0470588); border-top-left-radius: 3px; border-top-right-radius: 3px; border-bottom-right-radius: 3px; border-bottom-left-radius: 3px; margin: 0px; padding: 0.2em 0.4em;" class="">MPI_COMM_FAILURE_GET_ACKED</code> because of objections to the order of acknowledging first and then getting the list of acknowledged processes.</li></ul></li><li style="box-sizing: border-box; margin-top: 0.25em;" class="">Failure acknowledgement function that takes a group<ul style="box-sizing: border-box; margin-bottom: 0px; margin-top: 0px; padding-left: 2em;" class=""><li style="box-sizing: border-box;" class="">Allows the user to restart <code style="box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.600000381469727px; background-color: rgba(27, 31, 35, 0.0470588); border-top-left-radius: 3px; border-top-right-radius: 3px; border-bottom-right-radius: 3px; border-bottom-left-radius: 3px; margin: 0px; padding: 0.2em 0.4em;" class="">MPI_ANY_SOURCE</code> communication</li></ul></li><li style="box-sizing: border-box; margin-top: 0.25em;" class=""><code style="box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.600000381469727px; background-color: rgba(27, 31, 35, 0.0470588); border-top-left-radius: 3px; border-top-right-radius: 3px; border-bottom-right-radius: 3px; border-bottom-left-radius: 3px; margin: 0px; padding: 0.2em 0.4em;" class="">MPI_COMM_CREATE_GROUP</code></li><li style="box-sizing: border-box; 
margin-top: 0.25em;" class="">Communicator-based resilient broadcast that triggers error handling on other processes<ul style="box-sizing: border-box; margin-bottom: 0px; margin-top: 0px; padding-left: 2em;" class=""><li style="box-sizing: border-box;" class="">This is similar to the existing <code style="box-sizing: border-box; font-family: SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace; font-size: 13.600000381469727px; background-color: rgba(27, 31, 35, 0.0470588); border-top-left-radius: 3px; border-top-right-radius: 3px; border-bottom-right-radius: 3px; border-bottom-left-radius: 3px; margin: 0px; padding: 0.2em 0.4em;" class="">MPI_COMM_REVOKE</code></li></ul></li><li style="box-sizing: border-box; margin-top: 0.25em;" class="">Checkpoint MPI state</li><li style="box-sizing: border-box; margin-top: 0.25em;" class="">Return to previous MPI state X</li></ol></div><div class=""><div class=""><br class=""></div></div><div class="">Once we had these things in MPI, we could start looking at “Layer 1” (agree and revoke) and “Layer 2” (shrink).</div></div><div class=""><br class=""></div><div class="">So here are the action items where I think we can make progress. 
I think we need someone to take the lead for each of these to keep moving them forward.</div><div class=""><br class=""></div><div class=""><b class="">Universal Error Handlers</b></div><div class=""><b class=""><br class=""></b></div><div class="">This will require adding a new type of error handler that doesn’t include any sort of communication object (communicator, window, file), because the MPI process getting the alert may not even be in the group of processes where such an object makes sense.</div><div class=""><br class=""></div><div class=""><b class="">Functions to Retrieve Failed Processes & Acknowledge Failures</b></div><div class=""><b class=""><br class=""></b></div><div class="">During the meeting, it was decided that the current proposal of MPI_COMM_FAILURE_ACK / MPI_COMM_FAILURE_GET_ACKED was not acceptable because of the confusion around acknowledging process failures that you haven’t yet seen. More likely, the replacement would take a form where MPI provides a group of failed processes and the user then acknowledges some subset of those processes via a second function in order to re-enable MPI_ANY_SOURCE. If new failures arise that have not yet been acknowledged, MPI_ANY_SOURCE would again be disabled.</div><div class=""><br class=""></div><div class=""><b class="">MPI_COMM_CREATE_GROUP</b></div><div class=""><b class=""><br class=""></b></div><div class="">I think what we might have meant here is MPI_COMM_CREATE_FROM_GROUP, the new function being promoted by the Sessions working group, where a parent communicator is not involved. Otherwise the next topic gets very difficult.</div><div class=""><br class=""></div><div class=""><b class="">Communicator-based Resilient Broadcast that Triggers Error Handling on Other Processes</b></div><div class=""><b class=""><br class=""></b></div><div class="">This is very similar to the existing MPI_COMM_REVOKE. 
Something to remember if tempted to do much redesign here: attempting to allow a communicator to be “repaired” in place (transitioning from revoked to un-revoked), rather than constructing a new one, is very racy. It’s unclear what happens if the communicator is revoked twice and un-revoked by some processes in between. Also, because this function is now somewhat decoupled from the idea of “shrinking” a communicator, it needs to be clear how to create a new working communicator (perhaps with MPI_COMM_CREATE_FROM_GROUP as mentioned above).</div><div class=""><br class=""></div><div class=""><b class="">Checkpoint MPI State & Return to Previous State X</b></div><div class=""><b class=""><br class=""></b></div><div class="">This is a totally new topic that goes along with the reinit work that Ignacio has been doing, so I think he’s better equipped to offer an initial proposal here.</div><div class=""><br class=""></div><div class="">I’m willing to work on the failure reporting / acknowledgement piece and bring back a proposal in a future meeting. Can others choose other pieces to move forward?</div><div class=""><br class=""></div><div class=""><div class="">Thanks,</div><div class="">Wesley</div></div></body></html>