[mpiwg-ft] Larger FT Proposal Next Steps

Fri Jan 18 08:55:06 CST 2019

Hi all,

In lieu of a meeting, here’s a long email trying to move us forward:

When we last talked at the December meeting, we went over the fundamentals of what the larger FT proposal needs to include: https://github.com/mpiwg-ft/ft-issues/wiki/2018-12-04#discussion-of-ft-interoperability. We decided to work on a few things as a group:

Error codes & Error handlers
    This includes both scoped error handlers (currently in MPI 3.1) and universal error handlers (which would alert you about an error anywhere in the set of connected processes).
Function to get a group of failed processes
    This is different from MPI_COMM_FAILURE_ACK / MPI_COMM_FAILURE_GET_ACKED because of the objection to the order of acking and then getting the list of acked processes.
Failure acknowledgement function that takes a group
    Allows the user to restart MPI_ANY_SOURCE communication
MPI_COMM_CREATE_GROUP
Communicator-based resilient broadcast that triggers error handling on other processes
    This is similar to the existing MPI_COMM_REVOKE
Checkpoint MPI state
Return to previous MPI state X

Once we have these things in MPI, we can start looking at “Layer 1” (agree and revoke) and “Layer 2” (shrink).

So here are the action items where I think we can make progress. We need someone to take the lead on each of these to keep it moving forward.

Universal Error Handlers

This will require adding a new type of error handler that doesn’t include any sort of communication object (communicator, window, file), because the MPI process getting the alert may not even be a member of the group of processes for which such an object would make sense.
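
To make the shape of this concrete, here is a minimal sketch in plain C of a handler that is registered without any communication object. Everything in it is hypothetical and only for discussion: universal_errhandler_fn, universal_errhandler_set, universal_error_raise and count_errors are names I made up, not MPI functions, and the "library" side is just a stub that invokes the callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type: it receives an error code and user data only.
   There is deliberately no communicator, window, or file argument. */
typedef void (*universal_errhandler_fn)(int error_code, void *user_data);

static universal_errhandler_fn g_handler = NULL;
static void *g_user_data = NULL;

/* Register the single process-wide handler (hypothetical call). */
void universal_errhandler_set(universal_errhandler_fn fn, void *user_data) {
    assert(fn != NULL);
    g_handler = fn;
    g_user_data = user_data;
}

/* Stub for the library side: deliver an error to the registered handler.
   Returns 1 if a handler ran, 0 if none was registered. */
int universal_error_raise(int error_code) {
    if (g_handler == NULL) {
        return 0;
    }
    g_handler(error_code, g_user_data);
    return 1;
}

/* Example handler: counts how many alerts this process has received. */
void count_errors(int error_code, void *user_data) {
    (void)error_code; /* unused in this example */
    int *counter = (int *)user_data;
    (*counter)++;
}
```

The point of the sketch is only the signature: since the handler is not attached to an object, the process can be alerted about a failure in a group it does not belong to.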

Function to Retrieve Failed Processes & Acknowledge Failures

During the meeting, it was decided that the current proposal of MPI_COMM_FAILURE_ACK / MPI_COMM_FAILURE_GET_ACKED was not acceptable because of the confusion around acknowledging process failures that you haven’t yet seen. More likely, the replacement would take the form of MPI providing a group of failed processes, after which the user acknowledges some subset of those processes via a second function in order to re-enable MPI_ANY_SOURCE. If new failures arise that have not yet been acknowledged, MPI_ANY_SOURCE would again be disabled.
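
As a starting point for discussion, the get-then-acknowledge semantics described above can be modelled in a few lines of plain C. This is a sketch under my own assumptions, not a proposed API: ft_state, ft_report_failure, ft_get_failed, ft_ack_failed and ft_any_source_enabled are hypothetical names, and a 64-bit mask stands in for an MPI group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model, not an MPI API: ranks 0..63 are tracked as bits. */
typedef uint64_t rank_set;

typedef struct {
    rank_set failed; /* failures the library has detected so far */
    rank_set acked;  /* failures the user has explicitly acknowledged */
} ft_state;

static rank_set rank_bit(int rank) {
    assert(rank >= 0 && rank < 64);
    return (rank_set)1 << rank;
}

/* Library side: a new failure has been detected. */
void ft_report_failure(ft_state *s, int rank) {
    s->failed |= rank_bit(rank);
}

/* Step 1: the user asks for the group of failed processes. */
rank_set ft_get_failed(const ft_state *s) {
    return s->failed;
}

/* Step 2: the user acknowledges a subset. Only failures that have already
   been reported are recorded, so an unseen failure cannot be acknowledged. */
void ft_ack_failed(ft_state *s, rank_set subset) {
    s->acked |= subset & s->failed;
}

/* Wildcard receives are enabled only while every known failure is acked. */
bool ft_any_source_enabled(const ft_state *s) {
    return (s->failed & ~s->acked) == 0;
}
```

In this model, a failure that arrives between ft_get_failed and ft_ack_failed leaves MPI_ANY_SOURCE disabled until the user fetches the group again and acknowledges the newcomer, which is the behavior the ordering objection was asking for.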

MPI_COMM_CREATE_GROUP

I think what we might have meant here is MPI_COMM_CREATE_FROM_GROUP, the new function being promoted by the Sessions working group, in which a parent communicator is not involved. Otherwise the next topic gets very difficult.

Communicator-based Resilient Broadcast that Triggers Error Handling on Other Processes

This is very similar to the existing MPI_COMM_REVOKE. Something to remember if tempted to do much redesign here: attempting to allow a communicator to be “repaired” in place (transitioning from revoked to un-revoked), rather than constructing a new one, is very racy. It’s unclear what happens if the communicator is revoked twice and un-revoked by some processes in between. Also, because this function is now somewhat decoupled from the idea of “shrinking” a communicator, it needs to be clear how to create a new working communicator (perhaps with MPI_COMM_CREATE_FROM_GROUP as mentioned above).
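
A small plain-C model of the "never un-revoke, build a new one" approach may help frame the discussion. It is only a sketch with assumed names: model_comm, model_comm_create_from_group, model_comm_revoke and model_comm_rebuild are hypothetical and are not MPI calls, and a bit mask again stands in for a group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a communicator, not an MPI type. */
typedef struct {
    uint64_t members; /* bit i set: rank i belongs to this communicator */
    bool revoked;     /* one-way flag: never cleared once set */
} model_comm;

/* Build a communicator directly from a group, with no parent involved. */
model_comm model_comm_create_from_group(uint64_t group) {
    model_comm c = { group, false };
    return c;
}

/* Revoking is idempotent: a second revoke changes nothing. */
void model_comm_revoke(model_comm *c) {
    c->revoked = true;
}

/* Recovery path: leave the old communicator revoked and construct a fresh
   one from the surviving members. */
model_comm model_comm_rebuild(const model_comm *old, uint64_t failed) {
    assert(old->revoked);
    return model_comm_create_from_group(old->members & ~failed);
}
```

Because the revoked flag only ever moves in one direction, the "revoked twice with an un-revoke in between" question does not arise in this model; the cost is that the user has to switch to the new communicator explicitly.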

Checkpoint MPI State & Return to Previous State X

This is a totally new topic that goes along with the reinit work that Ignacio has been doing, so I think he’s better equipped to offer an initial proposal here.

I’m willing to work on the failure reporting / acknowledgement piece and bring back a proposal in a future meeting. Can others choose other pieces to move forward?

Thanks,
Wesley