[mpiwg-ft] FTWG Forum Update

Sur, Sayantan sayantan.sur at intel.com
Thu Mar 6 16:32:41 CST 2014


Hi Wesley,

Thanks a lot for your work and for typing up this very detailed report for those of us who weren't at the Forum.

Sayantan

From: mpiwg-ft [mailto:mpiwg-ft-bounces at lists.mpi-forum.org] On Behalf Of Bland, Wesley B.
Sent: Thursday, March 06, 2014 2:08 PM
To: MPI WG Fault Tolerance and Dynamic Process Control working Group
Subject: [mpiwg-ft] FTWG Forum Update

FT Working Group Members,

I wanted to provide an update on everything that's gone on at the forum. Sorry about the length, but I want to be complete.

We had lots of discussions this week about the FT proposal, with both Monday and Wednesday entirely dedicated to fault tolerance (thanks to those who sat through it and provided feedback!). Monday was general discussion about the proposal. However, most of the new discussion centered on an alternative "proposal" presented by Todd Gamblin from LLNL. I don't put "proposal" in quotes to be glib, only to point out that they weren't presenting this as a fully formed proposal, just a set of alternative ideas.

I won't present the entire proposal from LLNL (though I encourage that discussion to continue); instead, I'll try to summarize it and point everyone to the Github page with the header file that includes the spec: https://github.com/tgamblin/mpi-resilience. Again, this isn't a fully baked proposal, and we've already had a long discussion that concluded it would be entirely reworked before actually being considered, so don't judge the specifics of the proposal so much as the general ideas.

What LLNL proposed amounts to native support for bulk synchronous rollback to INIT. Their assertion is that most current codes would not be able to benefit from the ULFM proposal because the barrier to entry is too high: too much code would have to change in order to do something conceptually simple, like revoking all communicators, jumping back to INIT, rebuilding MPI_COMM_WORLD, and continuing. For their applications, they don't need all of the extra machinery because this is the only recovery model that makes sense. By providing a recovery model that supports that behavior, we would be picking the low-hanging fruit and making an immediate impact on lots of applications that exist today.
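
To make that concrete, here's a minimal sketch (mine, not code from either proposal) of the kind of recovery step the current ULFM proposal asks applications to write, using the MPIX_ names from the reference implementation. Re-spawning replacement processes and reloading a checkpoint are left as comments:

    #include <mpi.h>
    #include <mpi-ext.h>   /* MPIX_ fault-tolerance extensions, if your MPI provides them */

    static void recover(MPI_Comm *world)
    {
        MPI_Comm shrunk;

        /* Make sure no process keeps blocking on the broken communicator. */
        MPIX_Comm_revoke(*world);

        /* Build a new communicator containing only the survivors. */
        MPIX_Comm_shrink(*world, &shrunk);

        MPI_Comm_free(world);
        *world = shrunk;

        /* From here the application would MPI_Comm_spawn replacements,
         * merge them in, and roll everyone back to its last checkpoint. */
    }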

The specifics of their proposal are that MPI would add a few things to make their recovery model possible. Everything boils down to repairing one of two things: the MPI state and the application data. To repair MPI state, they propose a function called MPI_Fault, which would cause all processes in MPI_COMM_WORLD to jump back to MPI_Reinit (my understanding is that this call is not collective). During that process, MPI also cleans up all of its internal state so everything is back as if it were the first call to MPI_(Re)init. It also makes an attempt to get MPI_COMM_WORLD back to its original size; it may fail at that, but that behavior would be the same as in MPI_Init. This ensures that MPI is clean again and the user can continue as if they had been restarted from a checkpoint, just without actually being restarted and losing the batch reservation. To repair application state, they introduce a new idea called MPI_Cleanup_handlers: the application registers a handler whenever it has something that needs to be cleaned up when MPI_Fault is called. For this particular idea, we agreed that it could probably be pushed outside of MPI since it doesn't touch MPI internal state.
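
For the sake of discussion, here's how I picture an application using that model. The names MPI_Reinit, MPI_Fault, and the cleanup handlers come from their description, but the signatures below are my own guesses, not the header on the Github page, so treat this as a picture of the control flow rather than something that compiles:

    #include <mpi.h>

    /* Hypothetical cleanup handler: called when MPI_Fault rolls everyone
     * back, so the application can release resources before resuming. */
    static void cleanup(void *state)
    {
        (void)state;
        /* free buffers, close files, reset library state ... */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Execution resumes here after every fault, with MPI's internal
         * state rebuilt and MPI_COMM_WORLD restored (possibly smaller). */
        MPI_Reinit(&argc, &argv);                      /* hypothetical signature */
        MPI_Register_cleanup_handler(cleanup, NULL);   /* hypothetical signature */

        /* ... reload checkpoint and run; on an unrecoverable failure MPI
         * (or the application) calls MPI_Fault() and control comes back
         * to MPI_Reinit above, without losing the batch reservation. */

        MPI_Finalize();
        return 0;
    }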

There were some other parts of this proposal related to synchronous vs. asynchronous recovery, but I won't get into all of that; I'd encourage you to read more about it at the link above if you're interested. The result of this discussion is that we agreed most of this proposal could be implemented on top of MPI with a shim library. A few new things would be required to make this work. The most pressing is defining a new kind of MPI_Errhandler (or redefining existing MPI_Errhandlers) to allow, essentially, long-jumping back to MPI_Init. The other thing that would be necessary (and this should probably happen regardless) is to allow the application to attach (and remove) multiple error handlers to (from) a communication object.
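
To illustrate the shim idea (and why the errhandler change is needed), here's a rough sketch using today's error handler interface plus setjmp/longjmp. The standard does not currently sanction jumping out of an error handler like this, which is exactly the behavior we would have to define:

    #include <mpi.h>
    #include <setjmp.h>

    static jmp_buf reinit_point;

    /* Error handler that, instead of aborting, resumes execution at the
     * setjmp below. */
    static void jump_back(MPI_Comm *comm, int *errcode, ...)
    {
        (void)comm;
        longjmp(reinit_point, *errcode);
    }

    int main(int argc, char **argv)
    {
        MPI_Errhandler eh;

        MPI_Init(&argc, &argv);
        MPI_Comm_create_errhandler(jump_back, &eh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);

        if (setjmp(reinit_point) != 0) {
            /* A failure was reported. The shim would repair MPI state
             * here, and the application would reload its checkpoint. */
        }

        /* ... normal application work on MPI_COMM_WORLD ... */

        MPI_Finalize();
        return 0;
    }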

The discussion for that proposal will continue going forward, hopefully via emails and con call discussions including LLNL.

The second thing that happened this week was that I visited LBL and presented the ULFM proposal to that group. There wasn't a lot of feedback, but there was one item that we should address: a request for a non-blocking Shrink. While it would certainly be tricky to use, there was a legitimate use case presented by Keita Teranishi from Sandia. He demonstrated that it's helpful for performance reasons to be able to do local cleanup while the Shrink is ongoing.
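
Roughly, the pattern he wants looks like the sketch below. MPIX_Comm_ishrink is a name I'm making up for the requested call; it does not exist in the current proposal:

    /* Hypothetical non-blocking shrink overlapped with local cleanup
     * (assumes <mpi.h> and the MPIX_ extensions as in the sketches above). */
    static void shrink_with_overlap(MPI_Comm broken, MPI_Comm *survivors)
    {
        MPI_Request req;

        MPIX_Comm_ishrink(broken, survivors, &req);   /* hypothetical call */

        /* Rank-local cleanup (freeing buffers tied to failed peers,
         * cancelling pending requests, ...) overlaps the consensus that
         * runs inside the shrink. */

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }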

The last major thing that happened was that on Wednesday we attempted a reading of the ULFM proposal. We were not able to count what we did as an official reading because there were significant changes, particularly in the RMA section but elsewhere as well. After the "reading", we did take a straw poll asking, "After we make the changes that you asked for, how would you vote?" The results were 9 Yes (3 ANL), 4 No (all LLNL), 7 Abstain. Aside from those noted, I think it was about one vote per institution. I didn't make a note of everyone who voted each way, but I think the Yes votes were: Bill, ANL (me, Pavan, Ken), Mellanox, and a few others (all universities, I think). The No votes were all LLNL. The Abstain votes were pretty much all vendors who said that they needed more time to think about this. I think that particular response is similar to the point that Sayantan has already raised: we need to get the labs behind us, which means demonstrating a large application.

Here are the specific comments that we received:

The RMA working group (mostly Bill & Pavan) objected to the guarantees that we were trying to make for data correctness. They didn't want to try to guarantee that data which had only been targeted by read operations was also correct, because there are scenarios where that makes the implementation prohibitively expensive. That resulted in removing a bunch of text. We also added the collective semantics back to MPI_WIN_FREE, to prevent the MPI library from overwriting data after MPI_WIN_FREE completes just because a process failure interrupted the synchronization semantics. The last thing was that we can't require MPI_WIN_FLUSH and MPI_WIN_FLUSH_LOCAL to return an error (we do right now under the definition of synchronization); this is too expensive for the failure-free case.

The semantics of MPI_COMM_REVOKE related to intercommunicators are unclear. Does this enforce disconnecting?

How do you "validate" (using AGREE?) a communicator created with MPI_COMM_CREATE_GROUP?

We can simplify the definition of MPI_INIT. The standard already says that anything that goes wrong before MPI_INIT completes results in Abort by definition. If we want to change that definition, that's a separate (and more complex) discussion.

MPI_FINALIZE shouldn't be forced to return success. It's true that it's possible you won't be able to do any recovery at that point, but you will still know that something bad happened. Masking that doesn't change the fact that there was a failure that you might not be able to do anything about.

We need to be more specific about the definition of MPI_COMM_IAGREE to say that only failures acknowledged *before* the initialization call are excluded from returning errors.
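
In other words, the ordering we need to pin down is the one below (MPIX_ names as in the reference implementation, includes as in the sketches above):

    static void ack_then_agree(MPI_Comm comm)
    {
        int flag = 1;
        MPI_Request req;

        MPIX_Comm_failure_ack(comm);          /* failures acknowledged here ...  */
        MPIX_Comm_iagree(comm, &flag, &req);  /* ... are excluded from the errors
                                               * this agreement reports; failures
                                               * acknowledged later are not.     */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }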

The forum didn't like that we try to define the value of flag (in MPI_COMM_AGREE) when the call returns a failure. This is bad software engineering.

Change the logical AND to a bitwise AND for MPI_COMM_AGREE. There's no additional implementation burden and it's useful. Implementing full datatype support would be overkill, so they're OK with just the bitwise operation.
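
As a small example of why the bitwise version is useful, several independent yes/no conditions can be folded into a single agreement:

    /* Assumes <mpi.h> and the MPIX_ extensions as in the sketches above. */
    static void agree_on_two_things(MPI_Comm comm,
                                    int checkpoint_ok, int buffers_drained)
    {
        int flags = 0;
        if (checkpoint_ok)   flags |= 0x1;
        if (buffers_drained) flags |= 0x2;

        /* With bitwise AND semantics, a bit survives only if it was set
         * on every participating process. */
        MPIX_Comm_agree(comm, &flags);

        if (flags & 0x1) { /* everyone has a checkpoint */ }
        if (flags & 0x2) { /* everyone drained their buffers */ }
    }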

We need to add to the definition of "involved" to mention RMA calls.

In all of the examples, fix the places that treat error codes as if they were error classes (they're not the same thing).

Overall, I actually think the reading went relatively well. There were no technical objections that we couldn't overcome. The main request/concern is that people can't wrap their heads around it, and we need to help them by demonstrating a real application use case. Obviously, that's not an insignificant amount of work, but I also don't know of any better way to convince everyone that this is the best way to do FT.

We'll discuss this more on the next call, but there's the brain dump for now. Thanks for all of your help everyone!

Wesley