[mpiwg-ft] FTWG Forum Update

Fri Mar 7 01:09:12 CST 2014

Hi Wesley, all,

(adding Todd, since I don't think he is signed up for the email list, yet)

On Mar 6, 2014, at 2:07 PM, "Bland, Wesley B." <wbland at anl.gov> wrote:

> FT Working Group Members,
> 
> I wanted to provide an update on everything that’s gone on at the forum. Sorry about the length, but I want to be complete.

Thanks for the summary - a few more comments inline below.

Martin

> We had lots of discussions this week about the FT proposal with both Monday and Wednesday entirely dedicated to fault tolerance (Thanks to those who sat through it and provided feedback!). Monday was general discussion about the proposal. However, most of the new discussion centered around an alternative “proposal” presented by Todd Gamblin from LLNL. I don’t say “proposal” to be glib, only to point out that they weren’t presenting this as a fully formed proposal, just a set of alternative ideas.
> 
> I won’t present the entire proposal from LLNL (though I encourage that discussion to continue), however I will try to summarize and point everyone to the Github page with the header file that includes the spec: https://github.com/tgamblin/mpi-resilience. Again, this isn’t a fully baked proposal and we’ve already had a long discussion that has resulted in the conclusion that this proposal would be entirely reformed before actually being considered, so don’t judge the specifics of the proposal so much as the general ideas.
> 
> What LLNL proposed amounts to native support for bulk synchronous rollback to INIT. Their assertion is that most current codes would not be able to benefit from the ULFM proposal because the barrier to entry is too high. 

It's more than the barrier to entry - our users told us repeatedly (we actually had a discussion on this again today after the forum) that the "n-1" approach is a non-starter for many/most of our codes: the domain decomposition in many cases can't deal with a missing process, and we scale our problems to fully occupy all memory, so one node less makes the problem not runnable. For the rest it is a huge code change. The comment was that if an application isn't designed from scratch to work with "n-1", it will not be worth changing.

> The amount of code that would have to change in order to do something simple, like revoke all communicators, jump back to INIT, rebuild MPI_COMM_WORLD, and continue, is too large.

That would not be the same, though - in this case we still have "n-1" processes. Also, tracking all communicators is not simple and likely requires a PMPI solution, which is also not acceptable. This is a more general problem: the proposal requires a lot of code changes (error checks, repeated execution of collectives, …). Applying it means either a lot of manual changes (which we were also told is a non-starter, especially in the presence of libraries), automatic code transformations (which are very difficult and hard to get accepted), or a PMPI solution (which would invalidate all tools). The last point could be solved, but this would have to be part of the solution.

> For their applications, they don’t need all of the extra stuff because this is the only recovery model that makes sense. By providing a recovery model that supports that behavior, we would be picking the low hanging fruit of making an immediate impact on lots of applications that are currently available.

This is the common case and our argument is that we would be standardizing current practice, which is the target of MPI.

> The specifics of their proposal are that MPI would add a few things to make their recovery model possible. Everything boils down to repairing one of two things: the MPI state, and the application data.

I wouldn't say repair - the idea is to blow away all MPI state and start from a clean slate. This way you don't have to worry about what is still working and what is not.

> To repair MPI state, they propose a function called MPI_Fault which would cause all processes in MPI_COMM_WORLD to jump back to MPI_Reinit (my understanding is that this call is not collective). During that process, MPI also cleans up all of its internal state so everything is back as if it were at the first call to MPI_(Re)init. It also makes an attempt to get MPI_COMM_WORLD back to its original size. It may fail at that, but that behavior would be the same as in MPI_Init.

Just to clarify - even if this "fails" you still successfully run into the entry point. At that point the application can check the size of the restarted COMM_WORLD and decide whether it should continue or not. Whether the MPI library tries to respawn or add a process to replace a lost process or whether it shrinks COMM_WORLD is outside the standard. This is consistent with us not mandating anything about mpirun. This could even be an option to mpirun.
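
To make the control flow concrete, here is a minimal sketch in plain Python (no MPI involved). An exception stands in for the jump back to the entry point, and the names Fault, run_with_reinit, launch_sizes and flaky_app are hypothetical, chosen only for illustration. How the runtime arrives at the new world size is deliberately left open, as described above.

```python
class Fault(Exception):
    """Stands in for the proposed MPI_Fault: unwind back to the entry point."""


def run_with_reinit(app, launch_sizes, required_size):
    """Run app(size, attempt) once per (re)start.

    launch_sizes is a hypothetical stand-in for the world size the runtime
    manages to rebuild on each (re)start; how it gets there is not modeled.
    """
    for attempt, size in enumerate(launch_sizes):
        # Each pass starts from a clean slate; nothing from the previous
        # attempt is carried over.
        if size < required_size:
            # The entry point was reached, but the application decides
            # that it cannot continue with this world size.
            return ("abort", attempt, size)
        try:
            return ("done", attempt, app(size, attempt))
        except Fault:
            continue  # "jump back" to the entry point and try again
    return ("exhausted", len(launch_sizes), None)


def flaky_app(size, attempt):
    # Hypothetical application: the first attempt hits a failure,
    # any restart runs to completion.
    if attempt == 0:
        raise Fault()
    return size * 10
```

With launch_sizes of [4, 4] the second attempt completes; with [4, 3] the application sees the smaller world after the restart and chooses to stop.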

> This ensures that MPI is clean again and the user can continue on as if they were restarted from checkpoint, just without actually being restarted and losing the batch reservation. To repair application state, a new idea is introduced called MPI_Cleanup_handlers that the application would register whenever they have something that they need to be cleaned up when MPI_Fault is called. For this particular idea, we agreed that we could probably push this out of MPI since it doesn’t touch MPI internal state.

I agree - at least at a first go around this could and should be separated out.
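
As a rough illustration of why this can live outside MPI, a cleanup registry needs nothing but application-level bookkeeping. The sketch below is plain Python; CleanupRegistry and its methods are hypothetical names, and running the handlers in reverse registration order is my assumption, not something the proposal specifies.

```python
class CleanupRegistry:
    """Application-level list of cleanup callbacks, independent of MPI."""

    def __init__(self):
        self._handlers = []

    def register(self, handler):
        self._handlers.append(handler)
        return handler

    def unregister(self, handler):
        self._handlers.remove(handler)

    def run_all(self):
        # Assumption: run in reverse registration order, so that later
        # resources are released before the ones they may depend on.
        ran = 0
        while self._handlers:
            handler = self._handlers.pop()
            handler()
            ran += 1
        return ran
```

The application would call run_all() at the point where it learns of the fault, before re-entering its main computation.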

> 
> There was some other stuff about this proposal related to synchronous vs. asynchronous recovery and some other things, but I won’t get into all of that. I’d encourage you to read more about it in the link above if you’re interested. The result of this discussion is that we agreed that most of this proposal could be implemented on top of MPI

I think you mean ULFM?

> with a shim library.

To be honest, I am still not 100% convinced about this. Even if you could, it would be hard to get this right and would cost quite some tracking overhead, and there is the issue that a shim library is not an option unless we solve the PMPI issue. Also, as a more general comment, this means we would make the most common case, which is common practice for almost all applications, hard, while standardizing the uncommon case.

> There would be a few new things required to make this work, the most pressing would be defining a new kind of MPI_Errhandler (or redefining existing MPI_Errhandlers) to allow essentially long jumping back to MPI_Init. The other thing that would be necessary (and this should happen regardless probably) is to allow the application to attach (and remove) multiple error handlers to (from) a communication object.

I agree - that should be done in any case.
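
For illustration only, the attach/remove semantics could look like the following plain-Python sketch. CommLike, attach, detach and report_error are hypothetical names rather than MPI calls, and invoking the handlers in attach order is an assumption; the actual ordering would have to be defined by the forum.

```python
class CommLike:
    """Toy stand-in for a communication object with several error handlers."""

    def __init__(self):
        self._handlers = []

    def attach(self, handler):
        self._handlers.append(handler)

    def detach(self, handler):
        self._handlers.remove(handler)

    def report_error(self, code):
        # Assumption: handlers run in attach order. Iterating over a copy
        # lets a handler detach itself while being called.
        return [handler(code) for handler in list(self._handlers)]
```

This would let a library and the application each keep their own handler on the same object without overwriting one another.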

> The discussion for that proposal will continue going forward, hopefully via emails and con call discussions including LLNL.

Hopefully the above is a starting point.

> The second thing happening this week was that I visited LBL and presented the ULFM proposal to that group. There wasn’t a lot of feedback from that group, but there was one item that we should address. There was a request for a non-blocking Shrink. While it would certainly be tricky to use, there was a legitimate use case presented by Keita Teranishi from Sandia. He demonstrated that it’s helpful for performance reasons to be able to do local cleanup while the Shrink is ongoing.
> 
> The last major thing that happened was that on Wednesday we attempted a reading of the ULFM proposal. We were not able to count what we did as an official reading because there were significant changes, specifically in the RMA section, but elsewhere as well. After the “reading”, we did take a straw poll to ask, “After we make the changes that you asked for, how would you vote?” The results were 9 Yes (3 ANL), 4 No (all LLNL), 7 Abstain. Other than the notations, I think it was about 1 vote per institution. I didn’t make a note of everyone who voted each way, but I think the Yes votes were: Bill, ANL (Me, Pavan, Ken), Mellanox, and a few others (all universities I think). The No votes were all LLNL. The Abstain votes were pretty much all vendors

Well, not quite true - you also had EPCC in that group, who voiced concerns very similar to ours after presenting ULFM to their users. LANL was not in the room (and Nathan had similar comments). I forget in which group Sandia fell.

> who said that they needed more time to think about this. I think that particular response is similar to the point that Sayantan has already raised, that we need to get the labs behind us, which means demonstrating a large application.

that is not Monte Carlo or master/slave. 

Also, we talked about running performance tests at large scale (so far I have seen results for at most 4K processes) on more representative sets of benchmarks to understand the impact of including ULFM even if applications don't use it. As far as I know, Todd has started the process to get you and Aurelien accounts on our large (22K core) cluster at LLNL. Nathan also mentioned that LANL has an even larger open cluster to which you may even still have access.

> Here’s the specific comments that we received:
> 
> The RMA working group (mostly Bill & Pavan) objected to the guarantees that we were trying to make for data correctness. They didn’t want to try to guarantee that data that had only been targeted by reading operations was also correct because there were scenarios where that made the implementation prohibitively expensive. That resulted in removing a bunch of text. We also added the collective semantics back to MPI_WIN_FREE to prevent the MPI library from overwriting data after MPI_WIN_FREE was done just because a process failure stopped the synchronization semantics. The last thing was that we can’t require MPI_WIN_FLUSH and MPI_WIN_FLUSH_LOCAL to return an error (we do right now under the definition of synchronization). This is too expensive for the failure-free case.
> 
> The semantics of MPI_COMM_REVOKE related to intercommunicators are unclear. Does this enforce disconnecting?
> 
> How do you “validate” (using AGREE?) a communicator created with MPI_COMM_CREATE_GROUP?
> 
> We can simplify the definition of MPI_INIT. The standard already says that anything that happens before MPI_INIT completes will call Abort by definition. If we want to change that definition, that’s separate (and more complex).
> 
> MPI_FINALIZE shouldn’t be forced to return success. It’s true that it’s possible you won’t be able to do any recovery at that point, but you will still know that something bad happened. Masking that doesn’t change the fact that there was a failure that you might not be able to do anything about.
> 
> We need to be more specific about the definition of MPI_COMM_IAGREE to say that only failures acknowledged *before* the initialization call are excluded from returning errors.
> 
> The forum didn’t like that we try to define the value of flag (in MPI_COMM_AGREE) when the call returns a failure. This is bad software engineering.
> 
> Change the logical AND to a bitwise AND for MPI_COMM_AGREE. There’s no additional implementation burden and it’s useful. Implementing full datatype support would be overkill so they’re ok with that.
> 
> We need to add to the definition of “involved” to mention RMA calls.
> 
> Fix error codes != error classes in all of the examples.

Clearly define interaction with MPI_T
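
On the MPI_COMM_AGREE item in the list above (logical AND vs. bitwise AND), a small sketch of the difference may help. This is plain Python, not MPI; agree_logical and agree_bitwise are hypothetical helpers that combine the per-process flag values the way each reduction would, assuming a non-empty list of integer flags.

```python
from functools import reduce
from operator import and_


def agree_logical(flags):
    """Logical AND: 1 only if every process contributed a nonzero flag."""
    return int(all(flag != 0 for flag in flags))


def agree_bitwise(flags):
    """Bitwise AND: keeps exactly the bits that every process has set.

    Assumes flags is a non-empty list of ints.
    """
    return reduce(and_, flags)
```

With a bitwise AND each bit can carry an independent yes/no question, so one agreement call can settle several decisions at once. The logical form collapses everything to a single bit and can report agreement even when no single bit is common to all contributions, e.g. for the inputs 0b01 and 0b10.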

> 
> Overall, I actually think the reading went relatively well. There were no technical objections that we couldn’t overcome. The main request/concern is that people can’t wrap their head around it and we need to help them by demonstrating a real application use case. Obviously, that’s not an insignificant amount of work, but I also don’t know any better way of convincing everyone that this is the best way to do FT.
> 
> We’ll discuss this more on the next call, but there’s the brain dump for now. Thanks for all of your help everyone!
> 
> Wesley
> _______________________________________________
> mpiwg-ft mailing list
> mpiwg-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft

________________________________________________________________________
Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
CASC @ Lawrence Livermore National Laboratory, Livermore, USA