[mpiwg-ft] MPI Forum Wrap-Up

Bland, Wesley wesley.bland at intel.com
Mon Dec 14 09:11:28 CST 2015

Hi Jeff. Thanks for the comments.

On Dec 14, 2015, at 8:19 AM, Jeff Hammond <jeff.science at gmail.com> wrote:

On Fri, Dec 11, 2015 at 2:09 PM, Bland, Wesley <wesley.bland at intel.com> wrote:
Hi WG,

I’ve put together some notes on the goings on around our working group at the forum. You can find them all on the wiki page:


Since I know that the click-through is not always practical, I’ll copy them below.



WG Meeting

  *   Went over reading for plenary time
  *   Aurelien and Keita presented some of the results of the ULFM BoF at SC
     *   Attendance was great
     *   There were a few questions and suggestions to improve the proposal.
        *   Aurelien is creating issues for the suggestions that we will act on.
  *   We discussed an overall view of what fault tolerance and error handling mean in the context of MPI and how we cover each area as a standard
     *   We divided applications into a few buckets:
        *   Current applications - This describes the vast majority of applications, which require that all processes remain alive and whose recovery tends to be global.
           *   These apps tend to use/require recovery very similar to checkpoint/restart.
           *   They probably don't derive a lot of benefit from ULFM-like recovery models, but could potentially benefit from improved error handlers.

Just remember that there can be ULFM + { hot spares -or- MPI_Comm_spawn(_multiple) } + checkpoint-restart, which preserves the size of the job.

Lots of those fault-intolerant apps can do something like this with minimal changes.  The apps that cannot use ULFM are the ones that require expensive message logging to be able to roll-back to a coherent state.

Agreed. We had some discussion of this in the room. The gist of the conversation was that between ULFM and existing C/R, most of these cases are covered in a reasonable way.

        *   In-memory Checkpoint/Restart - These apps can use in-memory checkpointing to improve both checkpoint and recovery times. They usually need to replace failed processes, but don't require that all remain alive.
           *   ULFM is a possibility here, but can result in bad locality without a library which will automatically move processes around after a failure.

Maybe.  I'm not sure good locality exists on fat-tree and dragonfly topologies.  And users can always renumber their ranks and move data around if they know topological placement matters.

That’s true. What we were talking about was having MPI be able to do that for you. It’s always possible for applications to do this themselves (as Aurelien pointed out in the room), but sometimes it’s such a pain that if MPI already has that information, it might be better to just provide it.
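
As a rough sketch of the application-side renumbering (plain Python rather than MPI calls; `renumber_survivors` and the example set of failed ranks are made up for illustration), the survivors can simply be given dense new ranks in their original order:

```python
def renumber_survivors(world_size, failed):
    """Map each surviving old rank to a dense new rank, preserving order.

    world_size: number of ranks before the failure.
    failed: iterable of old ranks that are assumed to have failed.
    """
    failed = set(failed)
    mapping = {}
    for old_rank in range(world_size):
        if old_rank not in failed:
            # The next free new rank is the number of survivors seen so far.
            mapping[old_rank] = len(mapping)
    return mapping


# Example: 8 ranks, ranks 2 and 5 have failed.
print(renumber_survivors(8, {2, 5}))
# prints {0: 0, 1: 1, 3: 2, 4: 3, 6: 4, 7: 5}
```

Anything smarter than this, such as taking node placement into account, needs topology information that the application may not have, which is the argument for having MPI provide it.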

           *   Reinit / multi-init/finalize with improved PMPI would also work. There are some proposals going on or that have gone on which could also provide the needed functionality. In these proposals, most of the locality problems would probably be pushed into the MPI library when initialized again.
        *   New applications - These apps tend to be able to run with fewer processes. They cover apps like tasking models, master/worker apps, and traditionally non-MPI apps that might be interested in the future (Hadoop, etc.).

Lots of current applications are master-worker…

Sure. New here could just mean less than 30 years old. :)

           *   ULFM generally would apply well to these applications as locality is less important if processes are not being replaced.
     *   There are also errors that do not include process failures:
        *   Memory errors

It is useful to distinguish causes and effects here.  Memory errors are a cause.  Process failure is one effect.  Another effect is non-fatal data corruption, which may or may not be silent.  Currently, we see that memory errors that are detected manifest as process failures, but hopefully someday the OS/RT people will figure out better things to do than just call abort().  Ok, to be fair, it's probably the firmware/driver people throwing the self-destruct lever…

I should have been more precise. The errors we were talking about here are the ones that don’t result in process failure. We’re talking more about SDC types of errors.

           *   These could be detected by anything, but ULFM revoke could help with notification.
           *   Lots of SDC research is out there that sits on top of MPI.
        *   Network errors

I am curious how often this actually happens anymore.  Aren't all modern networks capable of routing around permanently failed links?  A switch failure should be fatal but that happens how often?

Agreed. That’s why we didn’t focus on it too much. This is something that we generally push down to the implementation (or lower).

           *   These tend to be masked by the implementation or promoted to process failures
        *   Resource exhaustion
           *   These sorts of errors cover out of memory, out of context IDs, etc.
           *   They can be improved with better error handlers/codes/classes

Yes, and this will be wildly useful.  Lots of users lose jobs because they do dumb stuff with communicators that could be mitigated with slower fallback implementations over p2p (please ignore the apparent false-dichotomy here).

  *   Discussed some new topics related to error handling and error codes/classes
     *   Pavan expressed interest in error codes saying whether they were catastrophic or not.
        *   This resulted in mpi-forum/mpi-issues#28 <https://github.com/mpi-forum/mpi-issues/issues/28>, where we add a new call MPI_ERROR_IS_CATASTROPHIC.

Plenary Time <https://github.com/mpiwg-ft/ft-issues/wiki/2015-12-07#plenary-time>

  *   Read the error handler cleanup tickets mpi-forum/mpi-issues#1 <https://github.com/mpi-forum/mpi-issues/issues/1> and mpi-forum/mpi-issues#3 <https://github.com/mpi-forum/mpi-issues/issues/3>.
     *   The forum didn't like that we removed all of the text about general errors. They considered some of it still valuable and thought it should be updated rather than removed. In particular, the example about MPI_RSEND could still be applicable if the implementation decides that it wants to return an error to the user because the MPI_RECV was not posted.
     *   We need to add text for MPI_INTERCOMM_CREATE.
     *   A few other minor things were added directly to the pull request.
  *   Read the MPI_COMM_FREE advice ticket.
     *   No concerns, will vote at next meeting.
  *   Presented the plenary about catastrophic errors.
     *   A few concerns were raised during the plenary. The main one was from Bill, who said we should look at how other standards describe non-fatal errors when writing the text here.

I'm skeptical that this is going to help us, but here are some references:
- Fortran 2008 section 14.6 Halting (not going to be useful to us, although its utility for users is demonstrated in NOTE 14.16); section 2.3.5 describes how any error propagates to all images (in a coarray program).  There is no notion of fault-tolerance here.
- UPC and OpenSHMEM say nothing about fault-tolerance.  I suspect that UPC never will and OpenSHMEM will try to learn from the MPI Forum.
- C++14 draft (N3936) chapter 19 is all about exceptions and errors.  30.2.2 talks about thread failure.
- IB 1.0 7.12.2 and 7.12.3, among other places.

Most specifications that I've read don't have all of the baggage we do, but only because most of their objects are not so stateful or long-lived.  Now, if MPI was based upon connections rather than communicators…

Thanks for the pointers. Maybe after the new year, we can do some homework and see what lessons we can take from here. Note that we weren’t talking about (non)catastrophic in terms of fault tolerance. It was mostly just for correct error handling and to let MPI remain defined in a few more error states than it already does (none).


     *   Ryan asked about the general usefulness of this proposal in terms of how an application would be able to respond to information about whether an error is fatal or not.
        *   He asserts that error classes should generally be descriptive enough without it and if they aren't, the error class itself should be improved.
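
To make Ryan's question concrete, here is a minimal sketch (plain Python, not an MPI binding) of how an application might combine an error class with a catastrophic/non-catastrophic flag. MPI_ERR_NO_MEM is an existing MPI error class and MPI_ERR_PROC_FAILED and MPI_ERR_REVOKED are the names used in the ULFM proposal, but the policy table, the action names, and `choose_action` are assumptions made up for illustration. In particular, nothing here reflects the actual signature proposed for MPI_ERROR_IS_CATASTROPHIC.

```python
# Hypothetical policy: which recovery action the application takes for an
# error class when MPI is still usable. The mapping is an assumption.
RECOVERY_POLICY = {
    "MPI_ERR_NO_MEM": "free_caches_and_retry",
    "MPI_ERR_PROC_FAILED": "shrink_and_continue",
    "MPI_ERR_REVOKED": "shrink_and_continue",
}


def choose_action(error_class, catastrophic):
    """Pick a recovery action from an error class name and a catastrophic flag."""
    if catastrophic:
        # MPI state is assumed unusable, so the class no longer matters.
        return "restart_from_checkpoint"
    # Unknown classes fall back to aborting, as most applications do today.
    return RECOVERY_POLICY.get(error_class, "abort")


print(choose_action("MPI_ERR_NO_MEM", catastrophic=False))
print(choose_action("MPI_ERR_NO_MEM", catastrophic=True))
```

In this sketch the flag only changes the outcome when the same class can be reported in both catastrophic and non-catastrophic situations. If every class were always one or the other, the flag would be redundant, which is essentially Ryan's point.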



mpiwg-ft mailing list
mpiwg-ft at lists.mpi-forum.org

Jeff Hammond
jeff.science at gmail.com