[mpiwg-ft] MPI Forum Wrap-Up

Jeff Hammond jeff.science at gmail.com
Mon Dec 14 08:19:05 CST 2015

On Fri, Dec 11, 2015 at 2:09 PM, Bland, Wesley <wesley.bland at intel.com> wrote:

> Hi WG,
> I’ve put together some notes on the goings on around our working group at
> the forum. You can find them all on the wiki page:
> https://github.com/mpiwg-ft/ft-issues/wiki/2015-12-07
> Since I know that the click-through is not always practical, I’ll copy
> them below.
> Thanks,
> Wesley
> ====
> WG Meeting
>   *   Went over reading for plenary time
>   *   Aurelien and Keita presented some of the results of the ULFM BoF at
> SC
>      *   Attendance was great
>      *   There were a few questions and suggestions to improve the
> proposal.
>         *   Aurelien is creating issues for suggestions that we will act on.
>   *   We discussed an overall view of what fault tolerance and error
> handling means in the context of MPI and how we cover each area as a
> standard
>      *   We divided applications into a few buckets:
>         *   Current applications - This describes the vast majority of
> applications which require that all processes remain alive and recovery
> tends to be global.
>            *   These apps tend to use/require recovery very similar to
> checkpoint/restart.
>            *   They probably don't derive a lot of benefit from ULFM-like
> recovery models, but could potentially benefit from improved error handlers.

Just remember that there can be ULFM + { hot spares -or-
MPI_Comm_spawn(_multiple) } + checkpoint-restart, which preserves the size
of the job.

Lots of those fault-intolerant apps can do something like this with minimal
changes.  The apps that cannot use ULFM are the ones that require expensive
message logging to be able to roll back to a coherent state.
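
As a rough illustration of the hot-spares variant, the bookkeeping that keeps
the job size constant can be modeled without any MPI calls.  The sketch below
is a toy under stated assumptions: the helper name remap_with_spares, the
alive[] array, and the convention that spares occupy the highest physical
ranks are all hypothetical.  In a real ULFM code the set of survivors would
come from the library, and the state of each replaced rank would be restored
from its checkpoint.

```c
#include <assert.h>

/* Hypothetical helper, not part of MPI or ULFM.
 * Logical ranks 0..n_work-1 are the ranks the application sees.
 * Physical ranks n_work..n_total-1 start out as idle hot spares.
 * alive[p] is nonzero if physical rank p survived the failure.
 * On success, logical_to_physical[l] names the process serving logical
 * rank l and 0 is returned; -1 means there were too few spares. */
int remap_with_spares(const int *alive, int n_work, int n_total,
                      int *logical_to_physical)
{
    assert(n_work >= 0 && n_work <= n_total);
    int next_spare = n_work;
    for (int l = 0; l < n_work; ++l) {
        if (alive[l]) {
            logical_to_physical[l] = l;   /* survivor keeps its slot */
            continue;
        }
        /* Skip spares that have failed as well. */
        while (next_spare < n_total && !alive[next_spare])
            ++next_spare;
        if (next_spare == n_total)
            return -1;                    /* cannot preserve the job size */
        logical_to_physical[l] = next_spare++;
    }
    return 0;
}
```

With 4 workers and 2 spares, if worker 1 and the first spare have both failed,
logical rank 1 ends up served by physical rank 5 and every other logical rank
keeps its place, so the application still sees 4 ranks.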

>         *   In-memory Checkpoint/Restart - These apps can use in-memory
> checkpointing to improve both checkpoint and recovery times. They usually
> need to replace failed processes, but don't require that all remain alive.
>            *   ULFM is a possibility here, but can result in bad locality
> without a library which will automatically move processes around after a
> failure.

Maybe.  I'm not sure good locality exists on fat-tree and dragonfly
topologies.  And users can always renumber their ranks and move data around
if they know topological placement matters.
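
Renumbering of this kind is typically done by passing a chosen key to
MPI_Comm_split, which orders the new ranks by key and breaks ties by rank in
the old communicator.  The sketch below models only that ordering in plain C,
with no MPI calls.  The node_of[] array (which node each rank sits on) and the
function name are hypothetical; an application would have to obtain the
placement information itself.

```c
#include <assert.h>

/* Compute the rank each process would get if ranks were reordered by
 * (node id, old rank).  This mirrors the ordering MPI_Comm_split gives
 * when every process passes the same color and key = its node id.
 * O(n^2), which is fine for a sketch. */
void ranks_grouped_by_node(const int *node_of, int n, int *new_rank)
{
    assert(n >= 0);
    for (int r = 0; r < n; ++r) {
        int before = 0;   /* how many processes sort ahead of r */
        for (int q = 0; q < n; ++q) {
            if (node_of[q] < node_of[r] ||
                (node_of[q] == node_of[r] && q < r))
                ++before;
        }
        new_rank[r] = before;
    }
}
```

For four ranks placed on nodes {1, 0, 1, 0}, the two ranks on node 0 become
new ranks 0 and 1 and the two on node 1 become 2 and 3, so neighbors in rank
space share a node again.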

>            *   Reinit / multi-init/finalize with improved PMPI would also
> work. There are some proposals going on or that have gone on which could
> also provide the needed functionality. In these proposals, most of the
> locality problems would probably be pushed into the MPI library when
> initialized again.
>         *   New applications - These apps tend to be able to run with
> fewer processes. They cover apps like tasking models, master/worker apps,
> and traditionally non-MPI apps that might be interested in the future
> (Hadoop, etc.).

Lots of current applications are master-worker...

>            *   ULFM generally would apply well to these applications as
> locality is less important if processes are not being replaced.
>      *   There are also errors that do not include process failures:
>         *   Memory errors

It is useful to distinguish causes and effects here.  Memory errors are a
cause.  Process failure is one effect.  Another effect is non-fatal data
corruption, which may or may not be silent.  Currently, we see that memory
errors that are detected manifest as process failures, but hopefully
someday the OS/RT people will figure out better things to do than just call
abort().  Ok, to be fair, it's probably the firmware/driver people throwing
the self-destruct lever...

>            *   These could be detected by anything, but ULFM revoke could
> help with notification.
>            *   Lots of SDC research is out there that sits on top of MPI.
>         *   Network errors

I am curious how often this actually happens anymore.  Aren't all modern
networks capable of routing around permanently failed links?  A switch
failure should be fatal but that happens how often?

>            *   These tend to be masked by the implementation or promoted
> to process failures
>         *   Resource exhaustion
>            *   These sorts of errors cover out of memory, out of context
> IDs, etc.
>            *   They can be improved with better error
> handlers/codes/classes

Yes, and this will be wildly useful.  Lots of users lose jobs because they
do dumb stuff with communicators that could be mitigated with slower
fallback implementations over p2p (please ignore the apparent
false-dichotomy here).
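
The fallback idea can be sketched as a toy model in plain C.  Everything here
is hypothetical: the ctx_pool type stands in for a finite pool of context IDs
inside an MPI library, and the error codes are made up.  In MPI terms this
would correspond to installing MPI_ERRORS_RETURN on the communicator and
checking the return code of the communicator-creating call; whether an
implementation reports exhaustion in a recoverable way is what the improved
error classes would have to guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes; a real code would inspect MPI error classes. */
enum { CTX_OK = 0, CTX_EXHAUSTED = 1 };

/* Stand-in for a finite pool of context IDs inside an MPI library. */
typedef struct { int free_ids; } ctx_pool;

/* Report exhaustion to the caller instead of aborting the job. */
static int ctx_acquire(ctx_pool *pool)
{
    assert(pool != NULL && pool->free_ids >= 0);
    if (pool->free_ids == 0)
        return CTX_EXHAUSTED;
    pool->free_ids--;
    return CTX_OK;
}

typedef enum { PATH_NEW_COMM, PATH_P2P_FALLBACK } comm_path;

/* Use a fresh communicator when one is available; otherwise fall back
 * to a slower point-to-point path on an existing communicator. */
comm_path choose_comm_path(ctx_pool *pool)
{
    return ctx_acquire(pool) == CTX_OK ? PATH_NEW_COMM : PATH_P2P_FALLBACK;
}
```

The point of the design is only that the resource error is returned and acted
on, so the job slows down rather than dying.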

>   *   Discussed some new topics related to error handling and error
> codes/classes
>      *   Pavan expressed interest in error codes saying whether they were
> catastrophic or not.
>         *   This resulted in mpi-forum/mpi-issues#28<
> https://github.com/mpi-forum/mpi-issues/issues/28> where we add a new
>
> Plenary Time
>   *   Read the error handler cleanup tickets mpi-forum/mpi-issues#1<
> https://github.com/mpi-forum/mpi-issues/issues/1> and
> mpi-forum/mpi-issues#3<https://github.com/mpi-forum/mpi-issues/issues/3>.
>      *   The forum didn't like that we removed all of the text about
> general errors. They considered some of it to still be valuable and thought
> it should be updated. In particular, the example about MPI_RSEND could
> still be applicable if the implementation decides that it wants to return
> an error to the user because the MPI_RECV was not posted.
>      *   We need to add text for MPI_INTERCOMM_CREATE.
>      *   A few other minor things were added directly to the pull request.
>   *   Read the MPI_COMM_FREE advice ticket.
>      *   No concerns, will vote at next meeting.
>   *   Presented the plenary about catastrophic errors.
>      *   Few concerns were raised during the plenary. The main one was
> from Bill who says we should look at how other standards describe non-fatal
> errors when writing the text here.

I'm skeptical that this is going to help us, but here are some references:
- Fortran 2008 section 14.6 Halting (not going to be useful to us, although
its utility for users is demonstrated in NOTE 14.16); section 2.3.5
describes how any error propagates to all images (in a coarray program).
There is no notion of fault-tolerance here.
- UPC and OpenSHMEM say nothing about fault-tolerance.  I suspect that UPC
never will and OpenSHMEM will try to learn from the MPI Forum.
- C++14 draft (N3936) chapter 19 is all about exceptions and errors.
 30.2.2 talks about thread failure.
- IB 1.0 7.12.2 and 7.12.3, among other places.

Most specifications that I've read don't have all of the baggage we do, but
only because most of their objects are not so stateful or long-lived.  Now,
if MPI was based upon connections rather than communicators...

>      *   Ryan asked about the general usefulness of this proposal in terms
> of how an application would be able to respond to information about whether
> an error is fatal or not.
>         *   He asserts that error classes should generally be descriptive
> enough without it and if they aren't, the error class itself should be
> improved.




Jeff Hammond
jeff.science at gmail.com