[mpiwg-ft] MPI Forum Wrap-Up

Jeff Hammond jeff.science at gmail.com
Mon Dec 14 09:58:01 CST 2015


On Mon, Dec 14, 2015 at 7:11 AM, Bland, Wesley <wesley.bland at intel.com>
wrote:

> Hi Jeff. Thanks for the comments.
>
> On Dec 14, 2015, at 8:19 AM, Jeff Hammond <jeff.science at gmail.com<mailto:
> jeff.science at gmail.com>> wrote:
>
>
>
> On Fri, Dec 11, 2015 at 2:09 PM, Bland, Wesley <wesley.bland at intel.com
> <mailto:wesley.bland at intel.com>> wrote:
> Hi WG,
>
> I’ve put together some notes on the goings on around our working group at
> the forum. You can find them all on the wiki page:
>
> https://github.com/mpiwg-ft/ft-issues/wiki/2015-12-07
>
> Since I know that the click-through is not always practical, I’ll copy
> them below.
>
> Thanks,
> Wesley
>
> ====
>
> WG Meeting
>
>   *   Went over reading for plenary time
>   *   Aurelien and Keita presented some of the results of the ULFM BoF at
> SC
>      *   Attendance was great
>      *   There were a few questions and suggestions to improve the
> proposal.
>         *   Aurelien is creating issues for suggests that we will act on.
>   *   We discussed an overall view of what fault tolerance and error
> handling means in the context of MPI and how we cover each area as a
> standard
>      *   We divided applications into a few buckets:
>         *   Current applications - This describes the vast majority of
> applications which require that all process remain alive and recovery tends
> to be global.
>            *   These apps tend to use/require recovery very similar to
> checkpoint/restart.
>            *   They probably don't derive a lot of benefit from ULFM-like
> recovery models, but could potentially benefit from improved error handlers.
>
> Just remember that there can be ULFM + { hot spares -or-
> MPI_Comm_spawn(_multiple) } + checkpoint-restart, which preserves the size
> of the job.
>
> Lots of those fault-intolerant apps can do something like this with
> minimal changes.  The apps that cannot use ULFM are the ones that require
> expensive message logging to be able to roll-back to a coherent state.
>
> Agreed. We had some discussion of this in the room. The gist of the
> conversation was that between ULFM and existing C/R, most of these cases
> are covered in a reasonable way.
>
>
Sorry I missed this discussion f2f.  I had a reason, but I cannot remember
what it was.


>
>         *   In-memory Checkpoint/Restart - These apps can use in-memory
> checkpointing to improve both checkpoint and recovery times. They usually
> need to replace failed processes, but don't require that all remain alive.
>            *   ULFM is a possibility here, but can result in bad locality
> without a library which will automatically move processes around after a
> failure.
>
> Maybe.  I'm not sure good locality exists on fat-tree and dragonfly
> topologies.  And users can always renumber their ranks and move data around
> if they know topological placement matters.
>
> That’s true. What we were talking about was having MPI be able to do that
> for you. It’s always possible for applications to do this themselves (as
> Aurelien pointed out in the room), but sometimes its such a pain that if
> MPI already has that information, it might be better to just provide it.
>
>
>            *   Reinit / multi-init/finalize with improved PMPI would also
> work. There are some proposals going on or that have gone on which could
> also provide the needed functionality. In these proposals, most of the
> locality problems would probably be pushed into the MPI library when
> initialized again.
>         *   New applications - These apps tend to be able to run with
> fewer processes. They cover apps like tasking models, master/worker apps,
> and traditionally non-MPI apps that might be interested in the future
> (Hadoop, etc.).
>
> Lots of current applications are master-worker…
>
> Sure. New here could just mean less than 30 years old. :)
>
>
Are you calling me old? :-)


>
>            *   ULFM generally would apply well to these applications as
> locality is less important if processes are not being replaced.
>      *   There are also errors that do not include process failures:
>         *   Memory errors
>
> It is useful to distinguish causes and effects here.  Memory errors are a
> cause.  Process failure is one effect.  Another effect is non-fatal data
> corruption, which may or may not be silent.  Currently, we see that memory
> errors that are detected manifest as process failures, but hopefully
> someday the OS/RT people will figure out better things to do than just call
> abort().  Ok, to be fair, it's probably the firmware/driver people throwing
> the self-destruct lever…
>
> I should have been more precise. The errors we were talking about here are
> the ones that don’t result in process failure. We’re talking more about SDC
> types of errors.
>

Ah ok.  Thanks.


>
>
>            *   These could be detected by anything, but ULFM revoke could
> help with notification.
>            *   Lots of SDC research is out there that sits on top of MPI.
>         *   Network errors
>
> I am curious how often this actually happens anymore.  Aren't all modern
> networks capable of routing around permanently failed links?  A switch
> failure should be fatal but that happens how often?
>
> Agreed. That’s why we didn’t focus on it too much. This is something that
> we generally push down to the implementation (or lower).
>
>
>            *   These tend to be masked by the implementation or promoted
> to process failures
>         *   Resource exhaustion
>            *   These sorts of errors cover out of memory, out of context
> IDs, etc.
>            *   They can be improved with better error
> handlers/codes/classes
>
> Yes, and this will be wildly useful.  Lots of users lose jobs because they
> do dumb stuff with communicators that could be mitigated with slower
> fallback implementations over p2p (please ignore the apparent
> false-dichotomy here).
>
>   *   Discussed some new topics related to error handling and error
> codes/classes
>      *   Pavan expressed interest in error codes saying whether they were
> catastrophic or not.
>         *   This resulted mpi-forum/mpi-issues#28<
> https://github.com/mpi-forum/mpi-issues/issues/28> where we add a new
> call MPI_ERROR_IS_CATASTROPHIC.
>
> <https://github.com/mpiwg-ft/ft-issues/wiki/2015-12-07#plenary-time>Plenary
> Time
>
>   *   Read the error handler cleanup tickets mpi-forum/mpi-issues#1<
> https://github.com/mpi-forum/mpi-issues/issues/1> and
> mpi-forum/mpi-issues#3<https://github.com/mpi-forum/mpi-issues/issues/3>.
>      *   The forum didn't like where we removed all of the text about
> general errors. They considered some of it to still be valuable and should
> be updated. In particular, the example about MPI_RSEND could still be
> applicable if the implementation decides that it wants to return an error
> to the user because the MPI_RECV was not posted.
>      *   We need to add text for MPI_INTERCOMM_CREATE.
>      *   A few other minor things were added directly to the pull request.
>   *   Read the MPI_COMM_FREE advice ticket.
>      *   No concerns, will vote at next meeting.
>   *   Presented the plenary about catastrophic errors.
>      *   Few concerns were raised during the plenary. The main one was
> from Bill who says we should look at how other standards describe non-fatal
> errors when writing the text here.
>
> I'm skeptical that this is going to help us, but here are some references:
> - Fortran 2008 section 14.6 Halting (not going to be useful to us,
> although its utility for users is demonstrated in NOTE 14.16); section
> 2.3.5 describes how any error propagates to all images (in a coarray
> program).  There is no notion of fault-tolerance here.
> - UPC and OpenSHMEM say nothing about fault-tolerance.  I suspect that UPC
> never will and OpenSHMEM will try to learn from the MPI Forum.
> - C++14 draft (N3936) chapter 19 is all about exceptions and errors.
> 30.2.2 talks about thread failure.
> - IB 1.0 7.12.2 and 7.12.3, among other places.
>
> Most specifications that I've read don't have all of the baggage we do,
> but only because most of their objects are not so stateful or long-lived.
> Now, if MPI was based upon connections rather than communicators…
>
> Thanks for the pointers. Maybe after the new year, we can do some homework
> and see what lessons we can take from here. Note that we weren’t talking
> about (non)catastrophic in terms of fault tolerance. It was mostly just for
> correct error handling and to let MPI remain defined in a few more error
> states than it already does (none).
>
>
>0 is never a bad thing :-)

Jeff


> Thanks,
> Wesley
>
>
>      *   Ryan asked about the general usefulness of this proposal in terms
> of how an application would be able to respond to information about whether
> an error is fatal or not.
>         *   He asserts that error classes should generally be descriptive
> enough without it and if they aren't, the error class itself should be
> improved.
>
> Best,
>
> Jeff
>
> _______________________________________________
> mpiwg-ft mailing list
> mpiwg-ft at lists.mpi-forum.org<mailto:mpiwg-ft at lists.mpi-forum.org>
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com<mailto:jeff.science at gmail.com>
> http://jeffhammond.github.io/
> _______________________________________________
> mpiwg-ft mailing list
> mpiwg-ft at lists.mpi-forum.org<mailto:mpiwg-ft at lists.mpi-forum.org>
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
>
> _______________________________________________
> mpiwg-ft mailing list
> mpiwg-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
>



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mpi-forum.org/pipermail/mpiwg-ft/attachments/20151214/0422e4da/attachment-0001.html>


More information about the mpiwg-ft mailing list