[mpiwg-ft] MPI Forum Wrap-Up
Jeff Hammond
jeff.science at gmail.com
Mon Dec 14 09:58:01 CST 2015
On Mon, Dec 14, 2015 at 7:11 AM, Bland, Wesley <wesley.bland at intel.com>
wrote:
> Hi Jeff. Thanks for the comments.
>
> On Dec 14, 2015, at 8:19 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
>
>
>
> On Fri, Dec 11, 2015 at 2:09 PM, Bland, Wesley <wesley.bland at intel.com> wrote:
> Hi WG,
>
> I’ve put together some notes on the goings on around our working group at
> the forum. You can find them all on the wiki page:
>
> https://github.com/mpiwg-ft/ft-issues/wiki/2015-12-07
>
> Since I know that the click-through is not always practical, I’ll copy
> them below.
>
> Thanks,
> Wesley
>
> ====
>
> WG Meeting
>
> * Went over reading for plenary time
> * Aurelien and Keita presented some of the results of the ULFM BoF at
> SC
> * Attendance was great
> * There were a few questions and suggestions to improve the
> proposal.
> * Aurelien is creating issues for the suggestions that we will act on.
> * We discussed an overall view of what fault tolerance and error
> handling mean in the context of MPI and how we cover each area as a
> standard.
> * We divided applications into a few buckets:
> * Current applications - This describes the vast majority of
> applications, which require that all processes remain alive and where
> recovery tends to be global.
> * These apps tend to use/require recovery very similar to
> checkpoint/restart.
> * They probably don't derive a lot of benefit from ULFM-like
> recovery models, but could potentially benefit from improved error handlers.
>
> Just remember that there can be ULFM + { hot spares -or-
> MPI_Comm_spawn(_multiple) } + checkpoint-restart, which preserves the size
> of the job.
>
> Lots of those fault-intolerant apps can do something like this with
> minimal changes. The apps that cannot use ULFM are the ones that require
> expensive message logging to be able to roll back to a coherent state.
>
> Agreed. We had some discussion of this in the room. The gist of the
> conversation was that between ULFM and existing C/R, most of these cases
> are covered in a reasonable way.
>
>
Sorry I missed this discussion f2f. I had a reason, but I cannot remember
what it was.
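For concreteness, the spawn-based repair I have in mind looks roughly like
the sketch below. It assumes the MPIX_-prefixed names from the ULFM
prototype, and "./myapp" is just a placeholder for the application binary;
the exact bindings the Forum eventually adopts, and the data redistribution
from a checkpoint, are left out.

    /* Sketch: restore the original job size after a failure, assuming the
     * MPIX_ names from the ULFM prototype.  Error checking omitted. */
    MPI_Comm shrunk, intercomm, repaired;
    int old_size, new_size, nfailed;

    MPI_Comm_size(comm, &old_size);

    /* Drop the failed processes. */
    MPIX_Comm_shrink(comm, &shrunk);
    MPI_Comm_size(shrunk, &new_size);
    nfailed = old_size - new_size;

    /* Launch replacements and merge them back into one intracommunicator. */
    MPI_Comm_spawn("./myapp", MPI_ARGV_NULL, nfailed, MPI_INFO_NULL,
                   0, shrunk, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(intercomm, 0 /* survivors in the low group */, &repaired);

The spawned processes do the matching MPI_Comm_get_parent plus
MPI_Intercomm_merge on their side, and everyone calls MPI_Comm_split
afterwards if the original rank numbering matters. The job is back to its
original size; the data for the replaced ranks still has to come from a
checkpoint, in memory or on disk.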
>
> * In-memory Checkpoint/Restart - These apps can use in-memory
> checkpointing to improve both checkpoint and recovery times. They usually
> need to replace failed processes, but don't require that all remain alive.
> * ULFM is a possibility here, but can result in bad locality
> without a library which will automatically move processes around after a
> failure.
>
> Maybe. I'm not sure good locality exists on fat-tree and dragonfly
> topologies. And users can always renumber their ranks and move data around
> if they know topological placement matters.
>
> That’s true. What we were talking about was having MPI be able to do that
> for you. It’s always possible for applications to do this themselves (as
> Aurelien pointed out in the room), but sometimes it's such a pain that if
> MPI already has that information, it might be better to just provide it.
>
>
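To make the "renumber their ranks" option concrete, here is a minimal
sketch of what an application can already do today with MPI_Comm_split,
assuming it has some per-process locality key. How that key is obtained
(hostname hashing, MPI_Comm_split_type, a resource-manager query) is
application-specific; compute_locality_key is a hypothetical helper and
repaired_comm stands for whatever communicator the recovery produced.

    /* Sketch: reorder ranks so that processes with nearby locality keys
     * end up with adjacent ranks in the repaired communicator. */
    int locality_key = compute_locality_key();   /* hypothetical helper */
    MPI_Comm reordered;

    /* One color keeps everyone together; the key defines the new rank
     * order (ties broken by the old rank). */
    MPI_Comm_split(repaired_comm, 0, locality_key, &reordered);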
> * Reinit / multi-init/finalize with improved PMPI would also
> work. There are current and past proposals that could provide the needed
> functionality. In these proposals, most of the locality problems would
> probably be pushed into the MPI library when it is initialized again.
> * New applications - These apps tend to be able to run with
> fewer processes. They cover apps like tasking models, master/worker apps,
> and traditionally non-MPI apps that might be interested in using MPI in
> the future (Hadoop, etc.).
>
> Lots of current applications are master-worker…
>
> Sure. New here could just mean less than 30 years old. :)
>
>
Are you calling me old? :-)
>
> * ULFM generally would apply well to these applications as
> locality is less important if processes are not being replaced.
> * There are also errors that do not include process failures:
> * Memory errors
>
> It is useful to distinguish causes and effects here. Memory errors are a
> cause. Process failure is one effect. Another effect is non-fatal data
> corruption, which may or may not be silent. Currently, we see that memory
> errors that are detected manifest as process failures, but hopefully
> someday the OS/RT people will figure out better things to do than just call
> abort(). Ok, to be fair, it's probably the firmware/driver people throwing
> the self-destruct lever…
>
> I should have been more precise. The errors we were talking about here are
> the ones that don’t result in process failure. We’re talking more about SDC
> types of errors.
>
Ah ok. Thanks.
>
>
> * These could be detected by anything, but ULFM revoke could
> help with notification.
> * Lots of SDC research is out there that sits on top of MPI.
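As a concrete, if simplified, example of revoke-as-notification (again
assuming the MPIX_ names from the ULFM prototype, with work_comm,
local_corruption_detected, and handle_sdc as hypothetical application
pieces): the rank that detects corruption revokes the communicator, and
every other rank eventually sees MPIX_ERR_REVOKED on its pending or future
operations and can enter a user-level verification/rollback protocol.

    /* Sketch: use revocation as an out-of-band "something is wrong" signal.
     * work_comm needs MPI_ERRORS_RETURN (or a custom handler) installed so
     * that errors are returned instead of aborting the job. */
    MPI_Status status;
    if (local_corruption_detected) {             /* hypothetical predicate */
        MPIX_Comm_revoke(work_comm);
    }

    int rc = MPI_Recv(buf, count, MPI_DOUBLE, src, tag, work_comm, &status);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_REVOKED) {
            /* Every rank lands here eventually; run the application-level
             * verification/rollback, then rebuild the communicator. */
            handle_sdc(work_comm);               /* hypothetical hook */
        }
    }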
> * Network errors
>
> I am curious how often this actually happens anymore. Aren't all modern
> networks capable of routing around permanently failed links? A switch
> failure should be fatal but that happens how often?
>
> Agreed. That’s why we didn’t focus on it too much. This is something that
> we generally push down to the implementation (or lower).
>
>
> * These tend to be masked by the implementation or promoted
> to process failures
> * Resource exhaustion
> * These sorts of errors cover running out of memory, running out
> of context IDs, etc.
> * They can be improved with better error
> handlers/codes/classes
>
> Yes, and this will be wildly useful. Lots of users lose jobs because they
> do dumb stuff with communicators that could be mitigated with slower
> fallback implementations over p2p (please ignore the apparent
> false-dichotomy here).
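To illustrate the kind of mitigation I mean, nothing here is new API, just
MPI-3 error handling: an application that expects to run close to the
context-ID or memory limits can switch off the default abort-on-error
behavior and fall back when duplication fails. (Strictly speaking, MPI
today only says the state of MPI is undefined after such an error, which
is exactly what the catastrophic/non-catastrophic distinction below tries
to improve.)

    /* Sketch: survive a resource-exhaustion error instead of aborting.
     * Standard MPI-3 calls only; no FT extensions involved. */
    MPI_Comm dup;
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rc = MPI_Comm_dup(MPI_COMM_WORLD, &dup);
    if (rc != MPI_SUCCESS) {
        int eclass, len;
        char msg[MPI_MAX_ERROR_STRING];
        MPI_Error_class(rc, &eclass);
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Comm_dup failed (%s); falling back to tagged "
                        "point-to-point on MPI_COMM_WORLD\n", msg);
        dup = MPI_COMM_WORLD;   /* the "slower fallback" path */
    }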
>
> * Discussed some new topics related to error handling and error
> codes/classes
> * Pavan expressed interest in error codes indicating whether they were
> catastrophic or not.
> * This resulted in mpi-forum/mpi-issues#28
> (https://github.com/mpi-forum/mpi-issues/issues/28), where we add a new
> call MPI_ERROR_IS_CATASTROPHIC.
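For reference, this is how I imagine an application consuming that call.
The binding below is purely hypothetical; issue #28 names the call, but as
far as I know the signature is not settled.

    /* Hypothetical sketch only: MPI_Error_is_catastrophic is the proposed
     * call from mpi-forum/mpi-issues#28; the real binding may differ. */
    int rc = MPI_Send(buf, count, MPI_DOUBLE, dest, tag, comm);
    if (rc != MPI_SUCCESS) {
        int fatal = 0;
        MPI_Error_is_catastrophic(rc, &fatal);   /* hypothetical binding */
        if (fatal) {
            /* State of MPI is undefined: flush application state and exit. */
            MPI_Abort(MPI_COMM_WORLD, rc);
        } else {
            /* MPI is still defined: retry, degrade, or report and continue. */
        }
    }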
>
> Plenary Time
> (https://github.com/mpiwg-ft/ft-issues/wiki/2015-12-07#plenary-time)
>
> * Read the error handler cleanup tickets mpi-forum/mpi-issues#1
> (https://github.com/mpi-forum/mpi-issues/issues/1) and mpi-forum/mpi-issues#3
> (https://github.com/mpi-forum/mpi-issues/issues/3).
> * The forum didn't like that we removed all of the text about
> general errors. They considered some of it still valuable and felt it
> should be updated rather than removed. In particular, the example about
> MPI_RSEND could still be applicable if the implementation decides that it
> wants to return an error to the user because the MPI_RECV was not posted.
> * We need to add text for MPI_INTERCOMM_CREATE.
> * A few other minor things were added directly to the pull request.
> * Read the MPI_COMM_FREE advice ticket.
> * No concerns, will vote at next meeting.
> * Presented the plenary about catastrophic errors.
> * A few concerns were raised during the plenary. The main one was
> from Bill, who said we should look at how other standards describe
> non-fatal errors when writing the text here.
>
> I'm skeptical that this is going to help us, but here are some references:
> - Fortran 2008 section 14.6 Halting (not going to be useful to us,
> although its utility for users is demonstrated in NOTE 14.16); section
> 2.3.5 describes how any error propagates to all images (in a coarray
> program). There is no notion of fault-tolerance here.
> - UPC and OpenSHMEM say nothing about fault-tolerance. I suspect that UPC
> never will and OpenSHMEM will try to learn from the MPI Forum.
> - C++14 draft (N3936) chapter 19 is all about exceptions and errors.
> 30.2.2 talks about thread failure.
> - IB 1.0 7.12.2 and 7.12.3, among other places.
>
> Most specifications that I've read don't have all of the baggage we do,
> but only because most of their objects are not so stateful or long-lived.
> Now, if MPI were based upon connections rather than communicators…
>
> Thanks for the pointers. Maybe after the new year, we can do some homework
> and see what lessons we can take from here. Note that we weren’t talking
> about (non)catastrophic in terms of fault tolerance. It was mostly just for
> correct error handling and to let MPI remain defined in a few more error
> states than it already does (none).
>
>
More than zero is never a bad thing. :-)
Jeff
> Thanks,
> Wesley
>
>
> * Ryan asked about the general usefulness of this proposal in terms
> of how an application would be able to respond to information about whether
> an error is fatal or not.
> * He asserted that error classes should generally be descriptive
> enough without it, and if they aren't, the error class itself should be
> improved.
>
> Best,
>
> Jeff
>
--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/