[mpiwg-ft] MPI Forum Wrap-Up

Aurélien Bouteiller bouteill at icl.utk.edu
Mon Dec 14 08:38:00 CST 2015


--
Aurélien Bouteiller, Ph.D. ~~ https://icl.cs.utk.edu/~bouteill/
> On 14 Dec 2015, at 09:19, Jeff Hammond <jeff.science at gmail.com> wrote:
> 
> 
> 
> On Fri, Dec 11, 2015 at 2:09 PM, Bland, Wesley <wesley.bland at intel.com> wrote:
> Hi WG,
> 
> I’ve put together some notes on the goings on around our working group at the forum. You can find them all on the wiki page:
> 
> https://github.com/mpiwg-ft/ft-issues/wiki/2015-12-07
> 
> Since I know that the click-through is not always practical, I’ll copy them below.
> 
> Thanks,
> Wesley
> 
> ====
> 
> WG Meeting
> 
>   *   Went over reading for plenary time
>   *   Aurelien and Keita presented some of the results of the ULFM BoF at SC
>      *   Attendance was great
>      *   There were a few questions and suggestions to improve the proposal.
>         *   Aurelien is creating issues for the suggestions that we will act on.
>   *   We discussed an overall view of what fault tolerance and error handling means in the context of MPI and how we cover each area as a standard
>      *   We divided applications into a few buckets:
>         *   Current applications - This describes the vast majority of applications, which require that all processes remain alive; recovery tends to be global.
>            *   These apps tend to use/require recovery very similar to checkpoint/restart.
>            *   They probably don't derive a lot of benefit from ULFM-like recovery models, but could potentially benefit from improved error handlers.
> 
> Just remember that there can be ULFM + { hot spares -or- MPI_Comm_spawn(_multiple) } + checkpoint-restart, which preserves the size of the job.
> 
Totally agree with you. The point was also made that for some of these I/O-based checkpointing apps, the existing MPI standard, or an only slightly modified one, could be enough, which is also true (as long as you stay with I/O-based checkpointing).
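The "ULFM + spares + checkpoint-restart preserves the job size" pattern can be sketched abstractly. The following is a pure-Python model of the bookkeeping an application would do, not the MPI API: in a real code the shrink would be MPIX_Comm_shrink and the respawn MPI_Comm_spawn, and the function and variable names here are invented for illustration.

```python
# Toy model: a job of N worker ranks plus a few idle hot spares.
# After a failure, survivors keep their ranks and spares are promoted,
# so the working set stays the same size (Jeff's point above).

def recover(workers, spares, failed):
    """Drop failed ranks from the worker set and promote spares."""
    survivors = [r for r in workers if r not in failed]
    needed = len(workers) - len(survivors)
    if needed > len(spares):
        # out of spares: a real code would fall back to MPI_Comm_spawn
        raise RuntimeError("not enough spares")
    promoted, remaining = spares[:needed], spares[needed:]
    return survivors + promoted, remaining

workers = list(range(8))   # ranks doing real work
spares = [8, 9]            # idle hot-spare ranks
workers, spares = recover(workers, spares, failed={3})
assert len(workers) == 8   # job size preserved across the failure
```

After the promotion step, the application reloads the last checkpoint on the reshaped communicator, exactly as it would after an I/O-based restart, but without losing the job allocation.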

> Lots of those fault-intolerant apps can do something like this with minimal changes.  The apps that cannot use ULFM are the ones that require expensive message logging to be able to roll-back to a coherent state.

There has been some work to provide automatic C/R with message logging on top of ULFM, so that's also possible. (The paper claims to beat our own implementation of message logging in Open MPI when it compares to O2P, and that's actually pretty good, if the experiments are conducted correctly, because O2P is fast.)
http://link.springer.com/chapter/10.1007%2F978-3-319-03859-9_27

>  
>         *   In-memory Checkpoint/Restart - These apps can use in-memory checkpointing to improve both checkpoint and recovery times. They usually need to replace failed processes, but don't require that all remain alive.
>            *   ULFM is a possibility here, but can result in bad locality without a library which will automatically move processes around after a failure.
> 
> Maybe.  I'm not sure good locality exists on fat-tree and dragonfly topologies.  And users can always renumber their ranks and move data around if they know topological placement matters.
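The in-memory checkpointing scheme discussed here can be modeled abstractly too. This is a toy buddy-checkpointing sketch, not an MPI API: each rank stores a copy of its partner's state, so a single failure is repaired from the surviving copy; the partner choice below is illustrative.

```python
# Buddy checkpointing: pair each rank with the rank half the job away,
# so a failure of one rank leaves its state alive on the buddy.

def buddy(rank, n):
    """Partner of `rank` in a job of (even) size n."""
    return (rank + n // 2) % n

def recoverable(failed, n):
    """A failed rank's state survives iff its buddy is still alive."""
    return all(buddy(f, n) not in failed for f in failed)

n = 8
assert buddy(2, n) == 6 and buddy(6, n) == 2  # pairing is symmetric
assert recoverable({2}, n)                    # single failure: fine
assert not recoverable({2, 6}, n)             # buddy pair lost together
```

The locality concern above maps onto the buddy choice: partners should be far enough apart to avoid correlated failures, but replacements should land close to the data they will reload.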
>  
>            *   Reinit / multi-init/finalize with improved PMPI would also work. There are proposals, ongoing or past, that could provide the needed functionality. In these proposals, most of the locality problems would probably be pushed into the MPI library when it is initialized again.
>         *   New applications - These apps tend to be able to run with fewer processes. They cover apps like tasking models, master/worker apps, and traditionally non-MPI apps that might be interested in the future (Hadoop, etc.).
> 
> Lots of current applications are master-worker...
>  
>            *   ULFM generally would apply well to these applications as locality is less important if processes are not being replaced.
>      *   There are also errors that do not include process failures:
>         *   Memory errors
> 
> It is useful to distinguish causes and effects here.  Memory errors are a cause.  Process failure is one effect.  Another effect is non-fatal data corruption, which may or may not be silent.  Currently, we see that memory errors that are detected manifest as process failures, but hopefully someday the OS/RT people will figure out better things to do than just call abort().  Ok, to be fair, it's probably the firmware/driver people throwing the self-destruct lever...
>  
>            *   These could be detected by anything, but ULFM revoke could help with notification.
>            *   Lots of SDC research is out there that sits on top of MPI.
>         *   Network errors
> 
> I am curious how often this actually happens anymore.  Aren't all modern networks capable of routing around permanently failed links?  A switch failure should be fatal but that happens how often?
>  
>            *   These tend to be masked by the implementation or promoted to process failures
>         *   Resource exhaustion
>            *   These sorts of errors cover out of memory, out of context IDs, etc.
>            *   They can be improved with better error handlers/codes/classes
> 
> Yes, and this will be wildly useful.  Lots of users lose jobs because they do dumb stuff with communicators that could be mitigated with slower fallback implementations over p2p (please ignore the apparent false-dichotomy here).
>  
>   *   Discussed some new topics related to error handling and error codes/classes
>      *   Pavan expressed interest in error codes saying whether they were catastrophic or not.
>         *   This resulted in mpi-forum/mpi-issues#28 (https://github.com/mpi-forum/mpi-issues/issues/28), where we add a new call MPI_ERROR_IS_CATASTROPHIC.
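The semantics behind the proposed MPI_ERROR_IS_CATASTROPHIC query can be sketched as a small simulation. The error classes and their classification below are invented for illustration; the real standard would define which classes leave MPI in a usable state.

```python
# Sketch of the mpi-forum/mpi-issues#28 idea: after a non-fatal error
# the library is still usable; after a catastrophic one it is not.
# These class names and the split between the two sets are hypothetical.

CATASTROPHIC = {"ERR_INTERN", "ERR_OTHER"}        # MPI state undefined
RECOVERABLE = {"ERR_NO_MEM", "ERR_PROC_FAILED"}   # MPI still usable

def error_is_catastrophic(errclass):
    """Model of the proposed query: can the app keep using MPI?"""
    if errclass not in CATASTROPHIC | RECOVERABLE:
        raise ValueError("unknown error class")
    return errclass in CATASTROPHIC

# An error handler could branch on the answer instead of hard-coding
# knowledge about every error class:
assert not error_is_catastrophic("ERR_NO_MEM")  # e.g. free memory, retry
assert error_is_catastrophic("ERR_INTERN")      # e.g. flush logs, abort
```

This also frames Ryan's later objection: if the class itself already tells you whether recovery is possible, the extra query adds little; the query only pays off when the same class can be fatal in one implementation and non-fatal in another.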
> 
> Plenary Time (https://github.com/mpiwg-ft/ft-issues/wiki/2015-12-07#plenary-time)
> 
>   *   Read the error handler cleanup tickets mpi-forum/mpi-issues#1 (https://github.com/mpi-forum/mpi-issues/issues/1) and mpi-forum/mpi-issues#3 (https://github.com/mpi-forum/mpi-issues/issues/3).
>      *   The forum didn't like where we removed all of the text about general errors. They considered some of it to still be valuable and should be updated. In particular, the example about MPI_RSEND could still be applicable if the implementation decides that it wants to return an error to the user because the MPI_RECV was not posted.
>      *   We need to add text for MPI_INTERCOMM_CREATE.
>      *   A few other minor things were added directly to the pull request.
>   *   Read the MPI_COMM_FREE advice ticket.
>      *   No concerns, will vote at next meeting.
>   *   Presented the plenary about catastrophic errors.
>      *   A few concerns were raised during the plenary. The main one was from Bill, who says we should look at how other standards describe non-fatal errors when writing the text here.
> 
> I'm skeptical that this is going to help us, but here are some references:
> - Fortran 2008 section 14.6 Halting (not going to be useful to us, although its utility for users is demonstrated in NOTE 14.16); section 2.3.5 describes how any error propagates to all images (in a coarray program).  There is no notion of fault-tolerance here.
> - UPC and OpenSHMEM say nothing about fault-tolerance.  I suspect that UPC never will and OpenSHMEM will try to learn from the MPI Forum.
> - C++14 draft (N3936) chapter 19 is all about exceptions and errors.  30.2.2 talks about thread failure.
> - IB 1.0 7.12.2 and 7.12.3, among other places.
> 
> Most specifications that I've read don't have all of the baggage we do, but only because most of their objects are not so stateful or long-lived.  Now, if MPI was based upon connections rather than communicators…

Thanks for the links, very useful! 

Aurélien



>  
>      *   Ryan asked about the general usefulness of this proposal in terms of how an application would be able to respond to information about whether an error is fatal or not.
>         *   He asserts that error classes should generally be descriptive enough without it and if they aren't, the error class itself should be improved.
> 
> Best,
> 
> Jeff
>  
> _______________________________________________
> mpiwg-ft mailing list
> mpiwg-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
> 
> 
> -- 
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/

