[mpiwg-ft] MPI Forum Wrap-Up
wesley.bland at intel.com
Fri Dec 11 16:09:50 CST 2015
I’ve put together some notes on the goings-on around our working group at the forum. You can find them all on the wiki page:
Since I know that the click-through is not always practical, I’ll copy them below.
* Went over the reading for plenary time
* Aurelien and Keita presented some of the results of the ULFM BoF at SC
* Attendance was great
* There were a few questions and suggestions to improve the proposal.
* Aurelien is creating issues for the suggestions that we will act on.
* We discussed an overall view of what fault tolerance and error handling mean in the context of MPI and how we cover each area as a standard
* We divided applications into a few buckets:
* Current applications - This describes the vast majority of applications, which require that all processes remain alive and whose recovery tends to be global.
* These apps tend to use/require recovery very similar to checkpoint/restart.
* They probably don't derive a lot of benefit from ULFM-like recovery models, but could potentially benefit from improved error handlers.
* In-memory Checkpoint/Restart - These apps can use in-memory checkpointing to improve both checkpoint and recovery times. They usually need to replace failed processes, but don't require that all remain alive.
* ULFM is a possibility here, but can result in bad locality without a library which will automatically move processes around after a failure.
* Reinit / multi-init/finalize with improved PMPI would also work. Some current and past proposals could also provide the needed functionality. In these proposals, most of the locality problems would probably be pushed into the MPI library when it is initialized again.
* New applications - These apps tend to be able to run with fewer processes. They cover apps like tasking models, master/worker apps, and traditionally non-MPI apps that might be interested in the future (Hadoop, etc.).
* ULFM generally would apply well to these applications as locality is less important if processes are not being replaced.
* There are also errors that do not include process failures:
* Memory errors
* These could be detected by anything, but ULFM revoke could help with notification.
* Lots of SDC research is out there that sits on top of MPI.
* Network errors
* These tend to be masked by the implementation or promoted to process failures
* Resource exhaustion
* These sorts of errors cover out of memory, out of context IDs, etc.
* They can be improved with better error handlers/codes/classes
* Discussed some new topics related to error handling and error codes/classes
* Pavan expressed interest in error codes saying whether they were catastrophic or not.
* This resulted in mpi-forum/mpi-issues#28<https://github.com/mpi-forum/mpi-issues/issues/28> where we add a new call MPI_ERROR_IS_CATASTROPHIC.
* Read the error handler cleanup tickets mpi-forum/mpi-issues#1<https://github.com/mpi-forum/mpi-issues/issues/1> and mpi-forum/mpi-issues#3<https://github.com/mpi-forum/mpi-issues/issues/3>.
* The forum didn't like that we removed all of the text about general errors. They considered some of it still valuable and thought it should be updated instead. In particular, the example about MPI_RSEND could still be applicable if the implementation decides that it wants to return an error to the user because the MPI_RECV was not posted.
* We need to add text for MPI_INTERCOMM_CREATE.
* A few other minor things were added directly to the pull request.
* Read the MPI_COMM_FREE advice ticket.
* No concerns, will vote at next meeting.
* Presented the plenary about catastrophic errors.
* Few concerns were raised during the plenary. The main one was from Bill, who said we should look at how other standards describe non-fatal errors when writing the text here.
* Ryan asked about the general usefulness of this proposal in terms of how an application would be able to respond to information about whether an error is fatal or not.
* He asserted that error classes should generally be descriptive enough without it and that, if they aren't, the error class itself should be improved.
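
To make Ryan's question a bit more concrete, here is a rough sketch of how an application might choose a response from an error class plus a "catastrophic" flag. Everything in it is hypothetical: the `sketch_*` names are illustrative stand-ins rather than MPI constants or calls, and the `catastrophic` argument only models the kind of answer a query like the proposed MPI_ERROR_IS_CATASTROPHIC might return, whose actual semantics are not settled in these notes.

```c
#include <stdbool.h>

/* Hypothetical error classes; these are NOT real MPI constants. */
typedef enum {
    SKETCH_ERR_NONE,        /* no error reported */
    SKETCH_ERR_RESOURCE,    /* e.g. out of memory or out of context IDs */
    SKETCH_ERR_PROC_FAILED, /* a peer process has failed */
    SKETCH_ERR_UNKNOWN      /* class carries no useful detail */
} sketch_err_class;

/* Hypothetical responses an application might pick. */
typedef enum {
    SKETCH_CONTINUE,          /* nothing to do */
    SKETCH_RELEASE_AND_RETRY, /* free resources, then retry the call */
    SKETCH_RECOVER,           /* application-level recovery (e.g. ULFM-style) */
    SKETCH_ABORT              /* give up */
} sketch_action;

/*
 * Application policy. `catastrophic` is an assumed input that models
 * what a query such as the proposed MPI_ERROR_IS_CATASTROPHIC might say.
 */
sketch_action sketch_choose_action(sketch_err_class cls, bool catastrophic)
{
    if (cls == SKETCH_ERR_NONE) {
        return SKETCH_CONTINUE;
    }
    if (catastrophic) {
        /* Assumed meaning: no further recovery is possible. */
        return SKETCH_ABORT;
    }
    switch (cls) {
    case SKETCH_ERR_RESOURCE:
        return SKETCH_RELEASE_AND_RETRY;
    case SKETCH_ERR_PROC_FAILED:
        return SKETCH_RECOVER;
    default:
        /* An undescriptive class leaves the application little choice. */
        return SKETCH_ABORT;
    }
}
```

In this sketch the flag only changes the outcome for classes that would otherwise be recoverable, and the default branch shows where an undescriptive class forces an abort, which is roughly the case where Ryan suggests improving the class itself.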