[mpiwg-ft] 2017 09 27 Meeting Notes
wesley.bland at intel.com
Wed Sep 27 14:52:46 CDT 2017
Here are the notes from today's WG conference call.
* Intel - Wesley
* Argonne - Ken, Yanfei
* UTK - Aurelien
* ORNL - Geoffroy
* Wesley made edits based on the feedback from the face-to-face.
* There are still a couple of very minor edits that need to be made
Process Fault Tolerance
* Is it possible to use ULFM and Reinit at the same time?
* It is not clear how they could be composed, even if only the smaller communicator used ULFM: after a process failure, the error handler on the larger communicator would still likely be triggered, which would in turn trigger reinit.
* We don't think using error handlers is a problem, but MPI_ERRORS_REINIT, if used, would need to be set consistently across all communicators.
* We still prefer using error handlers to adding an API call:
* It doesn't create a new API.
* Changing the error handler is already required for process fault tolerance anyway.
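The composability concern above can be sketched as a small model in plain C. This is only an illustration under stated assumptions, not MPI code: the `model_errhandler` enum, the `model_comm` struct, and both functions are hypothetical names invented for this sketch, and `MODEL_ERRORS_REINIT` merely stands in for the proposed MPI_ERRORS_REINIT. It assumes that a process failure triggers the handler of every communicator containing the failed process.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handler kinds for this model only. MODEL_ERRORS_REINIT stands in
   for the proposed MPI_ERRORS_REINIT; none of these are real MPI constants. */
typedef enum {
    MODEL_ERRORS_ARE_FATAL,
    MODEL_ERRORS_RETURN,   /* what a ULFM-style communicator would typically use */
    MODEL_ERRORS_REINIT
} model_errhandler;

/* Hypothetical description of one communicator. */
typedef struct {
    const char *name;
    model_errhandler handler;
    bool contains_failed_process;
} model_comm;

/* True if a failure would trigger reinit: some communicator that contains the
   failed process has the reinit handler attached. */
bool failure_triggers_reinit(const model_comm *comms, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (comms[i].contains_failed_process &&
            comms[i].handler == MODEL_ERRORS_REINIT) {
            return true;
        }
    }
    return false;
}

/* True if the reinit handler is used either on every communicator or on none. */
bool reinit_handlers_consistent(const model_comm *comms, size_t n)
{
    size_t reinit_count = 0;
    for (size_t i = 0; i < n; i++) {
        if (comms[i].handler == MODEL_ERRORS_REINIT) {
            reinit_count++;
        }
    }
    return reinit_count == 0 || reinit_count == n;
}
```

For example, a world communicator with the reinit handler plus a sub-communicator with the return handler, both containing the failed process, makes `failure_triggers_reinit` return true and `reinit_handlers_consistent` return false, which is the mixed case the group was unsure how to support.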
* Aurelien - Write first draft of ULFM composability/recovery advice to have libraries repair MPI in one place.
* Aurelien - Merge MPI_COMM_ISHRINK branch
* Aurelien - Go back over other ULFM branches so we can discuss them next time
* Wesley - Go back through ULFM RMA discussions to see what we need to do (if anything) to move forward.
* Wesley - Improve slides for catastrophic errors to include example use cases