[mpiwg-ft] 2017 09 27 Meeting Notes

Bland, Wesley wesley.bland at intel.com
Wed Sep 27 14:52:46 CDT 2017

Here are the notes from today's WG conference call.



  *   Intel - Wesley
  *   Argonne - Ken, Yanfei
  *   UTK - Aurelien
  *   ORNL - Geoffroy

Error Handlers

  *   Wesley made edits based on the feedback from the face-to-face.
     *   A couple of very minor edits still need to be made.

Process Fault Tolerance

  *   Is it possible to use ULFM and Reinit at the same time?
     *   It is unclear how they can be composed. Even if the smaller communicator used ULFM, the error handler for the larger communicator is still likely to be triggered after a process failure, which would trigger Reinit.
  *   We don't think it's a problem to use error handlers, but MPI_ERRORS_REINIT, if used, would need to be set consistently across all communicators.
     *   We still like using error handlers better than an API call.
        *   It doesn't create a new API.
        *   Changing the error handler is already required for process fault tolerance anyway.
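
The composability concern above can be sketched with a small toy model. This is only an illustration under stated assumptions, not MPI code: the `Comm` class, the handler labels, and the rule that a single Reinit-style handler forces a global restart are all hypothetical stand-ins for the behavior discussed on the call.

```python
from dataclasses import dataclass

# Handler names are plain labels for this toy model, not MPI constants.
RETURN = "errors_return"  # ULFM-style: report the error to the caller
REINIT = "errors_reinit"  # Reinit-style: roll the whole job back


@dataclass(frozen=True)
class Comm:
    name: str
    ranks: frozenset
    handler: str


def triggered_handlers(comms, failed_rank, used):
    """Map each communicator in `used` that contains the failed rank to its handler."""
    return {
        c.name: c.handler
        for c in comms
        if c.name in used and failed_rank in c.ranks
    }


def job_outcome(comms, failed_rank, used):
    """Assumption: a global reinit wins as soon as any triggered handler asks for it."""
    fired = triggered_handlers(comms, failed_rank, used)
    if REINIT in fired.values():
        return "reinit"
    return "local_recovery" if fired else "unaffected"


world = Comm("world", frozenset(range(4)), REINIT)
sub = Comm("sub", frozenset({0, 1}), RETURN)

# Mixed handlers: the sub-communicator wants local recovery, the larger one reinit.
mixed = job_outcome([world, sub], failed_rank=1, used={"world", "sub"})
# Consistent reinit on every communicator.
consistent = job_outcome(
    [world, Comm("sub", sub.ranks, REINIT)], failed_rank=1, used={"world", "sub"}
)
# Consistent ULFM-style handlers on every communicator.
ulfm_only = job_outcome(
    [Comm("world", world.ranks, RETURN), sub], failed_rank=1, used={"world", "sub"}
)
print(mixed, consistent, ulfm_only)
```

In this model the mixed configuration ends in a global reinit, so the ULFM-style handler on the smaller communicator never gets a chance to recover locally. That is the reason the handler choice would need to be consistent across all communicators.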

TODO Items

  *   Aurelien - Write first draft of ULFM composability/recovery advice to have libraries repair MPI in one place.
  *   Aurelien - Merge MPI_COMM_ISHRINK branch
  *   Aurelien - Go back over other ULFM branches so we can discuss them next time
  *   Wesley - Go back through ULFM RMA discussions to see what we need to do (if anything) to move forward.
  *   Wesley - Improve slides for catastrophic errors to include example use cases