[mpiwg-ft] 2017-09-13 FTWG Con Call Notes

Bland, Wesley wesley.bland at intel.com
Wed Sep 13 15:13:27 CDT 2017


The notes for today's call are now posted on the wiki (https://github.com/mpiwg-ft/ft-issues/wiki/2017-09-13). I'm also going to copy them here to make it easier for those who are keeping up only via the mailing lists.

For next week's F2F, if there's anything you'd like to add to the agenda, please let me know. There will also be a WebEx for us to use; the information is posted on the meeting website (http://mpi-forum.org/meetings/2017/09/agenda). If you're planning to call in for a particular part and want us to cover something first or last, let me know. We'll do our best to accommodate, but I expect most of our time will be spent on process fault tolerance, so it might be hard to carve out time for specific pieces.

Thanks,
Wesley

====
Agenda for F2F (https://github.com/mpiwg-ft/ft-issues/wiki/2017-09-13#agenda-for-f2f)

  *   WG Time
     *   Briefly? discuss catastrophic errors again
     *   Move forward on process failure (ULFM, Reinit, etc.)
  *   Reading
     *   Read error handlers

Con Call Notes (https://github.com/mpiwg-ft/ft-issues/wiki/2017-09-13#con-call-notes)
Error Handlers (https://github.com/mpiwg-ft/ft-issues/wiki/2017-09-13#error-handlers)

  *   Went over slides and PDF for reading
     *   Want to change one sentence in the advice to implementors in Section 8.2.
     *   This should be a small enough change to be acceptable. Will point it out separately.

Catastrophic Errors (https://github.com/mpiwg-ft/ft-issues/wiki/2017-09-13#catastrophic-errors)

  *   Discussed current proposal and decided that we're still happy with it.
  *   The global (process-wide rather than per-thread) state returned by MPI_GET_STATE is okay because if any thread hits a catastrophic error, all threads are catastrophic and can't recover anyway.
     *   If you're checking the state, you're probably doing it inside an error handler, so you'll already know which error code to look for to find out about the error (see the sketch after this list).
  *   Bill Gropp asked us to look at how POSIX handles errors, but it's difficult to replicate that in MPI because MPI has to maintain much more state, spread across multiple processes. POSIX is more local and stateless (or the state lives in the user's data).
  *   We might end up needing more error classes so we can give the user specific information about errors.
  *   Might be ready to move forward on a December reading here.
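
For the error-handler point above, here is a rough C sketch of what user code might look like. MPI_GET_STATE has no agreed binding yet, so the query below (query_mpi_state, STATE_OK, STATE_CATASTROPHIC) is a stubbed placeholder rather than the proposal's actual interface; only the error-handler plumbing (MPI_Comm_create_errhandler / MPI_Comm_set_errhandler) is standard MPI.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the proposed MPI_GET_STATE query.  The real name, binding,
     * and state constants are still open; this stub only makes the sketch
     * self-contained. */
    enum { STATE_OK = 0, STATE_CATASTROPHIC = 1 };
    static int query_mpi_state(int *state) { *state = STATE_OK; return MPI_SUCCESS; }

    /* The error handler is the natural place to query the state: the error code
     * passed in tells you what went wrong, and the state tells you whether any
     * recovery is even possible. */
    static void comm_errhandler(MPI_Comm *comm, int *errcode, ...)
    {
        int state;
        (void)comm;
        query_mpi_state(&state);   /* global state, per the WG discussion */
        if (state == STATE_CATASTROPHIC) {
            /* If any thread is catastrophic, all threads are: nothing can be
             * recovered, so report and abort. */
            fprintf(stderr, "catastrophic MPI error (code %d), aborting\n", *errcode);
            abort();
        }
        /* Non-catastrophic: attempt class-specific recovery based on *errcode. */
    }

    int main(int argc, char **argv)
    {
        MPI_Errhandler eh;
        MPI_Init(&argc, &argv);
        MPI_Comm_create_errhandler(comm_errhandler, &eh);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, eh);
        /* ... application communication ... */
        MPI_Errhandler_free(&eh);
        MPI_Finalize();
        return 0;
    }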

Process Failure (https://github.com/mpiwg-ft/ft-issues/wiki/2017-09-13#process-failure)

  *   Aurelien proposed adding an MPI_COMM_REVOKE_ALL function to resolve the deadlock problem with overlapping communicators (sketched below).
     *   Others were skeptical: whenever communicators overlap, you might always have to assume that all of them need to be revoked after a failure.
     *   Aurelien asserted that concurrent communication on overlapping communicators is not common, so the cost might not be as bad as we think.
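
To make the overlapping-communicator deadlock concrete, here is a small ULFM-style sketch. MPIX_Comm_revoke, MPIX_ERR_PROC_FAILED, and <mpi-ext.h> come from the ULFM prototype rather than the MPI standard, and the proposed MPI_COMM_REVOKE_ALL appears only in a comment because its binding does not exist yet.

    #include <mpi.h>
    #include <mpi-ext.h>   /* ULFM prototype: MPIX_Comm_revoke, MPIX_ERR_PROC_FAILED */

    int main(int argc, char **argv)
    {
        MPI_Comm comm_a, comm_b;
        int rank, rc, eclass, buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Two communicators with overlapping process groups: comm_a duplicates
         * MPI_COMM_WORLD, comm_b splits the ranks by parity. */
        MPI_Comm_dup(MPI_COMM_WORLD, &comm_a);
        MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &comm_b);
        MPI_Comm_set_errhandler(comm_a, MPI_ERRORS_RETURN);
        MPI_Comm_set_errhandler(comm_b, MPI_ERRORS_RETURN);

        rc = MPI_Bcast(&buf, 1, MPI_INT, 0, comm_a);
        if (rc != MPI_SUCCESS) {
            MPI_Error_class(rc, &eclass);
            if (eclass == MPIX_ERR_PROC_FAILED) {
                /* A peer died during the collective on comm_a.  Revoking comm_a
                 * alone is not enough: another rank may be blocked in a receive
                 * on comm_b waiting for the failed process and will never reach
                 * this recovery code -- the deadlock discussed above.  Today each
                 * overlapping communicator has to be revoked explicitly. */
                MPIX_Comm_revoke(comm_a);
                MPIX_Comm_revoke(comm_b);
                /* The proposed MPI_COMM_REVOKE_ALL would replace the
                 * per-communicator calls above; its binding is not defined yet. */
            }
        }

        MPI_Comm_free(&comm_b);
        MPI_Comm_free(&comm_a);
        MPI_Finalize();
        return 0;
    }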
