[mpiwg-ft] FTWG Con Call 2018-03-21 Meeting Notes
wesley.bland at intel.com
Wed Mar 21 11:42:30 CDT 2018
Here are the notes from today's meeting:
Note that we are cancelling next week's call due to the SC submission deadline and to give us time to work on reorganizing the text for the proposals.
Attendees
• Intel - Wesley, Rob
• Argonne - Yanfei
• UTK - Aurelien
• LLNL - Ignacio
Agenda
• Noncatastrophic Errors and Error Handling Wrap-up
• ULFM Plans
Noncatastrophic Errors
• Made some minor edits to the proposal based on feedback from the February 2018 meeting
• We'll need to read for a no-no vote at the June 2018 meeting
Error Handling Wrap-up
• 1st vote passed even after discussion of intercommunicator error handling
• 2nd vote scheduled for the June 2018 meeting
ULFM Plans
• Aurelien attempted a reading of the full ticket
• Feedback, started by Martin but echoed by others in the forum (Dan, Tony, etc.), was that they are still uncomfortable with this proposal and would like to see it broken into multiple pieces:
• Error Notification and Discovery
• New error class: MPI_ERR_PROC_FAILED
• New API functions: MPI_COMM_FAILURE_ACK & MPI_COMM_FAILURE_GET_ACKED
• MPI_COMM_REVOKE & MPI_COMM_SHRINK
• These pieces are the most contentious, particularly for Martin, who believes that asynchronous failure notification combined with synchronous recovery introduces a deadlock problem.
• 1 & 2 above can probably be accepted quickly and could form the basis of basic-FT to provide a reliable point-to-point model.
• We would need to keep working to figure out the best way to repair/replace communicators to enable collectives, RMA, Files, etc.
• Ignacio mentioned users who really only care about point-to-point so that they can construct their own collectives.
• As a WG, we agreed to work on splitting ULFM as described above.
• Aurelien will start this work and bring it back to the WG for future discussion.
• We don't expect a lot of progress over the next two weeks during the SC submission period.