[mpiwg-ft] FTWG Con Call 2018-03-21 Meeting Notes

Bland, Wesley wesley.bland at intel.com
Wed Mar 21 11:42:30 CDT 2018


Here are the notes from today's meeting:

Note that we are cancelling next week's call due to the SC submission deadline and to give us time to work on reorganizing the text for the proposals.

https://github.com/mpiwg-ft/ft-issues/wiki/2018-03-21

Thanks,
Wesley

Attendees

• Intel - Wesley, Rob
• Argonne - Yanfei
• UTK - Aurelien
• LLNL - Ignacio

Agenda

• Noncatastrophic Errors and Error Handling Wrap-up
• ULFM Plans

Noncatastrophic Errors

• Made some minor edits to the proposal based on feedback from the February 2018 meeting
• We'll need to read it for a no-no vote at the June 2018 meeting

Error Handling Wrap-up

• 1st vote passed even after discussion of intercommunicator error handling
• 2nd vote scheduled for June 2018 meeting

ULFM

• Aurelien attempted a reading of the full ticket
• Feedback started by Martin but echoed by others in the forum (Dan, Tony, etc.) was that they are still uncomfortable with this proposal and would like to see it broken into multiple pieces:
  1. Error Notification and Discovery
     • New error class: MPI_ERR_PROC_FAILED
     • New API functions: MPI_COMM_FAILURE_ACK & MPI_COMM_FAILURE_GET_ACKED
  2. Agreement
     • MPI_COMM_AGREE
  3. Recovery
     • MPI_COMM_REVOKE & MPI_COMM_SHRINK
     • These pieces are the most contentious, particularly for Martin, who believes that combining asynchronous failure notification with synchronous recovery introduces a deadlock problem.
• 1 & 2 above can probably be accepted quickly and could form the basis of basic-FT to provide a reliable point-to-point model.
• We would need to keep working to figure out the best way to repair/replace communicators to enable collectives, RMA, Files, etc.
• Ignacio mentioned users who really only care about point-to-point so they can construct their own collectives.
• As a WG, we agreed to work on splitting ULFM as described above.
• Aurelien will start this work and bring it back to the WG for future discussion.
• We don't expect a lot of progress over the next two weeks during the SC submission period.