[mpiwg-ft] 2017 10 11 · mpiwg-ft/ft-issues Wiki

Bland, Wesley wesley.bland at intel.com
Wed Oct 11 11:46:24 CDT 2017


The notes for today's call are available on the wiki.

https://github.com/mpiwg-ft/ft-issues/wiki/2017-10-11

A badly formatted version is copied below.

Thanks,
Wesley

===

Attendees

	• Intel - Wesley
	• ORNL - Geoffroy
	• Argonne - Yanfei, Ken
	• Auburn - Nawrin
	• Sandia - Keita
	• UTC - Tony

RMA Fault Tolerance (Data Resilience)

Link to pull request - https://github.com/mpiwg-ft/mpi-standard/pull/4

Summary of previous discussion

	• Jeff - RMA is different from communicator-based FT because it is more data focused, and detecting process failure in RMA is both more expensive and less likely. We should add more text that focuses on conveying that the data is unavailable.
	• Others - This is a bit out of scope for the initial ULFM proposal but still important. Maybe this should be an accompanying proposal.

Discussion on today's call

	• Wesley - After reading through the proposal again, I think it makes sense to bring this into ULFM proper. It completes the picture for RMA because if we can detect process failure, we do, but we can also express failure in other, cheaper ways.
	• Keita - The expected recovery model is unclear here.
		• Good point: We need to add some advice saying that we expect the user to free the window, fix the data, and recreate the window. They may or may not discover a process failure during this procedure.
	• Wesley - The advice about MPI_WIN_FREE needs to be expanded to cover MPI_ERR_DATA_UNAVAILABLE around lines 418-419.
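
The recovery model discussed above (free the window, fix the data, recreate the window) might look roughly like the sketch below. This is only an illustration of the control flow, not proposal text. It deliberately uses no MPI calls: every sketch_* name, the SKETCH_ERR_DATA_UNAVAILABLE code, and the checkpoint buffer used to "fix" the data are hypothetical stand-ins invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for MPI_SUCCESS and the proposed
 * data-unavailable error class. Not real MPI constants. */
enum { SKETCH_OK = 0, SKETCH_ERR_DATA_UNAVAILABLE = 1 };

/* Mock window; a real program would hold an MPI_Win instead. */
typedef struct {
    double *base;    /* memory exposed through the window */
    size_t  count;   /* number of elements exposed */
    int     data_ok; /* cleared when the exposed data becomes unavailable */
    int     created; /* how many times the window has been (re)created */
} sketch_win;

/* Stand-in for window creation (MPI_Win_create in a real program). */
static void sketch_win_create(sketch_win *w, double *base, size_t count)
{
    assert(base != NULL);
    w->base = base;
    w->count = count;
    w->data_ok = 1;
    w->created += 1;
}

/* Stand-in for MPI_Win_free. */
static void sketch_win_free(sketch_win *w)
{
    w->base = NULL;
    w->count = 0;
}

/* Stand-in for one RMA epoch: adds 1.0 to every exposed element,
 * or reports that the data is unavailable. */
static int sketch_epoch(sketch_win *w)
{
    if (!w->data_ok)
        return SKETCH_ERR_DATA_UNAVAILABLE;
    for (size_t i = 0; i < w->count; i++)
        w->base[i] += 1.0;
    return SKETCH_OK;
}

/* Recovery model from the notes: free, fix, recreate, then retry.
 * "Fixing" is assumed here to mean restoring from a checkpoint copy. */
static int sketch_epoch_with_recovery(sketch_win *w, double *data,
                                      const double *checkpoint,
                                      size_t count, int max_retries)
{
    int rc = sketch_epoch(w);
    for (int attempt = 0;
         rc == SKETCH_ERR_DATA_UNAVAILABLE && attempt < max_retries;
         attempt++) {
        sketch_win_free(w);                             /* 1. free the window */
        memcpy(data, checkpoint, count * sizeof *data); /* 2. fix the data */
        sketch_win_create(w, data, count);              /* 3. recreate the window */
        rc = sketch_epoch(w);                           /* retry the epoch */
    }
    return rc;
}
```

In a real MPI program the window would presumably need an error handler such as MPI_ERRORS_RETURN (set with MPI_Win_set_errhandler) so that the error code reaches the caller. The free and recreate steps are collective, so a process failure could also surface during recovery itself.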

ULFM

Aurelien merged the proposal to detect when a communicator is revoked. This is now part of ULFM proper.

For next week:

	• All - Go over the MPI_ERR_DATA_UNAVAILABLE proposal text and leave comments. Specifically look at new text proposals in the comments.

