[mpiwg-ft] FTWG 2017-10-18 Notes

Bland, Wesley wesley.bland at intel.com
Thu Oct 19 12:59:41 CDT 2017


The notes from the meeting are now posted:

https://github.com/mpiwg-ft/ft-issues/wiki/2017-10-18

Thanks,
Wesley

===

### Attendees

* Intel - Wesley, Jim
* UTK - Aurelien
* ORNL - Geoffroy
* Argonne - Yanfei, Ken
* Sandia - Keita
* LLNL - Murali, Ignacio

## RMA Fault Tolerance (Data Resilience)

[Link to pull request](https://github.com/mpiwg-ft/mpi-standard/pull/4)

Continued discussion of whether this work is useful.

* Aurelien - The description of the failure model is unclear. We need to better differentiate between `MPI_ERR_PROC_FAILED` and `MPI_ERR_DATA_UNAVAILABLE`.
* Jim - Should `MPI_ERR_DATA_UNAVAILABLE` be usable outside of RMA? Does it apply to point-to-point or collectives?
* Jim/Aurelien - Is the justification for this work that flush doesn't allow detection of process failure? They're still not convinced that this is true.
  * As long as flush can complete successfully, do we really need to tell you if a process failed on the other end?
* Jim - On the other hand, it might be true that we can't guarantee any process failure detection in _any_ RMA operation. Maybe we should just not allow process failure errors (as opposed to "upgrading" other types of errors to process failure).
* Jim - One case where the proposal still makes sense as written: a process whose data was corrupted because another process failed during a put. If a third process reads the bad memory, it _could_ get `MPI_ERR_DATA_UNAVAILABLE` instead of `MPI_ERR_PROC_FAILED`.
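
To make the three-process scenario in the last point concrete, here is a minimal sketch in plain C. It does not use MPI. The enum values, the `target_state` struct, and `classify_read` are hypothetical names invented for illustration; they only model the distinction under discussion and are not part of the pull request or of any MPI implementation. The ordering of the checks (process failure first, then data availability) is also an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the error classes discussed above.
 * These are NOT real MPI constants. */
typedef enum {
    ERR_NONE,
    ERR_PROC_FAILED,       /* models MPI_ERR_PROC_FAILED */
    ERR_DATA_UNAVAILABLE   /* models the proposed MPI_ERR_DATA_UNAVAILABLE */
} rma_err;

/* Hypothetical view of the process that owns the window memory. */
typedef struct {
    bool alive;        /* is the owning process itself still alive? */
    bool data_intact;  /* false if a put into this memory was cut short
                          because the origin of that put failed */
} target_state;

/* What a reader *could* be told when it reads from the target
 * (assumed policy: report the target's own failure first). */
static rma_err classify_read(const target_state *t)
{
    if (!t->alive)
        return ERR_PROC_FAILED;       /* the target itself is gone */
    if (!t->data_intact)
        return ERR_DATA_UNAVAILABLE;  /* target alive, memory untrustworthy */
    return ERR_NONE;
}

/* Printable name for an error value, for logging. */
static const char *err_name(rma_err e)
{
    switch (e) {
    case ERR_PROC_FAILED:      return "ERR_PROC_FAILED";
    case ERR_DATA_UNAVAILABLE: return "ERR_DATA_UNAVAILABLE";
    default:                   return "ERR_NONE";
    }
}
```

In this model, process A fails during a put into process B's window, so B is described by `alive = true, data_intact = false`. A third process C reading from B would then see the data-unavailable class, since the process it is talking to has not failed.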

### Bottom Line

* We're still unclear on the failure model expected here. We probably need to get more feedback from Jeff.
* We also aren't convinced that process failure semantics are insufficient to give the user all of the actionable information they need.

## For next week:
* Get feedback from Jeff when he comes back from paternity leave.

## In the future:
* Start discussing text for FA-MPI and Reinit.