[mpiwg-ft] FTWG 2017-10-18 Notes
wesley.bland at intel.com
Thu Oct 19 12:59:41 CDT 2017
The notes from the meeting are now posted:
* Intel - Wesley, Jim
* UTK - Aurelien
* ORNL - Geoffroy
* Argonne - Yanfei, Ken
* Sandia - Keita
* LLNL - Murali, Ignacio
## RMA Fault Tolerance (Data Resilience)
[Link to pull request](https://github.com/mpiwg-ft/mpi-standard/pull/4)
Continued discussion of whether this work is useful.
* Aurelien - The description of the failure model is unclear. We need to better differentiate between `MPI_ERR_PROC_FAILED` and `MPI_ERR_DATA_UNAVAILABLE`.
* Jim - Should `MPI_ERR_DATA_UNAVAILABLE` be usable outside of RMA? Does it apply to point-to-point or collectives?
* Jim/Aurelien - Is the justification for this work that flush doesn't allow detection of process failure? They're still not convinced that this is true.
* As long as flush can complete successfully, do we really need to tell you if a process failed on the other end?
* Jim - On the other hand, it might be true that we can't guarantee any process failure detection in _any_ RMA operation. Maybe we should just not allow process failure errors (as opposed to "upgrading" other types of errors to process failure).
* Jim - One case where this still makes sense as is: a process's data is corrupted because another process failed during a put. If a third process reads the bad memory, it _could_ get `MPI_ERR_DATA_UNAVAILABLE` instead of `MPI_ERR_PROC_FAILED`.
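
The three-process case in the last bullet can be sketched as a small decision function. This is only an illustration of the distinction under discussion, not text from the pull request: the `sketch_` names, the enum values, and the two-flag region model are all assumptions made up for this example, and nothing here calls MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the error classes discussed in the notes.
 * These are NOT values defined by any MPI implementation. */
typedef enum {
    SKETCH_SUCCESS = 0,
    SKETCH_ERR_PROC_FAILED,
    SKETCH_ERR_DATA_UNAVAILABLE
} sketch_err_t;

/* Assumed model of one exposed memory region, as seen by a reader. */
typedef struct {
    int owner_alive;  /* 1 if the process exposing the memory is still alive */
    int data_intact;  /* 0 if a writer failed partway through a put into it */
} sketch_region_t;

/* Classify what a third process reading the region could be told.
 * A dead owner is a process failure; a live owner holding data that a
 * failed writer left half-written is reported as unavailable data. */
sketch_err_t sketch_read_status(const sketch_region_t *region)
{
    assert(region != NULL);
    if (!region->owner_alive) {
        return SKETCH_ERR_PROC_FAILED;
    }
    if (!region->data_intact) {
        return SKETCH_ERR_DATA_UNAVAILABLE;
    }
    return SKETCH_SUCCESS;
}
```

In this model, a region with `owner_alive = 1` and `data_intact = 0` is the case Jim describes: the reader's target has not failed, so a process failure error would be misleading, yet the data it would read is not usable.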
### Bottom Line
* We're still unclear on the failure model expected here. We probably need to get more feedback from Jeff.
* We also aren't convinced that the existing process failure semantics are insufficient to give the user all of the actionable information they need.
## For next week:
* Get feedback from Jeff when he comes back from paternity leave.
## In the future:
* Start discussing text for FA-MPI and Reinit.