[mpiwg-ft] MPI_Comm_revoke behavior
bosilca at icl.utk.edu
Thu Dec 5 10:37:30 CST 2013
On Dec 5, 2013, at 15:14 , Richard Graham <richardg at mellanox.com> wrote:
> [rich] the original intent was to allow for full restoration of communicators after failure, with minimal impact on those ranks that did not fail (don’t want to get into what that means now …). Those goals were reduced for pragmatic reasons.
The goals were not reduced; ULFM is a completely new approach based on a pragmatic design. To emphasize what Wesley suggested, ULFM is not an all-encompassing solution (unlike previous proposals). Instead, it is a __minimalistic__ set of building blocks for stabilization and recovery, allowing the construction of more complex FT mechanisms. So far, the exploration of such complementary FT approaches has remained in the realm of research, outside the WG scope.
> I want to make sure that when/if there is work continued in this direction, the current proposal does not preclude this. One of the issues raised to me recently is that after a revoke one will not be able to accomplish such a goal on the remaining ranks – e.g., ranks will be reassigned. I am following up very specifically on this question.
Ongoing research to provide message logging, transactions, FT-MPI-like and other complex protocols on top of ULFM has shown that the current approach provides a workable and portable set of primitives. The effort to provide full recovery of a communicator should follow the same approach before becoming a potential candidate for consideration in the WG.