[mpiwg-ft] MPI_Comm_revoke behavior

Richard Graham richardg at mellanox.com
Fri Dec 6 07:35:37 CST 2013

I would disagree with your characterization of the previous approach as anything but minimalistic.

Let's talk about this in the WG slot on Monday. I have to say that I somehow completely missed the point that this is intended to "be it", so I need to carefully re-evaluate the proposal in that light. My main concern is that the standard is supposed to provide a means for supporting a broad range of FT methodologies on top of this. We need to make sure that the approaches people do want to take are not being blocked. Also, concern has been expressed that the resulting behavior will prevent many users from using it, so we need to talk through these issues (I will say that the behavior described to me is a show stopper for many users, so we need to make sure there is not a misunderstanding).


From: mpiwg-ft [mailto:mpiwg-ft-bounces at lists.mpi-forum.org] On Behalf Of George Bosilca
Sent: Thursday, December 05, 2013 11:38 AM
To: MPI WG Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [mpiwg-ft] MPI_Comm_revoke behavior

On Dec 5, 2013, at 15:14, Richard Graham <richardg at mellanox.com> wrote:

[rich] the original intent was to allow for full restoration of communicators after failure, with minimal impact on those ranks that did not fail (don't want to get into what that means now ...).  Those goals were reduced for pragmatic reasons.

The goals were not reduced; ULFM is a completely new approach based on a pragmatic design. To emphasize what Wesley suggested, ULFM is not an all-encompassing solution (unlike previous proposals). Instead, it is a __minimalistic__ set of building blocks for stabilization and recovery, allowing the construction of more complex FT mechanisms. So far, the exploration of such complementary FT approaches has remained in the realm of research, outside the WG scope.

I want to make sure that if and when work continues in this direction, the current proposal does not preclude it. One of the issues raised to me recently is that after a revoke one will not be able to accomplish such a goal on the remaining ranks, e.g., ranks will be reassigned. I am following up very specifically on this question.

Ongoing research to provide message logging, transactions, FT-MPI-like and other complex protocols on top of ULFM has shown that the current approach provides a workable and portable set of primitives. The effort to provide full recovery of a communicator should follow the same approach before becoming a potential candidate for consideration in the WG.

