[mpiwg-ft] MPI_Comm_revoke behavior

Aurélien Bouteiller bouteill at icl.utk.edu
Fri Dec 6 08:17:08 CST 2013


Being able to restore a deployment with rank isomorphism is certainly a valid and important concern. Fortunately, the requirement to support this scenario has been accounted for from day 1. It is one of the simple use cases that many users have already deployed. I’ll present code snippets on Monday.

Aurelien 


On Dec 6, 2013, at 09:15, Richard Graham <richardg at mellanox.com> wrote:

> This was raised as a concern at SC by an expert in the field, specifically the issue of preserving rank IDs.  We just need to follow up and ensure there is no misunderstanding.
> 
> Rich
> 
> ------Original Message------
> From: Wesley Bland
> To: MPI WG Fault Tolerance and Dynamic Process Control working Group
> Cc: MPI WG Fault Tolerance and Dynamic Process Control working Group
> ReplyTo: MPI WG Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [mpiwg-ft] MPI_Comm_revoke behavior
> Sent: Dec 6, 2013 9:08 AM
> 
> Rich, 
> This is something we've discussed many times on the conference calls and the mailing list, but we can discuss it on Monday as well. Aurélien will also be presenting slides during the FT plenary time demonstrating sample use cases. We haven't yet come up with a use case that the current proposal would exclude.
> Wesley  
> On Dec 6, 2013, at 7:47 AM, Richard Graham <richardg at mellanox.com> wrote:
> 
> I would disagree with your characterization of the previous approach as anything but minimalistic.
>  
> Let’s talk about this in the WG slot on Monday.  I have to say that in some way I totally missed the point that this is intended to “be it”, so I need to carefully re-evaluate the proposal in that light.  My main concern is that the standard is supposed to provide a means for supporting a broad range of FT methodologies on top of this.  We need to make sure that none of the approaches people do want to take are being blocked.  Also, concern has been expressed that the resulting behavior will prevent many users from using it, so we need to talk through these issues (I will say that the behavior described to me is a show stopper for many users, so we need to make sure there is not a misunderstanding).
>  
> Rich
>  
> From: mpiwg-ft [mailto:mpiwg-ft-bounces at lists.mpi-forum.org] On Behalf Of George Bosilca
> Sent: Thursday, December 05, 2013 11:38 AM
> To: MPI WG Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [mpiwg-ft] MPI_Comm_revoke behavior
>  
>  
> On Dec 5, 2013, at 15:14 , Richard Graham <richardg at mellanox.com> wrote:
> 
> 
> 
> [rich] the original intent was to allow for full restoration of communicators after failure, with minimal impact on those ranks that did not fail (don’t want to get into what that means now …).  Those goals were reduced for pragmatic reasons.
>  
> The goals were not reduced; ULFM is a completely new approach based on a pragmatic design. To emphasize what Wesley suggested, ULFM is not an all-encompassing solution (unlike previous proposals). Instead, it is a __minimalistic__ set of building blocks for stabilization and recovery, allowing the construction of more complex FT mechanisms. So far, the exploration of such complementary FT approaches has remained in the realm of research, outside the WG.
> _______________________________________________
> mpiwg-ft mailing list
> mpiwg-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375









