[Mpi3-ft] Fault recovery of multiple communication libraries

Thomas Herault herault.thomas at gmail.com
Thu Feb 12 15:07:41 CST 2009


Hello,

We have not considered mixed communication models yet, and I am not  
sure we need to do so.
Let's call A the process that fails, and B the set of processes that
should be notified. Consider the two following cases:
  - A also uses MPI to communicate with processes from B. Then, when a
process from B tries to communicate with A, it will be notified by MPI
of the error (see the sketch after this list).
  - A does not use MPI to communicate with processes from B. Then, the
application does not need help from MPI to deal with the failure: MPI
is not broken, and thus does not have to be mended.
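
To make the first case concrete, here is a minimal sketch in plain
MPI. It assumes the implementation reports the failure through the
error code of the communication call once MPI_ERRORS_RETURN is set,
and uses a hypothetical rank for A:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Ask MPI to return error codes instead of aborting the job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank_of_A = 0;  /* hypothetical rank of the failed process A */
        int buf;

        /* A process in B learns of A's failure when this call fails. */
        int rc = MPI_Recv(&buf, 1, MPI_INT, rank_of_A, 0,
                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            fprintf(stderr, "communication with A failed (code %d)\n", rc);
            /* recovery logic for the processes in B goes here */
        }

        MPI_Finalize();
        return 0;
    }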

Thomas


On Feb 12, 2009, at 14:46, Krishnamoorthy, Sriram wrote:

> I would like to understand how MPI can be notified of a failure.
> Consider another communication library (ARMCI/GASNet/...) identifying
> a failure through its own mechanisms. In the current model, how can it
> notify MPI to verify/reconfigure/recover from the error, for example
> by performing an MPI communication to the failed process?
>
> Conversely, can a communication library register to be notified of an
> error that MPI identifies and recovers from, so that the library can
> take appropriate action?
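>
> A hypothetical sketch of what such hooks might look like (the MPIX_
> names are purely illustrative, not part of any MPI standard):
>
>     #include <mpi.h>
>
>     /* Called by MPI when it detects a failure, so that another
>        runtime (ARMCI, GASNet, ...) can react. */
>     typedef void (*MPIX_Failure_cb)(MPI_Comm comm, int failed_rank,
>                                     void *context);
>     int MPIX_Register_failure_callback(MPI_Comm comm,
>                                        MPIX_Failure_cb cb,
>                                        void *context);
>
>     /* Called by another runtime to tell MPI about a failure it
>        detected through its own mechanisms. */
>     int MPIX_Report_failure(MPI_Comm comm, int failed_rank);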
>
> Sriram.K
>
>
> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org
> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of
> mpi3-ft-request at lists.mpi-forum.org
> Sent: Thursday, February 12, 2009 11:17 AM
> To: mpi3-ft at lists.mpi-forum.org
> Subject: mpi3-ft Digest, Vol 13, Issue 4
>
> Send mpi3-ft mailing list submissions to
> 	mpi3-ft at lists.mpi-forum.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> or, via email, send a message with subject or body 'help' to
> 	mpi3-ft-request at lists.mpi-forum.org
>
> You can reach the person managing the list at
> 	mpi3-ft-owner at lists.mpi-forum.org
>
> When replying, please edit your Subject line so it is more specific  
> than
> "Re: Contents of mpi3-ft digest..."
>
>
> Today's Topics:
>
>   1. Re: Communicator Virtualization as a step forward (Josh Hursey)
>   2. Re: Communicator Virtualization as a step forward
>      (Graham, Richard L.)
>   3. Re: Communicator Virtualization as a step forward (George Bosilca)
>   4. Re: Communicator Virtualization as a step forward
>      (Nathan DeBardeleben)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 12 Feb 2009 12:23:49 -0500
> From: Josh Hursey <jjhursey at open-mpi.org>
> Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
> 	Group"	<mpi3-ft at lists.mpi-forum.org>
> Message-ID: <4554DA1F-BEF7-4F03-8B1C-5B5BF2783477 at open-mpi.org>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> Yeah, I was planning on using that document as a starting point. I
> wanted to look over it again and see if anything needs to change given
> some of the discussions that we have been having in the group. It may
> also need some additional language about MPI2 interfaces. It has been
> a while since I have looked over this particular document. I would
> also like to add some more control for process startup, but we may
> decide to take that on as a secondary step.
>
> Does UTK still have the LaTeX for that document somewhere? Do you know
> if Graham would be interested in participating in this development?
>
> Cheers,
> Josh
>
> On Feb 12, 2009, at 10:43 AM, George Bosilca wrote:
>
>> Josh, the document that you talk about already exists. It was
>> published in ISC'04. Here is the link:
>> http://www.netlib.org/utk/people/JackDongarra/PAPERS/isc2004-FT-MPI.pdf
>>
>> george.
>
>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 12 Feb 2009 14:01:15 -0500
> From: "Graham, Richard L." <rlgraham at ornl.gov>
> Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> To: mpi3-ft at lists.mpi-forum.org
> Message-ID:
> 	<537C6C0940C6C143AA46A88946B854170F0FC367 at ORNLEXCHANGE.ornl.gov>
> Content-Type: text/plain; charset=UTF-8
>
> Josh,
>  Very early on in the process we got feedback from users that an
> FT-MPI-like interface was of no interest to them.  They would just as
> soon terminate the application and restart rather than use this sort
> of approach.  Having said that, there is already previous
> demonstration that the FT-MPI approach is useful for some
> applications.  If you look closely at the spec, the FT-MPI approach
> is a subset of the current spec.
>  I am working on pulling out the APIs and expanding the explanations.
> The goal is to have this out before the next telecon in two weeks.
>  Prototyping is under way, with UT, Cray, and ORNL committed to
> working on this.  Right now supporting infrastructure is being
> developed.
>  Your point on the MPI-2 interfaces is good.  A couple of people had
> started to look at this when it looked like it might make it into the
> 2.2 version.  The changes seemed to be more extensive than expected,
> so work stopped.  This does need to be picked up on.
>
> Rich
> ------Original Message------
> From: Josh Hursey
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> ReplyTo: MPI 3.0 Fault Tolerance and Dynamic Process Control working
> Group
> Sent: Feb 12, 2009 8:31 AM
> Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
> It is a good point that local communicator reconstruction operations
> require a fundamental change in the way communicators are handled by
> MPI. With that in mind, it would probably take as much effort (if not
> more) to implement a virtualized version on top of MPI. So maybe it
> will not help as much as I had originally thought. Outside of the
> paper, do we have the interface and semantics of these operations
> described anywhere? I think that would help in trying to keep pace
> with the use cases.
>
> The spirit of the suggestion was to separate what (I think) we can
> agree on as a first step (the FT-MPI-like model) from the communicator
> reconstruction, which I see as a secondary step. If we stop to write
> up what the FT-MPI-like model should look like in the standard, then I
> think we can push forward on other fronts (prototyping of step 1,
> standardization of step 1, application implementations using step 1)
> while still trying to figure out how communicator reconstruction
> should be expressed in the standard such that it is usable in target
> applications.
>
> So my motion is that the group explicitly focus effort on writing a
> document describing the FT-MPI-like model we consider as a foundation,
> written in the MPI standard's language, and present it to the MPI
> Forum for a straw vote in the next couple of meetings. We can then
> continue evolving this document to support more advanced features,
> like communicator reconstruction.
>
> I am willing to put effort into making such a document. However, I
> would like explicit support from the working group in pursuing such an
> effort, and the help of anyone interested in helping write up/define
> this specification.
>
> So, what do people think about taking this first step?
>
> -- Josh
>
>
> On Feb 11, 2009, at 5:57 PM, Greg Bronevetsky wrote:
>
>> I don't understand what you mean by "We can continue to pursue
>> communicator reconstruction interfaces through a virtualization layer
>> above MPI."  To me it seems that such interfaces will effectively need
>> to implement communicators on top of MPI in order to be operational,
>> which will take about as much effort as implementing them inside MPI.
>> In particular, I don't see a way to recreate a communicator using the
>> MPI interface without making collective calls. However, we're defining
>> MPI_Rejoin (or whatever it's called) to be a local operation. This
>> means that we cannot use the MPI communicators interface and must
>> instead implement our own communicators.
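>>
>> To see the constraint concretely, a minimal sketch in plain MPI (the
>> MPI_Rejoin named above is the proposed local operation from this
>> thread, not standard MPI; only the collective call is shown):
>>
>>     #include <mpi.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         MPI_Init(&argc, &argv);
>>
>>         /* Collective: every process of MPI_COMM_WORLD must make this
>>            call together, or it deadlocks. A single restarted process
>>            rejoining locally cannot be built from calls like this. */
>>         MPI_Comm dup;
>>         MPI_Comm_dup(MPI_COMM_WORLD, &dup);
>>
>>         MPI_Comm_free(&dup);
>>         MPI_Finalize();
>>         return 0;
>>     }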
>>
>> The bottom line is that it does make sense to start implementing
>> support for the FT-MPI model and evolve that to a more elaborate
>> model. However, I don't think that working on the rest above MPI will
>> save us any effort or time.
>>
>> Greg Bronevetsky
>> Post-Doctoral Researcher
>> 1028 Building 451
>> Lawrence Livermore National Lab
>> (925) 424-5756
>> bronevetsky1 at llnl.gov
>>
>> At 01:17 PM 2/11/2009, Josh Hursey wrote:
>>> In our meeting yesterday, I was sitting in the back trying to take
>>> in the complexity of communicator recreation. It seems that much of
>>> the confusion at the moment is that we (at least I) are still not
>>> exactly sure how the interface should be defined and implemented.
>>>
>>> I think of the process fault tolerance specification as a series of
>>> steps that can be individually specified, building upon each step
>>> while working towards a specific goal set. From this I was asking
>>> myself: are there any foundational concepts that we can define now
>>> so that folks can start implementation?
>>>
>>> That being said, I suggest that we consider FT-MPI's model, in which
>>> all communicators except the base 3 (COMM_WORLD, COMM_SELF,
>>> COMM_NULL) are destroyed on a failure, as the starting point for
>>> implementation. This would get us started. We can continue to pursue
>>> communicator reconstruction interfaces through a virtualization
>>> layer above MPI. We can use this layer to experiment with the
>>> communicator recreation mechanisms in conjunction with applications
>>> while pursuing the first step implementation. Once we start to agree
>>> on the interface for communicator reconstruction, then we can start
>>> to push it into the MPI standard/library for a better
>>> standard/implementation.
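>>>
>>> A minimal sketch of that model (assuming only that a failure leaves
>>> MPI_COMM_WORLD usable and destroys derived communicators, which the
>>> application then re-derives):
>>>
>>>     #include <mpi.h>
>>>
>>>     /* Re-derive application communicators from the surviving world;
>>>        called at startup and again after a failure notification. */
>>>     static void rebuild(MPI_Comm *row)
>>>     {
>>>         int rank;
>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>         MPI_Comm_split(MPI_COMM_WORLD, rank / 4, rank, row); /* rows of 4 */
>>>     }
>>>
>>>     int main(int argc, char **argv)
>>>     {
>>>         MPI_Init(&argc, &argv);
>>>         MPI_Comm row;
>>>         rebuild(&row);  /* initial derivation */
>>>         /* ... after a failure, the old handle is gone: rebuild again */
>>>         MPI_Comm_free(&row);
>>>         MPI_Finalize();
>>>         return 0;
>>>     }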
>>>
>>> The communicator virtualization library is a staging area for these
>>> interface ideas that we seem to be struggling with. The
>>> virtualization
>
> ------Original Message Truncated------
>
>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 12 Feb 2009 14:16:09 -0500
> From: George Bosilca <bosilca at eecs.utk.edu>
> Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
> 	Group"	<mpi3-ft at lists.mpi-forum.org>
> Message-ID: <CD306C06-5662-4060-9E95-255852FE7BB1 at eecs.utk.edu>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> I don't necessarily agree with the statement that FT-MPI is a subset
> of the current spec. Since the current spec can be implemented on top
> of FT-MPI (with help from the PMPI interface), this tends to prove the
> opposite.
>
> However, I agree there are several features in the current spec that
> were not covered by the FT-MPI spec, but these features can be
> implemented on top of FT-MPI. As far as I understood, this is what
> Josh proposed, as this will give a quick start (i.e. an FT-MPI
> implementation is already available).
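>
> As a reference point for the PMPI route, a wrapper library redefines
> an MPI call and forwards to the underlying entry point; a minimal
> sketch (the error handling shown is illustrative only):
>
>     #include <mpi.h>
>     #include <stdio.h>
>
>     /* The MPI library's own implementation stays reachable as
>        PMPI_Send; this wrapper is where FT semantics could be mapped. */
>     int MPI_Send(void *buf, int count, MPI_Datatype type,
>                  int dest, int tag, MPI_Comm comm)
>     {
>         int rc = PMPI_Send(buf, count, type, dest, tag, comm);
>         if (rc != MPI_SUCCESS)
>             fprintf(stderr, "send to rank %d failed: %d\n", dest, rc);
>         return rc;
>     }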
>
>   george.
>
> On Feb 12, 2009, at 14:01 , Graham, Richard L. wrote:
>
>> ------Original Message Truncated------
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 12 Feb 2009 12:16:23 -0700
> From: Nathan DeBardeleben <ndebard at lanl.gov>
> Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
> 	Group"	<mpi3-ft at lists.mpi-forum.org>
> Message-ID: <49947587.6070604 at lanl.gov>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> I really worry about taking the advice of users saying they would
> rather terminate and restart an application than have some assistance
> to help them ride through a problem.  If they are worried about
> programming language/model changes, I would encourage them to open
> their eyes.  Major programming model changes are predicted for
> beyond-petascale computers, and even petascale computers are having a
> hard time with classical MPI programming.  I think we're more likely
> to see MPI as an underpinning of next-gen models.  These users polled
> might not be extreme-scale users, however.
>
> Working at a laboratory positioning itself for exascale, we are
> intimately aware of the fact that "oh, just rerun it" is a worthless
> conclusion.  I wish I had more time to assist in this matter, but our
> laboratory has cracked down on participation in things that are not
> directly associated with charge codes, so it is hard for me to spend
> any sizable amount of time.
>
> Please, though, consider the user base when they say things like
> that.  I'm sure Rich is well aware of these similar concerns.  While
> MPI fault tolerance might not be important to users running 1000-node
> systems, those of us approaching a system mean time to interrupt under
> an hour are on the opposite side of that spectrum.
>
> Are the small-system users pushing for FT to not be inside of MPI?
> This is why I was so in favor of some sort of componentized MPI, where
> users could exclude FT if they weren't worried about reliability (and
> thereby gain performance), but those of us in more dangerous
> reliability regimes could take the performance penalty and compile in
> / load in / configure in / whatever FT.
>
> -- Nathan
>
> ---------------------------------------------------------------------
> Nathan DeBardeleben, Ph.D.
> Los Alamos National Laboratory
> High Performance Computing Systems Integration (HPC-5)
> phone: 505-667-3428
> email: ndebard at lanl.gov
> ---------------------------------------------------------------------
>
>
>
> Graham, Richard L. wrote:
>> ------Original Message Truncated------
>
> ------------------------------
>
>
> End of mpi3-ft Digest, Vol 13, Issue 4
> **************************************
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>




