[Mpi3-ft] Fault recovery of multiple communication libraries

Krishnamoorthy, Sriram sriram at pnl.gov
Thu Feb 12 13:46:38 CST 2009


I would like to understand how MPI can be notified of a failure.
Consider another communication library (ARMCI/GASnet/...) identifying a
failure through its own mechanisms. In the current model, how can it
notify MPI to verify/reconfigure/recover from the error, for example by
performing an MPI communication to the failed process?

Conversely, can a communication library register to be notified of an
error that MPI identifies and recovers from, so that the library
can take appropriate action?
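
To make the two questions concrete, here is a minimal sketch of the kind of
two-way notification I have in mind. It is purely illustrative: the
FailureRegistry class, its methods, and the library names used as keys are
hypothetical and are not part of MPI, ARMCI, or GASNet. It only assumes that
some shared runtime service could sit between the libraries.

```python
from typing import Callable, Dict, List, Set

# Hypothetical callback type: receives the failed rank and the name of
# the component that detected the failure.
FailureCallback = Callable[[int, str], None]


class FailureRegistry:
    """Toy stand-in for a shared failure-notification service."""

    def __init__(self) -> None:
        self._listeners: Dict[str, FailureCallback] = {}
        self._failed: Set[int] = set()

    def register(self, name: str, callback: FailureCallback) -> None:
        """A library registers to hear about failures found by others."""
        self._listeners[name] = callback

    def report_failure(self, rank: int, detected_by: str) -> List[str]:
        """Record a failure and notify every library except the reporter.

        Returns the names of the libraries that were notified; a rank
        that is already known to have failed triggers no notifications.
        """
        if rank in self._failed:
            return []
        self._failed.add(rank)
        notified = []
        for name, callback in self._listeners.items():
            if name != detected_by:
                callback(rank, detected_by)
                notified.append(name)
        return notified

    def failed_ranks(self) -> Set[int]:
        return set(self._failed)


# Usage: each library records what it was told.
registry = FailureRegistry()
seen_by_mpi = []
seen_by_armci = []
registry.register("mpi", lambda rank, src: seen_by_mpi.append((rank, src)))
registry.register("armci", lambda rank, src: seen_by_armci.append((rank, src)))

registry.report_failure(3, detected_by="armci")  # ARMCI tells MPI
registry.report_failure(5, detected_by="mpi")    # MPI tells ARMCI
```

The first call corresponds to the first question (another library informing
MPI), the second to the converse. What MPI should then do on being notified
(verify, reconfigure, recover) is exactly the part I am asking about.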

Sriram.K


-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org
[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of
mpi3-ft-request at lists.mpi-forum.org
Sent: Thursday, February 12, 2009 11:17 AM
To: mpi3-ft at lists.mpi-forum.org
Subject: mpi3-ft Digest, Vol 13, Issue 4


Today's Topics:

   1. Re: Communicator Virtualization as a step forward (Josh Hursey)
   2. Re: Communicator Virtualization as a step forward
      (Graham, Richard L.)
   3. Re: Communicator Virtualization as a step forward (George Bosilca)
   4. Re: Communicator Virtualization as a step forward
      (Nathan DeBardeleben)


----------------------------------------------------------------------

Message: 1
Date: Thu, 12 Feb 2009 12:23:49 -0500
From: Josh Hursey <jjhursey at open-mpi.org>
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
	Group"	<mpi3-ft at lists.mpi-forum.org>
Message-ID: <4554DA1F-BEF7-4F03-8B1C-5B5BF2783477 at open-mpi.org>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

Yeah I was planning on using that document as a starting point. I wanted
to look over it again and see if anything needs to change given some of
the discussions that we have been having in the group. It may also need
some additional language about MPI2 interfaces. It has been a while since
I have looked over this particular document. I would also like to add
some more control for process startup, but we may decide to take that on
as a secondary step.

Does UTK still have the LaTeX for that document somewhere? Do you know
if Graham would be interested in participating in this development?

Cheers,
Josh

On Feb 12, 2009, at 10:43 AM, George Bosilca wrote:

> Josh, the document that you talk about already exists. It was published
> in ISC'04. Here is the link:
> http://www.netlib.org/utk/people/JackDongarra/PAPERS/isc2004-FT-MPI.pdf
>
> george.



------------------------------

Message: 2
Date: Thu, 12 Feb 2009 14:01:15 -0500
From: "Graham, Richard L." <rlgraham at ornl.gov>
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
To: mpi3-ft at lists.mpi-forum.org
Message-ID:
	<537C6C0940C6C143AA46A88946B854170F0FC367 at ORNLEXCHANGE.ornl.gov>
Content-Type: text/plain; charset=UTF-8

Josh,
  Very early on in the process we got feedback from users that an ft-mpi
like interface was of no interest to them.  They would just as soon
terminate the application and restart rather than use this sort of
approach.  Having said that, there is already previous demonstration
that the ft-mpi approach is useful for some applications.  If you look
closely at the spec, the ft-mpi approach is a subset of the current
spec.
  I am working on pulling out the api's and expanding the explanations.
The goal is to have this out before the next telecon in two weeks.
  Prototyping is under way, with ut, cray, and ornl committed to working
on this.  Right now supporting infrastructure is being developed.
  Your point on the mpi 2 interfaces is good.  A couple of people had
started to look at this when it looked like this might make it into the
2.2 version.  The changes seemed to be more extensive than expected, so
work stopped.  This does need to be picked up on.

Rich
------Original Message------
From: Josh Hursey
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
ReplyTo: MPI 3.0 Fault Tolerance and Dynamic Process Control working
Group
Sent: Feb 12, 2009 8:31 AM
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward

It is a good point that local communicator reconstruction operations
require a fundamental change in the way communicators are handled by
MPI. With that in mind it would probably take as much effort (if not
more) to implement a virtualized version on top of MPI. So maybe it will
not help as much as I had originally thought. Outside of the paper, do
we have the interface and semantics of these operations described
anywhere? I think that would help in trying to keep pace with the use
cases.

The spirit of the suggestion was as a way to separate what (I think) we
can agree on as a first step (FT-MPI-like model) from the communicator
reconstruction, which I see as a secondary step. If we stop to write up
what the FT-MPI-like model should look like in the standard, then I
think we can push forward on other fronts (prototyping of step 1,
standardization of step 1, application implementations using step 1)
while still trying to figure out how communication reconstruction should
be expressed in the standard such that it is usable in target
applications.

So my motion is that the group explicitly focus effort on writing a
document describing the FT-MPI-like model we consider as a foundation.
Do so in the MPI standard language, and present it to the MPI Forum for
a straw vote in the next couple of meetings. From this document we can
continue evolving it to support more advanced features, like
communicator reconstruction.
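
As a rough illustration of the FT-MPI-like rule as it is stated in this
thread (on a failure, every communicator except COMM_WORLD, COMM_SELF, and
COMM_NULL is destroyed), here is a toy model. It is only a sketch: ToyRuntime
and its methods are hypothetical names, nothing here calls a real MPI
library, and it does not attempt to capture the details of the FT-MPI
implementation.

```python
from typing import Dict, Iterable, List, Set

BASE_COMMUNICATORS = ("COMM_WORLD", "COMM_SELF", "COMM_NULL")


class ToyRuntime:
    """Tracks live ranks and the communicators derived by the application."""

    def __init__(self, world_size: int) -> None:
        self.alive: Set[int] = set(range(world_size))
        self.derived: Dict[str, Set[int]] = {}

    def create(self, name: str, ranks: Iterable[int]) -> None:
        """Create a derived communicator over live ranks only."""
        members = set(ranks)
        if not members <= self.alive:
            raise ValueError("communicator would include failed ranks")
        self.derived[name] = members

    def on_failure(self, rank: int) -> List[str]:
        """Apply the rule: drop all derived communicators, keep the base 3.

        Returns the names of the communicators that were destroyed.
        """
        self.alive.discard(rank)
        destroyed = sorted(self.derived)
        self.derived.clear()
        return destroyed

    def communicators(self) -> List[str]:
        return list(BASE_COMMUNICATORS) + sorted(self.derived)


# Usage: a failure wipes the derived communicators, and the application
# rebuilds only what it still needs.
rt = ToyRuntime(world_size=4)
rt.create("row_comm", [0, 1])
rt.create("col_comm", [0, 2])
destroyed = rt.on_failure(3)
rt.create("row_comm", [0, 1])
```

Note that row_comm is destroyed even though rank 3 was never a member of it.
That bluntness is what makes the rule simple to specify as a first step, and
it is also why communicator reconstruction is the harder second step.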

I am willing to put effort into making such a document. However, I would
like explicit support from the working group in pursuing such an effort,
and the help of anyone interested in helping write up/define this
specification.

So what do people think about taking this first step?

-- Josh


On Feb 11, 2009, at 5:57 PM, Greg Bronevetsky wrote:

> I don't understand what you mean by "We can continue to pursue
> communicator reconstruction interfaces through a virtualization layer
> above MPI."  To me it seems that such interfaces will effectively need
> to implement communicators on top of MPI in order to be operational,
> which will take about as much effort as implementing them inside MPI.
> In particular, I don't see a way to recreate a communicator using the
> MPI interface without making collective calls. However, we're defining
> MPI_Rejoin (or whatever it's called) to be a local operation. This
> means that we cannot use the MPI communicators interface and must
> instead implement our own communicators.
>
> The bottom line is that it does make sense to start implementing 
> support for the FT-MPI model and evolve that to a more elaborate 
> model. However, I don't think that working on the rest above MPI will 
> save us any effort or time.
>
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov
>
> At 01:17 PM 2/11/2009, Josh Hursey wrote:
>> In our meeting yesterday, I was sitting in the back trying to take in
>> the complexity of communicator recreation. It seems that much of the
>> confusion at the moment is that we (at least I) are still not exactly
>> sure how the interface should be defined and implemented.
>>
>> I think of the process fault tolerance specification as a series of
>> steps that can be individually specified building upon each step
>> while working towards a specific goal set. From this I was asking
>> myself, are there any foundational concepts that we can define now so
>> that folks can start implementation?
>>
>> That being said I suggest that we consider FT-MPI's model, in which all
>> communicators except the base 3 (COMM_WORLD, COMM_SELF, COMM_NULL)
>> are destroyed on a failure, as the starting point for implementation.
>> This would get us started. We can continue to pursue communicator
>> reconstruction interfaces through a virtualization layer above MPI. We
>> can use this layer to experiment with the communicator recreation
>> mechanisms in conjunction with applications while pursuing the first
>> step implementation. Once we start to agree on the interface for
>> communicator reconstruction, then we can start to push it into the
>> MPI standard/library for a better standard/implementation.
>>
>> The communicator virtualization library is a staging area for these
>> interface ideas that we seem to be struggling with. The
>> virtualization

------Original Message Truncated------



------------------------------

Message: 3
Date: Thu, 12 Feb 2009 14:16:09 -0500
From: George Bosilca <bosilca at eecs.utk.edu>
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
	Group"	<mpi3-ft at lists.mpi-forum.org>
Message-ID: <CD306C06-5662-4060-9E95-255852FE7BB1 at eecs.utk.edu>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

I don't necessarily agree with the statement that FT-MPI is a subset of
the current spec. As the current spec can be implemented on top of
FT-MPI (with help from the PMPI interface), this tends to prove the
opposite.

However, I agree there are several features in the current spec that
were not covered by the FT-MPI spec, but these features can be
implemented on top of FT-MPI. As far as I understood, this is what Josh
proposed, as this will give a quick start (i.e. FT-MPI implementation is
already available).

   george.




------------------------------

Message: 4
Date: Thu, 12 Feb 2009 12:16:23 -0700
From: Nathan DeBardeleben <ndebard at lanl.gov>
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working
	Group"	<mpi3-ft at lists.mpi-forum.org>
Message-ID: <49947587.6070604 at lanl.gov>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

I really worry about taking the advice of users saying they would rather
terminate and restart an application than having some assistance to help
them ride through a problem.  If they are worried about programming
language/model changes, I would encourage them to open their eyes.  
Major programming model changes are predicted for > petascale computers
and even petascale computers are having a hard time with classical MPI
programming.  I think we're more likely to see MPI as an underpinning of
next-gen models.  These users polled might not be extreme-scale users,
however. 

Working at a laboratory positioning itself for exascale, we are
intimately aware of the fact that "oh just rerun it" is a worthless
conclusion.  I wish I had more time to assist in this matter but our
laboratory has cracked down on participation in things that are not
directly associated with charge codes so it's a bit hard for me to spend
any sizable amount of time.

Please though, consider the user base when they say things like that.  
I'm sure Rich is well aware of these similar concerns.  While MPI fault
tolerance might not be important to the users running 1000 node systems,
those of us approaching system mean time to interrupt under an hour are
quite on the opposite side of that spectrum.

Are the small-system users pushing for FT to not be inside of MPI?  This
is why I was so in favor of some sort of componentized MPI where users
could exclude FT if they weren't worried about reliability (and thereby
gain performance) but those of us who were in more dangerous reliability
regimes could take the performance penalty and compile in / load in /
configure in / whatever FT.

-- Nathan

---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
High Performance Computing Systems Integration (HPC-5)
phone: 505-667-3428
email: ndebard at lanl.gov
--------------------------------------------------------------------- 





------------------------------

_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft


End of mpi3-ft Digest, Vol 13, Issue 4
**************************************



