[Mpi3-ft] Communicator Virtualization as a step forward

George Bosilca bosilca at eecs.utk.edu
Thu Feb 12 15:33:11 CST 2009


On Feb 12, 2009, at 14:39, Greg Bronevetsky wrote:

> I don't think that the users are suggesting that they don't want FT  
> support. It sounds like they just don't value having the ability to  
> reset the state of the MPI library without having to restart the  
> applications. Since job schedulers can start up a large-scale  
> application fairly quickly and they already use global  
> checkpointing, I'm not surprised that they don't really care about  
> this. In any case, our plans for the FT spec will allow for more  
> capability than the FT-MPI spec. FT-MPI is tilted towards global  
> synchronous recovery solutions, which will have scalability problems  
> since every process must participate in recovery.

Such statements are way too broad to be true. In fact, it depends on  
which recovery mode is used. Please read the document I sent a few  
emails ago to see all the capabilities that FT-MPI provided.

> Our goal with the FT specification is to allow localized recovery as  
> well.

Again, this is not true. First, one will need some kind of database  
(distributed or centralized) to store this information, which comes  
with its own scalability and cost problems. In addition, in the  
context of recovery the new processes will have to retrieve this  
information and let everybody else know not only their new contact  
information but also the fact that they are back in the specified  
communicator. Unfortunately, this is [again] _NOT_ a local operation.  
The fact that you seem to plan to delegate these problems to the  
runtime environment doesn't make it local or more scalable.
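
To make the argument concrete, here is a minimal sketch of the
information flow during such a recovery (every name in it is
hypothetical; this is not a proposed API):

    /* On the recovered process: publish the new contact information
     * to the registry (distributed or centralized), with all the
     * scalability and cost problems such a registry implies. */
    void on_recovered_process(void)
    {
        ft_registry_put(my_global_id, my_new_endpoint);
    }

    /* On EVERY surviving process: fetch the new endpoint and mark
     * the rank live again in each communicator that contains it.
     * This per-survivor step is exactly why the operation is not
     * local. */
    void on_each_survivor(int recovered_id, int recovered_rank)
    {
        endpoint_t ep = ft_registry_get(recovered_id);
        endpoint_table_update(recovered_id, ep);
        comm_mark_alive(the_communicator, recovered_rank);
    }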

   george.


>
>
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov
>
> At 11:16 AM 2/12/2009, Nathan DeBardeleben wrote:
>> I really worry about taking the advice of users saying they would  
>> rather terminate and restart an application than have some  
>> assistance to help them ride through a problem.  If they are  
>> worried about programming language/model changes, I would encourage  
>> them to open their eyes.
>> Major programming model changes are predicted for post-petascale  
>> computers, and even petascale computers are having a hard time with  
>> classical MPI programming.  I think we're more likely to see MPI as  
>> an underpinning of next-gen models.  These users polled might not  
>> be extreme-scale users, however.
>> Working at a laboratory positioning itself for exascale, we are  
>> intimately aware that "oh, just rerun it" is a worthless  
>> conclusion.  I wish I had more time to assist in this matter, but  
>> our laboratory has cracked down on participation in things that are  
>> not directly associated with charge codes, so it's a bit hard for  
>> me to spend any sizable amount of time.
>>
>> Please, though, consider the user base when they say things like that.
>> I'm sure Rich is well aware of these same concerns.  While MPI  
>> fault tolerance might not be important to the users running  
>> 1000-node systems, those of us approaching a system mean time to  
>> interrupt of under an hour are at the opposite end of that spectrum.
>>
>> Are the small-system users pushing for FT not to be inside MPI?  
>> This is why I was so in favor of some sort of componentized MPI,  
>> where users could exclude FT if they weren't worried about  
>> reliability (and thereby gain performance), but those of us in  
>> more dangerous reliability regimes could take the performance  
>> penalty and compile in / load in / configure in / whatever the FT  
>> support.
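>>
>> As a rough sketch of what I mean (purely illustrative; none of
>> these names come from a real implementation), the FT hooks could
>> vanish from the critical path at compile time when reliability is
>> not a concern:
>>
>>     /* Hypothetical componentized-MPI internals. */
>>     typedef struct { int dst; /* ... */ } msg_t;
>>
>>     int transport_send(msg_t *msg);      /* common fast path      */
>>     #ifdef MPI_FT_ENABLED
>>     void ft_log_message(msg_t *msg);     /* e.g., message logging */
>>     int  ft_peer_failed(int rank);
>>     #define FT_ERR_PEER_FAILED (-1)
>>     #endif
>>
>>     int internal_send(msg_t *msg)
>>     {
>>     #ifdef MPI_FT_ENABLED
>>         ft_log_message(msg);             /* FT bookkeeping and    */
>>         if (ft_peer_failed(msg->dst))    /* failure checks only   */
>>             return FT_ERR_PEER_FAILED;   /* when configured in    */
>>     #endif
>>         return transport_send(msg);
>>     }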
>>
>> -- Nathan
>>
>> ---------------------------------------------------------------------
>> Nathan DeBardeleben, Ph.D.
>> Los Alamos National Laboratory
>> High Performance Computing Systems Integration (HPC-5)
>> phone: 505-667-3428
>> email: ndebard at lanl.gov
>> ---------------------------------------------------------------------
>>
>>
>> Graham, Richard L. wrote:
>>> Josh,
>>>  Very early on in the process we got feedback from users that an  
>>> FT-MPI-like interface was of no interest to them.  They would just  
>>> as soon terminate the application and restart rather than use this  
>>> sort of approach.  Having said that, there have already been  
>>> demonstrations that the FT-MPI approach is useful for some  
>>> applications.  If you look closely at the spec, the FT-MPI  
>>> approach is a subset of the current spec.
>>>  I am working on pulling out the APIs and expanding the  
>>> explanations.  The goal is to have this out before the next  
>>> telecon in two weeks.
>>>  Prototyping is underway, with UT, Cray, and ORNL committed to  
>>> working on this.  Right now the supporting infrastructure is being  
>>> developed.
>>>  Your point on the MPI-2 interfaces is good.  A couple of people  
>>> had started to look at this when it looked like it might make it  
>>> into the 2.2 version.  The changes seemed to be more extensive  
>>> than expected, so work stopped.  This does need to be picked up  
>>> again.
>>>
>>> Rich
>>> ------Original Message------
>>> From: Josh Hursey
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control Working  
>>> Group
>>> Reply-To: MPI 3.0 Fault Tolerance and Dynamic Process Control  
>>> Working Group
>>> Sent: Feb 12, 2009 8:31 AM
>>> Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>>>
>>> It is a good point that local communicator reconstruction operations
>>> require a fundamental change in the way communicators are handled by
>>> MPI. With that in mind, it would probably take as much effort (if not
>>> more) to implement a virtualized version on top of MPI, so maybe it
>>> will not help as much as I had originally thought. Outside of the
>>> paper, do we have the interface and semantics of these operations
>>> described anywhere? I think that would help in trying to keep pace
>>> with the use cases.
>>>
>>> The spirit of the suggestion was to separate what (I think)
>>> we can agree on as a first step (an FT-MPI-like model) from the
>>> communicator reconstruction, which I see as a secondary step. If we
>>> stop to write up what the FT-MPI-like model should look like in the
>>> standard, then I think we can push forward on other fronts
>>> (prototyping of step 1, standardization of step 1, application
>>> implementations using step 1) while still trying to figure out how
>>> communicator reconstruction should be expressed in the standard  
>>> such
>>> that it is usable in target applications.
>>>
>>> So my motion is that the group explicitly focus its effort on
>>> writing a document describing the FT-MPI-like model we consider a
>>> foundation. Do so in the MPI standard language, and present it to  
>>> the
>>> MPI Forum for a straw vote in the next couple of meetings. From
>>> there we can continue evolving the document to support more
>>> advanced features, like communicator reconstruction.
>>>
>>> I am willing to put effort into making such a document. However, I
>>> would like explicit support from the working group in pursuing such  
>>> an
>>> effort, and the help of anyone interested in helping write up and
>>> define this specification.
>>>
>>> So what do people think about taking this first step?
>>>
>>> -- Josh
>>>
>>>
>>> On Feb 11, 2009, at 5:57 PM, Greg Bronevetsky wrote:
>>>
>>>
>>>> I don't understand what you mean by "We can continue to pursue
>>>> communicator reconstruction interfaces through a virtualization
>>>> layer above MPI."  To me it seems that such interfaces will
>>>> effectively need to implement communicators on top of MPI in order
>>>> to be operational, which will take about as much effort as
>>>> implementing them inside MPI. In particular, I don't see a way to
>>>> recreate a communicator using the MPI interface without making
>>>> collective calls. However, we're defining MPI_Rejoin (or whatever
>>>> it's called) to be a local operation. This means that we cannot use
>>>> the MPI communicator interface and must instead implement our own
>>>> communicators.
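>>>>
>>>> As a minimal sketch of what that implies (the names here are
>>>> mine, purely illustrative), a communicator layer above MPI would
>>>> have to keep its own rank translation table and route everything
>>>> through the one surviving MPI communicator, reimplementing the
>>>> collectives itself:
>>>>
>>>>     #include <mpi.h>
>>>>
>>>>     /* Hypothetical virtualized communicator. Because rejoin
>>>>      * must be local, the layer cannot call the collective
>>>>      * MPI_Comm_create/MPI_Comm_split; it owns the rank
>>>>      * mapping instead. */
>>>>     typedef struct {
>>>>         int  size;       /* virtual communicator size          */
>>>>         int *world_rank; /* virtual rank -> rank in COMM_WORLD */
>>>>     } virt_comm_t;
>>>>
>>>>     int virt_send(const void *buf, int n, MPI_Datatype t,
>>>>                   int dst, int tag, virt_comm_t *vc)
>>>>     {
>>>>         /* Translate and reuse MPI point-to-point; every
>>>>          * collective would likewise need reimplementing. */
>>>>         return MPI_Send(buf, n, t, vc->world_rank[dst],
>>>>                         tag, MPI_COMM_WORLD);
>>>>     }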
>>>>
>>>> The bottom line is that it does make sense to start implementing
>>>> support for the FT-MPI model and evolve that to a more elaborate
>>>> model. However, I don't think that working on the rest above MPI
>>>> will save us any effort or time.
>>>>
>>>> Greg Bronevetsky
>>>> Post-Doctoral Researcher
>>>> 1028 Building 451
>>>> Lawrence Livermore National Lab
>>>> (925) 424-5756
>>>> bronevetsky1 at llnl.gov
>>>>
>>>> At 01:17 PM 2/11/2009, Josh Hursey wrote:
>>>>
>>>>> In our meeting yesterday, I was sitting in the back trying to  
>>>>> take in
>>>>> the complexity of communicator recreation. It seems that much of  
>>>>> the
>>>>> confusion at the moment is that we (at least I) are still not  
>>>>> exactly
>>>>> sure how the interface should be defined and implemented.
>>>>>
>>>>> I think of the process fault tolerance specification as a series  
>>>>> of
>>>>> steps that can be individually specified, each building upon the
>>>>> last, while working towards a specific goal set. From this I was
>>>>> asking myself: are there any foundational concepts that we can
>>>>> define now so that folks can start implementing?
>>>>>
>>>>> That being said, I suggest we take as the starting point for
>>>>> implementation FT-MPI's model, in which all communicators except
>>>>> the base three (COMM_WORLD, COMM_SELF, COMM_NULL) are destroyed
>>>>> on a failure. This would get us started. We can continue to
>>>>> pursue communicator reconstruction interfaces through a
>>>>> virtualization layer above MPI. We can use this layer to
>>>>> experiment with the communicator recreation mechanisms in
>>>>> conjunction with applications while pursuing the first-step
>>>>> implementation. Once we start to agree on the interface for
>>>>> communicator reconstruction, we can then push it into the MPI
>>>>> standard/library for a better standard/implementation.
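>>>>>
>>>>> As a sketch of what an application would do under this model
>>>>> (FT_ERR_PROC_FAILED is a placeholder, not a proposed constant):
>>>>>
>>>>>     #include <mpi.h>
>>>>>
>>>>>     /* After a failure only COMM_WORLD, COMM_SELF, and
>>>>>      * COMM_NULL survive, so the application rebuilds every
>>>>>      * derived communicator from MPI_COMM_WORLD itself. */
>>>>>     void rebuild_row_comm(int my_row, int my_col, MPI_Comm *row)
>>>>>     {
>>>>>         MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, row);
>>>>>     }
>>>>>
>>>>>     /* Usage sketch: on the placeholder failure code, rebuild
>>>>>      * and roll back to the last checkpoint:
>>>>>      *
>>>>>      *   if (rc == FT_ERR_PROC_FAILED) {
>>>>>      *       rebuild_row_comm(my_row, my_col, &row_comm);
>>>>>      *       restore_from_checkpoint();
>>>>>      *   }
>>>>>      */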
>>>>>
>>>>> The communicator virtualization library is a staging area for  
>>>>> these
>>>>> interface ideas that we seem to be struggling with. The
>>>>> virtualization
>>>>>
>>>
>>> ------Original Message Truncated------
>>>
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft



