[Mpi3-ft] Communicator Virtualization as a step forward
Nathan DeBardeleben
ndebard at lanl.gov
Thu Feb 12 13:46:29 CST 2009
Heh, OK. :) Oh well, maybe my statements can at least provide some
motivation for users who are interested in FT :).
Sorry to pollute the conversation.
-- Nathan
---------------------------------------------------------------------
Nathan DeBardeleben, Ph.D.
Los Alamos National Laboratory
High Performance Computing Systems Integration (HPC-5)
phone: 505-667-3428
email: ndebard at lanl.gov
---------------------------------------------------------------------
Greg Bronevetsky wrote:
> I don't think that the users are suggesting that they don't want FT
> support. It sounds like they just don't value having the ability to
> reset the state of the MPI library without having to restart the
> applications. Since job schedulers can start up a large-scale
> application fairly quickly and they already use global checkpointing,
> I'm not surprised that they don't really care about this. In any case,
> our plans for the FT spec will allow for more capability than the
> FT-MPI spec. FT-MPI is tilted towards global synchronous recovery
> solutions, which will have scalability problems since every process
> must participate in recovery. Our goal with the FT specification is to
> allow localized recovery as well.
>
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov
>
> At 11:16 AM 2/12/2009, Nathan DeBardeleben wrote:
>> I really worry about taking the advice of users who say they would
>> rather terminate and restart an application than have some
>> assistance to help them ride through a problem. If they are worried
>> about programming language/model changes, I would encourage them to
>> open their eyes.
>> Major programming model changes are predicted for beyond-petascale
>> computers, and even petascale computers are having a hard time with
>> classical MPI programming. I think we're more likely to see MPI as
>> an underpinning of next-gen models. The users polled might not be
>> extreme-scale users, however.
>> Working at a laboratory positioning itself for exascale, we are
>> intimately aware that "oh, just rerun it" is a worthless conclusion.
>> I wish I had more time to assist in this matter, but our laboratory
>> has cracked down on participation in things that are not directly
>> associated with charge codes, so it is hard for me to spend any
>> sizable amount of time on it.
>>
>> Please, though, consider the user base when they say things like
>> that. I'm sure Rich is well aware of these concerns. While MPI fault
>> tolerance might not be important to users running 1000-node systems,
>> those of us approaching a system mean time to interrupt of under an
>> hour are at quite the opposite end of that spectrum.
>>
>> Are the small-system users pushing for FT not to be inside MPI at
>> all? This is why I was so in favor of some sort of componentized MPI:
>> users could exclude FT if they weren't worried about reliability (and
>> thereby gain performance), while those of us in more dangerous
>> reliability regimes could take the performance penalty and compile
>> in / load in / configure in / whatever the FT support.
>>
>> -- Nathan
>>
>> ---------------------------------------------------------------------
>> Nathan DeBardeleben, Ph.D.
>> Los Alamos National Laboratory
>> High Performance Computing Systems Integration (HPC-5)
>> phone: 505-667-3428
>> email: ndebard at lanl.gov
>> ---------------------------------------------------------------------
>>
>>
>> Graham, Richard L. wrote:
>>> Josh,
>>> Very early on in the process we got feedback from users that an
>>> FT-MPI-like interface was of no interest to them. They would just
>>> as soon terminate the application and restart rather than use this
>>> sort of approach. Having said that, there has already been
>>> demonstration that the FT-MPI approach is useful for some
>>> applications. If you look closely at the spec, the FT-MPI approach
>>> is a subset of the current spec.
>>> I am working on pulling out the APIs and expanding the
>>> explanations. The goal is to have this out before the next telecon
>>> in two weeks.
>>> Prototyping is under way, with UT, Cray, and ORNL committed to
>>> working on this. Right now the supporting infrastructure is being
>>> developed.
>>> Your point on the MPI-2 interfaces is good. A couple of people
>>> had started to look at this when it looked like it might make it
>>> into the 2.2 version. The changes seemed to be more extensive than
>>> expected, so work stopped. This does need to be picked up again.
>>>
>>> Rich
>>> ------Original Message------
>>> From: Josh Hursey
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>> ReplyTo: MPI 3.0 Fault Tolerance and Dynamic Process Control working
>>> Group
>>> Sent: Feb 12, 2009 8:31 AM
>>> Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>>>
>>> It is a good point that local communicator reconstruction operations
>>> require a fundamental change in the way communicators are handled by
>>> MPI. With that in mind it would probably take as much effort (if not
>>> more) to implement a virtualized version on top of MPI. So maybe it
>>> will not help as much as I had originally thought. Outside of the
>>> paper, do we have the interface and semantics of these operations
>>> described anywhere? I think that would help in trying to keep pace
>>> with the use cases.
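
Since "communicator virtualization" is doing a lot of work in this
thread, here is a minimal sketch in C of what such a layer above MPI
might look like. All names here (virt_comm_t, virt_comm_rebind,
virt_bcast) are hypothetical, invented for illustration, and are not
part of any proposal; the point is only that applications would hold an
opaque handle whose underlying MPI_Comm the library can swap out after
a recovery.

    #include <mpi.h>

    typedef struct {
        MPI_Comm real;    /* current underlying MPI communicator */
        int      epoch;   /* bumped each time the handle is re-bound */
    } virt_comm_t;

    /* After recovery, the library re-binds the handle to a fresh
     * communicator; the application's handle never changes. */
    static void virt_comm_rebind(virt_comm_t *vc, MPI_Comm fresh)
    {
        vc->real = fresh;
        vc->epoch++;
    }

    /* Every MPI call the application makes goes through a thin wrapper
     * that dereferences the handle at call time. */
    static int virt_bcast(void *buf, int count, MPI_Datatype type,
                          int root, virt_comm_t *vc)
    {
        return MPI_Bcast(buf, count, type, root, vc->real);
    }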
>>>
>>> The spirit of the suggestion was to separate what (I think) we can
>>> agree on as a first step (an FT-MPI-like model) from the
>>> communicator reconstruction, which I see as a secondary step. If we
>>> stop and write up what the FT-MPI-like model should look like in the
>>> standard, then I think we can push forward on other fronts
>>> (prototyping of step 1, standardization of step 1, application
>>> implementations using step 1) while still trying to figure out how
>>> communicator reconstruction should be expressed in the standard such
>>> that it is usable in target applications.
>>>
>>> So my motion is that the group explicitly focus effort on writing a
>>> document describing the FT-MPI-like model we consider the
>>> foundation, do so in MPI standard language, and present it to the
>>> MPI Forum for a straw vote in the next couple of meetings. From
>>> there we can continue evolving the document to support more advanced
>>> features, like communicator reconstruction.
>>>
>>> I am willing to put effort into making such a document. However, I
>>> would like explicit support from the working group in pursuing such
>>> an effort, and the help of anyone interested in helping to write up
>>> and define this specification.
>>>
>>> So, what do people think about taking this first step?
>>>
>>> -- Josh
>>>
>>>
>>> On Feb 11, 2009, at 5:57 PM, Greg Bronevetsky wrote:
>>>
>>>
>>>> I don't understand what you mean by "We can continue to pursue
>>>> communicator reconstruction interfaces through a virtualization
>>>> layer above MPI." To me it seems that such interfaces will
>>>> effectively need to implement communicators on top of MPI in order
>>>> to be operational, which will take about as much effort as
>>>> implementing them inside MPI. In particular, I don't see a way to
>>>> recreate a communicator using the MPI interface without making
>>>> collective calls. However, we're defining MPI_Rejoin (or whatever
>>>> it's called) to be a local operation. This means that we cannot use
>>>> the MPI communicator interface and must instead implement our own
>>>> communicators.
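
For concreteness, here is a minimal sketch in C of why standard MPI
communicator recreation is inherently collective. The helper name
rebuild_comm_collectively and the 'alive' flag are invented for
illustration; MPI_Rejoin is the hypothetical local operation being
discussed in this thread, not an existing MPI call.

    #include <mpi.h>

    /* Every existing way to (re)create a communicator, MPI_Comm_split
     * included, is collective over the parent communicator.  A purely
     * local rejoin therefore cannot be built directly on top of it. */
    int rebuild_comm_collectively(MPI_Comm parent, int alive,
                                  MPI_Comm *newcomm)
    {
        int rank;
        MPI_Comm_rank(parent, &rank);
        /* Collective over 'parent': it cannot complete until every
         * member of 'parent' makes the matching call, which is exactly
         * the requirement a local MPI_Rejoin could not impose. */
        return MPI_Comm_split(parent,
                              alive ? 0 : MPI_UNDEFINED,
                              rank, newcomm);
    }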
>>>>
>>>> The bottom line is that it does make sense to start implementing
>>>> support for the FT-MPI model and evolve that to a more elaborate
>>>> model. However, I don't think that working on the rest above MPI
>>>> will save us any effort or time.
>>>>
>>>> Greg Bronevetsky
>>>> Post-Doctoral Researcher
>>>> 1028 Building 451
>>>> Lawrence Livermore National Lab
>>>> (925) 424-5756
>>>> bronevetsky1 at llnl.gov
>>>>
>>>> At 01:17 PM 2/11/2009, Josh Hursey wrote:
>>>>
>>>>> In our meeting yesterday, I was sitting in the back trying to take in
>>>>> the complexity of communicator recreation. It seems that much of the
>>>>> confusion at the moment is that we (at least I) are still not exactly
>>>>> sure how the interface should be defined and implemented.
>>>>>
>>>>> I think of the process fault tolerance specification as a series
>>>>> of steps that can be specified individually, each building upon
>>>>> the last, while working towards a specific goal set. From this I
>>>>> was asking myself: are there any foundational concepts that we can
>>>>> define now so that folks can start implementation?
>>>>>
>>>>> That being said, I suggest we take FT-MPI's model, in which all
>>>>> communicators except the base three (COMM_WORLD, COMM_SELF,
>>>>> COMM_NULL) are destroyed on a failure, as the starting point for
>>>>> implementation. This would get us started. We can continue to
>>>>> pursue communicator reconstruction interfaces through a
>>>>> virtualization layer above MPI. We can use this layer to
>>>>> experiment with the communicator recreation mechanisms in
>>>>> conjunction with applications while pursuing the first-step
>>>>> implementation. Once we start to agree on the interface for
>>>>> communicator reconstruction, we can then push it into the MPI
>>>>> standard/library for a better standard/implementation.
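
For concreteness, here is a minimal sketch in C of how an application
might recover under the FT-MPI-like starting point described above. It
is not taken from any proposal in this thread; it assumes that failures
are reported as error return codes (via MPI_ERRORS_RETURN), that
MPI_COMM_WORLD remains usable after the failure (how it is repaired is
exactly what the specification would have to define), and that every
derived communicator must be rebuilt from it.

    #include <mpi.h>

    static MPI_Comm row_comm = MPI_COMM_NULL;  /* a derived communicator */

    /* Re-derive every non-base communicator from MPI_COMM_WORLD.  Under
     * the sketched model, the old handles were destroyed by the failure,
     * so we simply rebuild them. */
    static void rebuild_derived_comms(void)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_split(MPI_COMM_WORLD, rank % 4, rank, &row_comm);
        /* Report failures on the new communicator as return codes. */
        MPI_Comm_set_errhandler(row_comm, MPI_ERRORS_RETURN);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* Ask MPI to return errors instead of aborting the job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
        rebuild_derived_comms();

        int done = 0;
        while (!done) {
            /* Stand-in for a compute/communication phase. */
            int err = MPI_Barrier(row_comm);
            if (err != MPI_SUCCESS) {
                /* A process failure destroyed all non-base
                 * communicators: rebuild them and retry the phase. */
                rebuild_derived_comms();
                continue;
            }
            done = 1;
        }

        MPI_Finalize();
        return 0;
    }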
>>>>>
>>>>> The communicator virtualization library is a staging area for these
>>>>> interface ideas that we seem to be struggling with. The
>>>>> virtualization
>>>>>
>>>
>>> ------Original Message Truncated------
>>>
>>> _______________________________________________
>>> mpi3-ft mailing list
>>> mpi3-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft