[Mpi3-ft] Con Call on 1/21/2009
Richard Graham
rlgraham at ornl.gov
Wed Jan 21 10:14:07 CST 2009
One of the items I do think we need to consider in this context is ways
of letting the different layers coordinate. Right now, as we have
specified the layered recovery mode, we don't provide any ability within
the standard to coordinate between layers (ordering callbacks, and what
else?). We need to think through what one might do along these lines
that is general purpose, and whether or not it adds any real help to the
overall recovery process. I have not had time to think about this, but
am tossing this out 45 minutes before the call :-)
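
To make the idea concrete, here is a minimal, self-contained C sketch of
what layer-ordered recovery callbacks might look like. Everything in it
(the registration function, the priority convention, the simulated
failure) is a hypothetical illustration, not part of any proposal.

/* Hypothetical sketch: each layer registers a recovery callback with a
 * priority; on a (simulated) failure, callbacks run lowest layer first. */
#include <stdio.h>
#include <stdlib.h>

typedef int (*recovery_fn)(int failed_rank, void *state);

struct callback { int priority; recovery_fn fn; void *state; };
static struct callback cbs[16];
static int ncbs = 0;

static void register_recovery_cb(int priority, recovery_fn fn, void *state)
{
    cbs[ncbs].priority = priority;
    cbs[ncbs].fn = fn;
    cbs[ncbs].state = state;
    ncbs++;
}

static int by_priority(const void *a, const void *b)
{
    return ((const struct callback *)a)->priority
         - ((const struct callback *)b)->priority;
}

/* What a library might do on detecting a failure: run the callbacks in
 * ascending priority so lower layers repair themselves first. */
static void dispatch_recovery(int failed_rank)
{
    qsort(cbs, ncbs, sizeof cbs[0], by_priority);
    for (int i = 0; i < ncbs; i++)
        cbs[i].fn(failed_rank, cbs[i].state);
}

static int repair_transport(int failed_rank, void *state)
{
    printf("layer 0: repair transport after failure of rank %d\n",
           failed_rank);
    return 0;
}

static int rebuild_solver(int failed_rank, void *state)
{
    printf("layer 1: rebuild solver state\n");
    return 0;
}

int main(void)
{
    register_recovery_cb(1, rebuild_solver, NULL);   /* upper layer */
    register_recovery_cb(0, repair_transport, NULL); /* lower layer */
    dispatch_recovery(3); /* simulate the failure of rank 3 */
    return 0;
}

Running the sketch prints the layer-0 message before the layer-1
message, which is exactly the ordering guarantee the standard currently
gives layers no way to express.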
Rich
On 1/21/09 11:05 AM, "Greg Bronevetsky" <bronevetsky1 at llnl.gov> wrote:
> Actually, something came up and I won't be able to make this con call
> either. However, I hope that we can have a good discussion over
> email. While I did not document the protocol in full algorithmic
> detail, I can point people to some of my prior papers, such as
> http://greg.bronevetsky.com/papers/2003PPoPP.pdf, as evidence that
> systems like this, which work above MPI, have been successfully
> implemented in the past.
>
> Josh, you make a good point that relying on heavy-weight support in
> order to make the API useful suggests that the API is insufficient.
> My point in drafting my solution is that the only component required
> to make the API useful is one that needs no additional API
> modifications and thus does not need to be standardized. Of course,
> we could put the appropriate components into the official API, but
> that would make it weaker rather than stronger. I can point to the
> ABI debate as an analogy for what I'm suggesting here. Option 1 is to
> standardize the ABI. Option 2 is to develop a heavy-weight support
> tool like MorphMPI that makes the current API more useful (i.e., it
> resolves most of the issues that motivate ABI standardization) and
> does not require any additional standardization. I think most people
> agree that Option 2 is the better choice.
>
> The point here is that fault tolerance is hard, and it requires a lot
> of runtime support to make application-implemented fault tolerance
> reasonably easy. We've identified much of the critical functionality
> that must be inside MPI for applications to benefit from it. We are
> now also identifying additional libraries that may be necessary to
> make the API usable but that do not need to be standardized. In
> particular, we will probably come up with several such libraries that
> support a variety of application types (consider the minimal support
> required for Monte Carlo codes or for non-modular codes), which is
> further evidence that this component, while important, should not be
> standardized.
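
As one concrete reading of the Monte Carlo example above: because
samples are interchangeable, a survivor-only reduction plus rescaling is
essentially all the recovery such a code needs. The sketch below
simulates a lost rank using only standard MPI calls; no FT-specific API
is assumed beyond being able to complete the reduction among survivors.

/* Sketch: Monte Carlo pi estimate that tolerates lost ranks by simply
 * scaling the result to the samples that actually arrived. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Count random points that land inside the unit quarter-circle. */
    srand(rank + 1);
    long hits = 0, trials = 1000000;
    for (long i = 0; i < trials; i++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;
    }

    /* Simulate a failed rank by having it contribute zero samples; the
     * survivors need no coordinated recovery, just the reduction. */
    int alive = (rank != size - 1 || size == 1);
    long my_hits = alive ? hits : 0;
    long my_trials = alive ? trials : 0;

    long tot_hits, tot_trials;
    MPI_Reduce(&my_hits, &tot_hits, 1, MPI_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);
    MPI_Reduce(&my_trials, &tot_trials, 1, MPI_LONG, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %f from %ld samples (losses absorbed by scaling)\n",
               4.0 * tot_hits / tot_trials, tot_trials);
    MPI_Finalize();
    return 0;
}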
>
> Greg Bronevetsky
> Post-Doctoral Researcher
> 1028 Building 451
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky1 at llnl.gov
>
> At 07:04 AM 1/21/2009, Josh Hursey wrote:
>> I am not going to be able to make it to today's call due to travel.
>>
>> My primary concern is that the proposal relies a bit too heavily on
>> some flavor of checkpointing or message logging to make the
>> interface useful. There should be a set of guidelines that make the
>> interface useful without any form of checkpointing or message
>> logging on the system. I think the door should always be open to
>> these types of additional functionality, but the base specification
>> should be usable without them.
>>
>> Best,
>> Josh
>>
>> P.S. I should have a revised interface for the following proposal in
>> the next week or so:
>> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Quiescence
>>
>> On Jan 20, 2009, at 6:54 PM, Greg Bronevetsky wrote:
>>
>>> Here's my quick writeup of the major problems that we discussed with
>>> writing modular apps on top of our proposed MPI fault tolerance
>>> spec, and an approach for making it relatively easy to write
>>> module-specific error recovery algorithms without worrying about
>>> other modules. I've attached a pdf version as well as a txt version
>>> that will be easier to edit.
>>>
>>> Greg Bronevetsky
>>> Post-Doctoral Researcher
>>> 1028 Building 451
>>> Lawrence Livermore National Lab
>>> (925) 424-5756
>>> bronevetsky1 at llnl.gov
>>>
>>> At 06:58 PM 1/13/2009, Richard Graham wrote:
>>>> OK, we will resume the calls next week, 1/21/2009.
>>>>
>>>> Rich
>>>>
>>>>
>>>> On 1/13/09 11:42 AM, "Greg Bronevetsky" <bronevetsky1 at llnl.gov>
>>>> wrote:
>>>>
>>>>>
>>>>>> Unfortunately, for reasons out of [my] control, I did not manage
>>>>>> to get the time to update the wiki, and I doubt I will find any
>>>>>> time before the call tomorrow. I'll have time to get back to this
>>>>>> starting from tomorrow morning.
>>>>>>
>>>>>> I second your idea to cancel the call tomorrow.
>>>>>
>>>>> I have a protocol worked out to do micro-rollbacks that will work
>>>>> well if we add to the API some kind of asynchronous event
>>>>> notification mechanism like active messages. It will not work as
>>>>> well without the extension. I'll update George's document once
>>>>> it's posted so that we have a unified document that describes the
>>>>> problem and the proposed solutions.
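
For readers wondering what "asynchronous event notification like active
messages" might look like in practice, here is a hedged sketch built
only from standard MPI-2 plus pthreads: a listener thread on rank 0
blocks on a reserved tag and fires a handler the moment an event
arrives, which is the kind of hook a micro-rollback protocol could use
to learn of failures without polling. The tag value, handler, and
two-rank scenario are illustrative assumptions, not Greg's protocol.

/* Sketch: a listener thread delivers events asynchronously, in the
 * spirit of active messages. Run with two or more processes. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define EVENT_TAG 999  /* reserved tag for event messages (assumption) */

static void on_event(int payload, int source)
{
    /* A micro-rollback protocol would restore the last consistent
     * state here; this sketch just reports the notification. */
    printf("event %d from rank %d: initiating micro-rollback\n",
           payload, source);
}

static void *listener(void *unused)
{
    int payload;
    MPI_Status st;
    /* Block until an event arrives, then dispatch the handler. */
    MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, EVENT_TAG,
             MPI_COMM_WORLD, &st);
    on_event(payload, st.MPI_SOURCE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    pthread_t tid;

    /* The listener makes MPI calls concurrently with the main thread,
     * so full thread support is required. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        pthread_create(&tid, NULL, listener, NULL);

    if (rank == 1) {            /* rank 1 "detects" a failure and  */
        int event = 42;         /* notifies rank 0 asynchronously  */
        MPI_Send(&event, 1, MPI_INT, 0, EVENT_TAG, MPI_COMM_WORLD);
    }

    if (rank == 0)
        pthread_join(tid, NULL);
    MPI_Finalize();
    return 0;
}

With "mpirun -np 2", rank 0's handler runs as soon as rank 1's event
arrives, regardless of what rank 0's main thread is doing at the time.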
>>>>>
>>>>> Greg Bronevetsky
>>>>> Post-Doctoral Researcher
>>>>> 1028 Building 451
>>>>> Lawrence Livermore National Lab
>>>>> (925) 424-5756
>>>>> bronevetsky1 at llnl.gov
>>>>>
>>>>
>>> <Support for Developing Fault Tolerant Modular MPI Applications.pdf>
>>> <Support for Developing Fault Tolerant Modular MPI Applications.txt>
>>
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft