[Mpi3-ft] Con Call on 1/4/2009
Greg Bronevetsky
bronevetsky1 at llnl.gov
Wed Jan 21 10:05:07 CST 2009
Actually, something came up and I won't be able to make this con call
either. However, I hope that we can have a good discussion over
email. While I did not document the protocol in full algorithmic
detail, I can point people to some of my prior papers, such as
http://greg.bronevetsky.com/papers/2003PPoPP.pdf, as evidence that
systems like this, which work above MPI, have been implemented
successfully in the past.
Josh, you make a good point that relying on heavy-weight support in
order to make the API useful suggests that the API is insufficient.
My point in drafting my solution is that the only component required
to make the API useful is one that does not need additional API
modifications and thus does not need to be standardized. Of course,
we could put the appropriate components into the official API, but
that would make it weaker rather than stronger. I can point to the
ABI debate as an analogy for what I'm suggesting here. Option 1 is to
standardize the ABI. Option 2 is to develop a heavy-weight support
tool like MorphMPI that makes the current API more useful (i.e.
resolves most of the issues that motivate the ABI) and does not
require any additional standardization. I think that most people
agree that Option 2 is the better choice.
The point here is that Fault Tolerance is hard and it requires a lot
of runtime support to make application-implemented fault tolerance
reasonably easy. We've identified much of the critical functionality
that must be inside MPI for applications to benefit from such support.
We are now also identifying additional libraries that may be necessary
to make the API usable but that do not need to be standardized. In
particular, we will probably come up with several such libraries that
support a variety of application types (consider the minimal support
required for Monte Carlo codes or non-modular codes), which is further
evidence that this component, while important, should not be
standardized.
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov
At 07:04 AM 1/21/2009, Josh Hursey wrote:
>I am not going to be able to make it to today's call due to travel.
>
>My primary concern is that the proposal relies a bit too heavily on
>some flavor of checkpointing or message logging in order to make the
>interface useful. There should be a set of guidelines that make the
>interface useful without a form of checkpointing or message logging on
>the system. I think the door should always be open to these types of
>additional functionality, but the base specification should be usable
>without them.
>
>Best,
>Josh
>
>P.S. I should have a revised interface for the following proposal in
>the next week or so:
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Quiescence
>
>On Jan 20, 2009, at 6:54 PM, Greg Bronevetsky wrote:
>
>>Here's my quick writeup of the major problems that we discussed with
>>writing modular apps on top of our proposed MPI fault tolerance spec
>>and an approach for making it relatively easy to write module-
>>specific error recovery algorithms without worrying about other
>>modules. I've attached a pdf version as well as a txt version that
>>will be easier to edit.
>>
>>Greg Bronevetsky
>>Post-Doctoral Researcher
>>1028 Building 451
>>Lawrence Livermore National Lab
>>(925) 424-5756
>>bronevetsky1 at llnl.gov
>>
>>At 06:58 PM 1/13/2009, Richard Graham wrote:
>>>OK, we will resume the calls next week, 1/21/2009.
>>>
>>>Rich
>>>
>>>
>>>On 1/13/09 11:42 AM, "Greg Bronevetsky" <bronevetsky1 at llnl.gov> wrote:
>>>
>>> >
>>> >> Unfortunately, for reasons out of [my] control, I did not manage to
>>> >> get the time to update the wiki and I doubt I will find any time
>>> >> before the call tomorrow. I'll have time to get back to this starting
>>> >> from tomorrow morning.
>>> >>
>>> >> I second your idea to cancel the call tomorrow.
>>> >
>>> > I have a protocol worked out to do micro-rollbacks that will work
>>> > well if we add to the API some kind of asynchronous event
>>> > notification mechanism like active messages. It will not work as well
>>> > without the extension. I'll update George's document once it's posted
>>> > so that we have a unified document that describes the problem and the
>>> > proposed solutions.
>>> >
>>> > Greg Bronevetsky
>>> > Post-Doctoral Researcher
>>> > 1028 Building 451
>>> > Lawrence Livermore National Lab
>>> > (925) 424-5756
>>> > bronevetsky1 at llnl.gov
>>> >
>>> > _______________________________________________
>>> > mpi3-ft mailing list
>>> > mpi3-ft at lists.mpi-forum.org
>>> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>
>><Support for Developing Fault Tolerant Modular MPI Applications.pdf>
>><Support for Developing Fault Tolerant Modular MPI Applications.txt>