[Mpi3-ft] Con Call on 1/4/2009

Wed Jan 21 10:05:07 CST 2009

Actually, something came up and I won't be able to make this con call 
either. However, I hope that we can have a good discussion over 
email. While I did not document the protocol in full algorithmic 
detail, I can point people to some of my prior papers such as 
http://greg.bronevetsky.com/papers/2003PPoPP.pdf as evidence that 
systems like this that work above MPI have successfully been 
implemented in the past.

Josh, you make a good point that relying on heavy-weight support in 
order to make the API useful suggests that the API is insufficient. 
My point in drafting my solution is that the only component required 
to make the API useful is one that does not need additional API 
modifications and thus does not need to be standardized. Of course, 
we could put the appropriate components into the official API, but 
that would make it weaker rather than stronger. I can point to the 
ABI debate as an analogy for what I'm suggesting here. Option 1 is to 
standardize the ABI. Option 2 is to develop a heavy-weight support 
tool like MorphMPI that makes the current API more useful (i.e. 
resolves most of the issues that motivate the ABI) and does not 
require any additional standardization. I think that most people 
agree that Option 2 is the better choice.

The point here is that fault tolerance is hard, and it requires a lot 
of runtime support to make application-implemented fault tolerance 
reasonably easy. We've identified much of the critical functionality 
that must be inside MPI in order for such functionality to be enjoyed 
by applications. We are now also identifying additional libraries 
that may be necessary to make the API usable but that do not need to 
be standardized. In particular, we will probably come up with several 
such libraries that support a variety of application types (consider 
the minimal support required for Monte Carlo codes or non-modular 
codes), which is further evidence that this component, while 
important, should not be standardized.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov

At 07:04 AM 1/21/2009, Josh Hursey wrote:
>I am not going to be able to make it to today's call due to travel.
>
>My primary concern is that the proposal relies a bit too heavily on
>some flavor of checkpointing or message logging in order to make the
>interface useful. There should be a set of guidelines that make the
>interface useful without a form of checkpointing or message logging on
>the system. The door should always be open to these types of
>additional functionality, but I think the base specification should
>be usable without them.
>
>Best,
>Josh
>
>P.S. I should have a revised interface for the following proposal in
>the next week or so:
>   https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/Quiescence
>
>On Jan 20, 2009, at 6:54 PM, Greg Bronevetsky wrote:
>
>>Here's my quick writeup of the major problems that we discussed with
>>writing modular apps on top of our proposed MPI fault tolerance spec
>>and an approach for making it relatively easy to write
>>module-specific error recovery algorithms without worrying about other
>>modules. I've attached a pdf version as well as a txt version that
>>will be easier to edit.
>>
>>Greg Bronevetsky
>>Post-Doctoral Researcher
>>1028 Building 451
>>Lawrence Livermore National Lab
>>(925) 424-5756
>>bronevetsky1 at llnl.gov
>>
>>At 06:58 PM 1/13/2009, Richard Graham wrote:
>>>OK, we will resume the calls next week, 1/21/2009.
>>>
>>>Rich
>>>
>>>
>>>On 1/13/09 11:42 AM, "Greg Bronevetsky" <bronevetsky1 at llnl.gov>
>>>wrote:
>>>
>>> >
>>> >> Unfortunately, for reasons out of [my] control, I did not manage to
>>> >> get the time to update the wiki and I doubt I will find any time
>>> >> before the call tomorrow. I'll have time to get back to this starting
>>> >> from tomorrow morning.
>>> >>
>>> >> I second your idea to cancel the call tomorrow.
>>> >
>>> > I have a protocol worked out to do micro-rollbacks that will work
>>> > well if we add to the API some kind of asynchronous event
>>> > notification mechanism like active messages. It will not work as well
>>> > without the extension. I'll update George's document once it's posted
>>> > so that we have a unified document that describes the problem and the
>>> > proposed solutions.
>>> >
>>> > Greg Bronevetsky
>>> > Post-Doctoral Researcher
>>> > 1028 Building 451
>>> > Lawrence Livermore National Lab
>>> > (925) 424-5756
>>> > bronevetsky1 at llnl.gov
>>> >
>>> > _______________________________________________
>>> > mpi3-ft mailing list
>>> > mpi3-ft at lists.mpi-forum.org
>>> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>
>><Support for Developing Fault Tolerant Modular MPI Applications.pdf>
>><Support for Developing Fault Tolerant Modular MPI Applications.txt>