[mpi3-ft] FW: Starting slides for 2/1/2008 telecon

Richard Graham rlgraham at ornl.gov
Wed Jan 30 08:38:40 CST 2008




On 1/30/08 4:40 AM, "Thomas Herault" <thomas.herault at lri.fr> wrote:

> 
> 
> On 30 Jan 08, at 09:17, Greg Bronevetsky wrote:
> 
>> > Two comments:
>> >
>> > We may want to add the capability to spawn new processes and give
>> > them the ranks of the failed processes. This is more efficient than
>> > pre-allocating enough spare processes as part of the original job
>> > allocation, so it might be a good idea to include in the spec.
>> >
> 
> A user should be able to spawn new processes and create a new
> communicator with enough processes to circumvent this issue.
> 
> I think that the size of MPI_COMM_WORLD should decrease
> when failures occur. Not all fault tolerance solutions need processor
> replacement (e.g. a master/worker approach when a worker fails).
> But MPI_UNIVERSE_SIZE should be allowed to remain constant
> (thus, the system can dynamically provide new processors to
> replace failed ones).
> 
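To make the two positions above concrete, here is a minimal sketch in plain Python. It is a toy model only and makes no MPI calls: the class name `World` and its methods `fail`, `respawn`, and `free_slots` are hypothetical names chosen for illustration, not part of any MPI implementation or proposal. It shows the world size shrinking on failure while the universe size stays constant, and a replacement process later taking over the failed rank.

```python
class World:
    """Toy model of a job: a fixed universe of slots and a set of live ranks.

    Hypothetical illustration only; nothing here is an MPI API.
    """

    def __init__(self, size):
        # Stays constant, mirroring the suggestion for MPI_UNIVERSE_SIZE.
        self.universe_size = size
        # Ranks currently alive, mirroring a shrinking MPI_COMM_WORLD.
        self.live = set(range(size))

    def size(self):
        """Number of live ranks (the modelled communicator size)."""
        return len(self.live)

    def fail(self, rank):
        """Record that a rank has failed; the world shrinks by one."""
        self.live.discard(rank)

    def free_slots(self):
        """Slots the system could still fill with replacement processes."""
        return self.universe_size - len(self.live)

    def respawn(self, rank):
        """Start a replacement that takes over the rank of a failed process."""
        if not 0 <= rank < self.universe_size:
            raise ValueError("rank is outside the universe")
        if rank in self.live:
            raise ValueError("rank is still alive")
        self.live.add(rank)
```

For example, after `w = World(4)` and `w.fail(2)`, `w.size()` is 3 while `w.universe_size` is still 4; a master/worker code could simply continue, and a code that needs the full set could call `w.respawn(2)`.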
>>> >> No disagreements.  As I mentioned earlier, we need to start with the
>>> >> non-trivial amount of work that has already been done in this area.  At
>>> >> this stage I am deliberately staying away from trying to propose specific
>>> >> solutions; we need to first define what we think we can address, and
>>> >> what we will just pass on.
> 
>> > I disagree with the comments about MPI quieting the communication
>> > system because this presumes that the application will use the
>> > trivial sync-and-stop CPR protocol. That may be the case but we
>> > shouldn't write this assumption into the spec. We should probably
>> > restrict ourselves to only saying that no message may get partially
>> > delivered since such messages would be very hard to deal with above
>> > the MPI library.
>> >
> 
> I agree that we should not assume that synchronized CPR will be the
> only approach used. Messages must certainly be kept transactional
> (either completely received or not at all). But what about collective
> communications? As suggested during the meeting, a two-phase commit
> protocol can be enforced to ensure that any collective communication
> either completes or fails on every living processor; however, this may be
> considered too inefficient for the normal case, when failures do not
> occur.
> 
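The two-phase commit idea for collectives can be sketched as a small simulation. This is plain Python with no MPI calls, and the function name `two_phase_collective` and its arguments are hypothetical, chosen only for illustration. It assumes a reliable coordinator, which a real protocol could not take for granted. The point it shows is that every living rank sees the same outcome, at the price of an extra voting round even when nothing fails.

```python
def two_phase_collective(local_ok, alive):
    """Simulate one collective over ranks 0..n-1 with a two-phase commit.

    local_ok[r]: True if rank r finished its local part of the collective.
    alive[r]:    True if rank r is still running.
    Returns a dict mapping each living rank to "commit" or "abort".

    Assumption for this sketch: the coordinator itself never fails.
    """
    if len(local_ok) != len(alive):
        raise ValueError("local_ok and alive must have the same length")

    ranks = range(len(alive))

    # Phase 1 (vote): each rank reports to the coordinator.
    # A dead rank cannot vote, which the coordinator treats as "no".
    votes = [alive[r] and local_ok[r] for r in ranks]

    # Phase 2 (decide): commit only if every rank voted yes, then
    # send the single decision to all living ranks.
    decision = "commit" if all(votes) else "abort"
    return {r: decision for r in ranks if alive[r]}
```

With three ranks where rank 1 has died, `two_phase_collective([True, True, True], [True, False, True])` reports "abort" on ranks 0 and 2, so no survivor believes the collective completed.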
>>> >> No assumption on solution is being made.  At this stage I believe that the
>>> >> primitives listed will cover the current solutions people have deployed.
>>> >> The intent is to provide a set of tools that can be used as needed, i.e.
>>> >> the standard should not be advocating a solution, but enabling solutions
>>> >> that don't involve modifying every MPI implementation (if possible).  If there
>>> >> are missing primitives, please bring them up.  If there are primitives listed
>>> >> that are not needed, please mention these too.
> 
> Rich
> 
> Thomas Herault
> assistant professor
> INRIA/Univ. Paris Sud
> 
>> > Greg Bronevetsky
>> > Post-Doctoral Researcher
>> > 1028 Building 451
>> > Lawrence Livermore National Lab
>> > (925) 424-5756
>> > bronevetsky1 at llnl.gov
>> >
>> > At 10:12 AM 1/29/2008, Richard Graham wrote:
>>> >> This did not seem to make it through the first time, so let me try
>>> >> again.
>>> >>
>>> >> Rich
>>> >>
>>> >> ------ Forwarded Message
>>> >> From: Richard Graham <rlgraham at ornl.gov>
>>> >> Date: Tue, 29 Jan 2008 10:55:11 -0500
>>> >> To: Discussion of MPI 3 Fault Tolerance Support <mpi3-ft at cs.uiuc.edu>
>>> >> Conversation: Starting slides for 2/1/2008 telecon
>>> >> Subject: Starting slides for 2/1/2008 telecon
>>> >>
>>> >> Attached is a set of slides I intend to use as a starting point for the
>>> >> telecon this coming Friday.  If you are planning on attending, please take a
>>> >> look at these, and see what is missing.  The main goal for this call is to
>>> >> help set the scope of the problem for which we intend to propose changes to
>>> >> the MPI standard.
>>> >>
>>> >> Thanks,
>>> >> Rich
>>> >>
>>> >> ------ End of Forwarded Message
>>> >>
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> mpi3-ft mailing list
>>> >> mpi3-ft at cs.uiuc.edu
>>> >> http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft
> 
> 
> 



