[Mpi3-ft] ULFM Slides for Madrid

Wesley Bland wbland at mcs.anl.gov
Mon Aug 19 13:50:05 CDT 2013


On Aug 19, 2013, at 12:57 PM, Jim Dinan <james.dinan at gmail.com> wrote:

> Hi Wesley,
> 
> Overall, this is going in a great direction.  Thanks for all of the effort you and others have been putting into the presentation.
> 
> Slide 4: "Only requirement is that failure are eventually reported to all processes which communicate with the failed process."  I am wondering if this statement, to be precise, applies only to point-to-point communication?  Is it possible for me to never find out about a failure of another process with whom I make collective calls?

I've updated the slide to say this:

The only requirement is that failures are eventually reported if they prevent the correct completion of another operation.

The intent of the text is that whenever the failure of a process prevents an operation from completing correctly from my point of view, the implementation must tell me about it. That means that if I'm doing pt2pt with the dead process, I obviously need to know about it, because there is no way to get the right answer (unless the message was sent before the process died). If I'm doing a collective that can complete correctly without the failed process, for instance a broadcast where one of the leaves failed, then I don't need to know about the failure. Basically, as long as all of the buffers are correct, we don't need to report an error.
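To make the distinction concrete, here is a minimal sketch in C, assuming an implementation with ULFM support; the MPIX_ERR_PROC_FAILED error class comes from the ULFM proposal, so the exact name may vary by implementation:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    /* Return errors instead of aborting so failures can be observed. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int rank, buf = 42;
    MPI_Comm_rank(comm, &rank);

    /* A broadcast may still complete correctly at the root and at the
     * surviving processes even if a leaf of the broadcast tree has
     * failed; in that case no error needs to be raised. */
    MPI_Bcast(&buf, 1, MPI_INT, 0, comm);

    /* Point-to-point with a dead peer cannot complete correctly
     * (unless the message was already sent), so the failure must
     * eventually be reported here. */
    if (rank == 0) {
        int rc = MPI_Recv(&buf, 1, MPI_INT, 1, 0, comm,
                          MPI_STATUS_IGNORE);
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED)
            fprintf(stderr, "rank 1 failed: no correct completion\n");
    }

    MPI_Finalize();
    return 0;
}
```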

> 
> Slide 5: I haven't heard the term "algorithm completion" before.  Would it be better to call this something like fault tolerant consensus?

That makes more sense. Updated accordingly.

> 
> Slide 11: You should point out that we have no way to free a communicator, if it is not valid at all correct processes.  Processes that noticed the failure wouldn't be able to call revoke, since they don't have a valid handle to the communicator.

I hadn't pointed out anywhere that we've modified the semantics of MPI_Comm_free to allow the implementation to free local resources when it's no longer possible to carry out the full semantics of MPI_Comm_free. Put another way, if your communicator has been broken by failures so that collectives no longer work, you can still call MPI_Comm_free to release all of the local resources and prevent a memory leak. I've added a slide to clarify this.
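A minimal sketch of what this allows, again assuming a ULFM-capable implementation (MPIX_Comm_revoke and the MPIX_ERR_* error classes are names from the ULFM proposal):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(comm);
    int eclass;
    MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
        /* Collectives on comm no longer work, but MPI_Comm_free may
         * still be called: the implementation is allowed to skip the
         * parts of its semantics that are no longer possible and
         * simply release local resources, avoiding a memory leak. */
        MPIX_Comm_revoke(comm);  /* make sure no rank keeps using it */
        MPI_Comm_free(&comm);
    } else {
        MPI_Comm_free(&comm);
    }

    MPI_Finalize();
    return 0;
}
```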

(I don't see this text anywhere in the document and I'm not sure where it went. We need to fix this.)

> 
> Slide 12: Should you clean up the communicator somehow, if the creation failed?  Do you revoke or free it?  Is this semantic defined -- seems like the same issue as on slide 11?

Fixed on the slide, per the text above.

Thanks,
Wesley

> 
>  ~Jim.
> 
> 
> On Fri, Aug 16, 2013 at 1:17 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
> I've put together a first draft of some slides that give an overview of ULFM for the forum meeting in Madrid for Rich to present. I think I captured most of the discussion we had on the last call relating to rationale, but if I missed something, feel free to add that to this deck or send me edits.
> 
> I think the plan of action, as I understand it from Rich and Geoffroy, is to iterate on these slides until the next call on Tuesday and then we'll go over them as a group to make sure we're all on the same page. Rich, will you be able to attend the call this week (Tuesday, 3:00 PM EST)? If not, we can adjust it this week to make sure you can be there.
> 
> Just to be clear, the goal of this presentation is to provide an overview of ULFM for the European crowd that can't usually attend the forum meetings. This will probably be a review for many of the people who attend regularly, but there is some new rationale that we haven't included in the presentations we've put together in the past. I imagine some attendees will be confused because they remember parts of the previous proposals and mix them together, but if we can tell them to do a memory wipe ahead of time, that would help.
> 
> Let me know what I've missed.
> 
> Thanks,
> Wesley
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 

