[Mpi3-ft] ULFM Slides for Madrid

Richard Graham richardg at mellanox.com
Tue Aug 20 08:22:00 CDT 2013


BTW, what time is the call today?

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
Sent: Tuesday, August 20, 2013 9:13 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group; George Bosilca
Subject: Re: [Mpi3-ft] ULFM Slides for Madrid

On Aug 19, 2013, at 5:48 PM, George Bosilca <bosilca at icl.utk.edu> wrote:


Wesley, all,

Here are a few comments/suggestions on the slides.

Slide 7: There is a mention of "re-enabling wildcard operations". While this is technically true, it is only a side effect of the real operation: acknowledging the local understanding of the failure state. This is the reason why the corresponding function is called MPI_Comm_failure_ack and not MPI_Comm_reenable_any_source.

I've reversed those two bullets and added a few more words to make it clearer that getting the failed processes is the primary purpose:

"Re-enables wildcard operations on a communicator now that the user knows about the failures"



Slide 8: - "as it impacts completion" ? What completion?

New text: "Let the application discover the error as it impacts correct completion of an operation."


         - "in the same communicator" is unclear.

I'm not sure what about this is unclear. If you can suggest some new text that would improve it, I would appreciate that.



Slide 9: I have a few issues with this slide. "How does the application know which request to restart?" Well, if there is anybody who might have the slightest chance of knowing which requests are still needed ... it's the application. Second, I don't see the point of promising a follow-up proposal.

Part of the idea of these slides is to discuss the design rationale. One of the discussions we've had with a number of people is that making revoke a permanent operation is unnecessary. This slide describes why we think it is necessary to have as simple a proposal as possible. If we want more full-featured things, like a temporary revoke state, it's possible to do that, but it needs to happen later in order to not complicate this one.

I've softened the text to say that it "could" come as a follow-on proposal.
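For reference, the recovery pattern that a permanent revoke assumes is short. A minimal sketch, where rc and comm are placeholders for an error code and the affected communicator (proposal names; implementations use an MPIX_ prefix):

    if (rc == MPI_ERR_PROC_FAILED) {
        MPI_Comm new_comm;
        MPI_Comm_revoke(comm);            /* interrupt all pending and
                                             future operations on comm */
        MPI_Comm_shrink(comm, &new_comm); /* build a replacement from
                                             the surviving processes */
        MPI_Comm_free(&comm);             /* the old comm stays revoked */
        comm = new_comm;                  /* continue on the survivors */
    }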



Slide 10: - Shouldn't be "failed processes"?

Yes. Fixed.


          - The need for collective communications is not the only reason to use MPI_Comm_shrink. I would use a more general formulation: "When collective knowledge is necessary...".

It isn't the only reason, but we're not trying to be cryptic in this talk. This is demonstrating a real use case for this function. There are others of course.


          - MPI_Comm_shrink does more than just create a slimmed-down communicator: it validates a global view of all the failed processes in the original communicator across the participating nodes. From my perspective this is more important than creating the new communicator.

You're right. This is one of the things we discussed at the UTK face-to-face that I failed to add to the slides. Shrink can be used to acquire agreed-upon knowledge of global failures at the same cost as a dedicated function that would do this explicitly. I've added the following text (a sketch of the idiom follows the bullets):

* Can also be used to validate knowledge of all failures in a communicator.
  * Shrink the communicator, compare the new group to the old one, free the new communicator (if not needed).
  * Same cost as querying all processes to learn about all failures
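A minimal sketch of that validation idiom, assuming comm is the communicator being checked (error handling elided, proposal names):

    MPI_Comm shrunk;
    MPI_Group old_grp, new_grp, failed_grp;
    int nfailed;

    MPI_Comm_shrink(comm, &shrunk);
    MPI_Comm_group(comm, &old_grp);
    MPI_Comm_group(shrunk, &new_grp);

    /* The difference is an agreed-upon list of failed processes. */
    MPI_Group_difference(old_grp, new_grp, &failed_grp);
    MPI_Group_size(failed_grp, &nfailed);

    MPI_Group_free(&old_grp);
    MPI_Group_free(&new_grp);
    MPI_Group_free(&failed_grp);
    MPI_Comm_free(&shrunk); /* drop it if only the knowledge was wanted */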



Slide 11: I would suggest to change the wording in order to replace "throw away" by "release". The example on the next slide is doing exactly this.

In my mind it (informally) means the same thing, but if we need to be precise on these slides, so be it. I've changed that.



Slide 12: This example is __not__ correct, as using the same pointer as both send and receive buffer in the MPI_Allreduce (use MPI_IN_PLACE instead) is clearly forbidden by the standard.

Fixed. Lazy coding.
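For the record, the in-place form the corrected slide would use looks something like this (the buffer, datatype, and reduction op here are placeholders, not necessarily the slide's actual ones):

    int flag = local_ok; /* local_ok: this rank's local verdict (placeholder) */
    int rc = MPI_Allreduce(MPI_IN_PLACE, &flag, 1, MPI_INT, MPI_LAND, comm);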



Slide 13: I would be careful what you wish for. There are very good reasons why an MPI_Comm_free is a collective call. I would think a little more about this before pushing for a radical requirement.

Of course it should still be a collective call. This is only saying that if everything else is broken, you should still have the option to free the memory associated with the handle. What are some of the downsides for this? The pending operations on the communicator were either already going to fail or should be able to complete (collectives fail, pt2pt complete). The implementation probably needs to be careful about reference counting to make sure that the handle isn't being pulled out from under something that's still using it, but that shouldn't be a big problem.



Slide 16: This example is not correct without an explicit agreement at every level up the stack. There are many ways for it to fail, too many to let it out into the wild.

You're right that this isn't a complete example, but it is there to convey the general idea. If the group thinks it's doing more harm than good by being in the slides, it can go, but library composition is something that we've been asked about many times. Should we trash this and come up with something more extended?
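For what it's worth, the agreement step George is asking for is small. A hedged sketch, where lib_comm, LIB_SUCCESS, and LIB_FAILURE are hypothetical names for the library's communicator and return codes (MPI_Comm_agree per the proposal; implementations use an MPIX_ prefix):

    /* Before the library reports success to its caller, all processes
       agree on a single outcome, so every level of the stack sees a
       consistent view. */
    int ok = (rc == MPI_SUCCESS);
    MPI_Comm_agree(lib_comm, &ok); /* fault-tolerant AND across the group */
    return ok ? LIB_SUCCESS : LIB_FAILURE;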

Another version of the corrected slides is attached.

Thanks,
Wesley




  George.



On Aug 16, 2013, at 23:05, "Sur, Sayantan" <sayantan.sur at intel.com> wrote:


Ah, gotcha.

Sayantan

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
Sent: Friday, August 16, 2013 1:55 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] ULFM Slides for Madrid

I think my slide was unclear. The case I meant was a process failing before the Allreduce; in that case, the Allreduce would always fail. If the failure occurs during the algorithm, as you pointed out, it wouldn't necessarily fail everywhere.
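A minimal sketch of how ranks that need a uniform answer can get one despite that, where val and comm are placeholders (MPI_Comm_agree performs a fault-tolerant AND on its flag; implementations use an MPIX_ prefix):

    int val = 0, ok;
    int rc = MPI_Allreduce(MPI_IN_PLACE, &val, 1, MPI_INT, MPI_SUM, comm);
    ok = (rc == MPI_SUCCESS);
    MPI_Comm_agree(comm, &ok); /* every rank now holds the same verdict */
    if (!ok) {
        /* treat the reduction result as invalid on every rank */
    }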

Thanks,
Wesley

On Friday, August 16, 2013 at 3:51 PM, Sur, Sayantan wrote:
Hi Wesley,

Thanks for sending the slides around. Do the assertion on Slide 6 and the example on Slide 12 that "Allreduce would always fail" (in the case of a failure of one of the participants) hold true?

For example, an MPI implementation might have a terrible allreduce algorithm, where participating ranks send their buffers to a root, which does the reduction. The root then sends the result back to the participants one after the other, and one of these p2p sends fails. In this case, isn't it possible that one rank gets MPI_ERR_PROC_FAILED, whereas the others get MPI_SUCCESS?

Thanks,
Sayantan

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
Sent: Friday, August 16, 2013 10:17 AM
To: MPI3-FT Working Group
Subject: [Mpi3-ft] ULFM Slides for Madrid

I've put together a first draft of some slides that give an overview of ULFM for the forum meeting in Madrid for Rich to present. I think I captured most of the discussion we had on the last call relating to rationale, but if I missed something, feel free to add that to this deck or send me edits.

I think the plan of action, as I understand it from Rich and Geoffroy, is to iterate on these slides until the next call on Tuesday, and then we'll go over them as a group to make sure we're all on the same page. Rich, will you be able to attend the call this week (Tuesday, 3:00 PM EST)? If not, we can adjust it to make sure you can be there.

Just to be clear, the goal of this presentation is to provide an overview of ULFM for the European crowd that can't usually attend the forum meetings. This will probably be a review for many of the people who attend regularly, but there is some new rationale that we haven't included in the past when putting these presentations together. I'd imagine that this meeting will see some confusion from attendees who remember parts of the previous proposals and mix them up, but if we can tell them to do a memory wipe ahead of time, that would help.

Let me know what I've missed.

Thanks,
Wesley

_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
