[Mpi3-ft] ULFM Slides for Madrid
Wesley Bland
wbland at mcs.anl.gov
Wed Aug 21 08:49:24 CDT 2013
Here's the latest version of the slides from yesterday's call. This now includes:
* Misc. small corrections throughout
* [8] An alternate slide for failure notification if we decide to combine failure_ack & get_acked to reduce confusion.
* [10] Reworked rationale for MPI_Comm_revoke to reduce confusion about a possible (though not used) alternative. I agree that this slide is still weak. It may be good enough to just say that we can implement this functionality on top of MPI and leave it at that. Then we could just remove this slide entirely.
* [13] Corrected code describing how to validate a communicator after calling a creation function (see the sketch right after this list)
* [16] New slide describing one-sided semantics including the state of memory after a failure.
* [17] New slide describing how to handle passive target locks after a failure.
* [18] New slide describing file I/O semantics.
* [20-27] New slides with an extended library construction example using ScaLAPACK.
* [28] New slide describing implementation status
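
To give a rough idea of that validation pattern without opening the deck, here is a minimal sketch. This is not the slide text verbatim: it assumes the proposed agreement call MPI_Comm_agree, uses MPI_Comm_split as the creation function, and takes comm, color, and key as already set up.

    MPI_Comm newcomm;
    int rc = MPI_Comm_split(comm, color, key, &newcomm);
    int ok = (rc == MPI_SUCCESS);
    MPI_Comm_agree(comm, &ok);        /* did every surviving process get a usable newcomm? */
    if (!ok) {
        if (rc == MPI_SUCCESS)
            MPI_Comm_free(&newcomm);  /* a peer failed to create it; release ours and retry/recover */
    }
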
More comments are always welcome.
Thanks,
Wesley
On Aug 20, 2013, at 10:35 AM, Wesley Bland <wbland at mcs.anl.gov> wrote:
> <2013-09 MPI Forum ULFM.pptx>
>
> On Aug 20, 2013, at 10:17 AM, George Bosilca <bosilca at icl.utk.edu> wrote:
>
>>
>> On Aug 20, 2013, at 15:12 , Wesley Bland <wbland at mcs.anl.gov> wrote:
>>
>>> On Aug 19, 2013, at 5:48 PM, George Bosilca <bosilca at icl.utk.edu> wrote:
>>>
>>>> Wesley, all,
>>>>
>>>> Here are a few comments/suggestions on the slides.
>>>>
>>>> Slide 7: There is a mention of "re-enabling wildcard operations". While this is technically true, it is only a side effect of the real operation, acknowledging the local understanding of the failure state. This is the reason why the corresponding function is called MPI_Comm_failure_ack and not MPI_Comm_reenable_any_source.
>>>
>>> I've reversed those two bullets and added a few more words to make it more clear that getting the failed processes is the primary purpose:
>>>
>>> "Re-enables wildcard operations on a communicator now that the user knows about the failures"
>>>
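>>> For clarity, here is a minimal sketch of the intended sequence using the proposed calls (comm, buf, and tag are placeholders, not slide text):
>>>
>>>     MPI_Group failed;
>>>     MPI_Request req;
>>>     int buf[4], tag = 0;
>>>
>>>     MPI_Comm_failure_ack(comm);                 /* acknowledge the failures observed so far */
>>>     MPI_Comm_failure_get_acked(comm, &failed);  /* group of processes acknowledged as failed */
>>>     /* wildcard receives on comm are re-enabled now that the failures are acknowledged */
>>>     MPI_Irecv(buf, 4, MPI_INT, MPI_ANY_SOURCE, tag, comm, &req);
>>>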
>>>>
>>>> Slide 8: - "as it impacts completion" ? What completion?
>>>
>>> New text: "Let the application discover the error as it impacts correct completion of an operation."
>>>
>>>> - "in the same communicator" is unclear.
>>>
>>> I'm not sure what about this is unclear. If you can suggest some new text that would improve it, I would appreciate that.
>>
>> Use the comm communicator. The "same" part of the sentence was unclear to me: same as what?
>
> Got it. That's been fixed.
>
>>
>>>> Slide 9: I have a few issues with this slide. "How does the application know which request to restart?" Well, if there is anybody who might have the slightest chance of knowing which requests are still needed … it's the application. Second, I don't see the point of promising a follow-up proposal.
>>>
>>> Part of the idea of these slides is to discuss the design rationale. One of the discussions we've had with a number of people is that making revoke a permanent operation is unnecessary.
>>
>> This was one of the weirdest things we did in FT-MPI: not making the revoke a permanent operation. After a few faults, the outcome of your collective operations is absolutely horrible to understand, and their implementations are hard as hell to validate. Not worth the effort from my perspective.
>>
>>> This slide describes why we think it is necessary to have as simple a proposal as possible. If we want more full-featured things, like a temporary revoke state, it's possible to do that, but it needs to happen later in order to not complicate this one.
>>>
>>> I've softened the text to say that it "could" come as a follow-on proposal.
>>
>> OK.
>>
>>>>
>>>> Slide 10: - Shouldn't be "failed processes"?
>>>
>>> Yes. Fixed.
>>>
>>>> - The need for collective communications is not the only reason to use MPI_Comm_shrink. I would use a more general formulation: "When collective knowledge is necessary…".
>>>
>>> It isn't the only reason, but we're not trying to be cryptic in this talk. This is demonstrating a real use case for this function. There are others of course.
>>
>> A demonstration of a real use case should be indicated as such on the slide. All the slides are general; I would have expected this one to follow the same trend.
>>
>>>
>>>> - The MPI_Comm_shrink is doing more than just creating a slimmed-down communicator. It validates a global view of all the failed processes in the original communicator on the participating nodes. From my perspective this is more important than creating the new communicator.
>>>
>>> You're right. This is one of the things we discussed at the UTK face-to-face that I failed to add to the slides. Shrink can be used to acquire global knowledge of failures at the same cost as a dedicated function that did so explicitly. I've added the following text:
>>>
>>> * Can also be used to validate knowledge of all failures in a communicator.
>>> * Shrink the communicator, compare the new group to the old one, free the new communicator (if not needed).
>>> * Same cost as querying all processes to learn about all failures
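>>>
>>> Concretely, a minimal sketch of that pattern (the names are placeholders, not slide text):
>>>
>>>     MPI_Comm newcomm;
>>>     MPI_Group old_grp, new_grp, failed_grp;
>>>
>>>     MPI_Comm_shrink(comm, &newcomm);                      /* agrees on the failed set and excludes it */
>>>     MPI_Comm_group(comm, &old_grp);
>>>     MPI_Comm_group(newcomm, &new_grp);
>>>     MPI_Group_difference(old_grp, new_grp, &failed_grp);  /* everyone missing from newcomm has failed */
>>>     MPI_Group_free(&old_grp);
>>>     MPI_Group_free(&new_grp);
>>>     MPI_Comm_free(&newcomm);                              /* release it if the shrunken comm isn't needed */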
>>
>> +1
>>
>>>> Slide 11: I would suggest changing the wording to replace "throw away" with "release". The example on the next slide is doing exactly this.
>>>
>>> In my mind it (informally) means the same thing, but if we need to be precise on these slides, so be it. I've changed that.
>>>
>>>>
>>>> Slide 12: This example is __not__ correct: using the same pointer as the send and receive buffer in MPI_Allreduce is clearly forbidden by the standard (use MPI_IN_PLACE instead).
>>>
>>> Fixed. Lazy coding.
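>>>
>>> For the record, the corrected call is along these lines (buf and count stand in for whatever the slide uses):
>>>
>>>     rc = MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_INT, MPI_SUM, comm);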
>>
>> My gosh!
>>
>>>> Slide 13: I would be careful what you wish for. There are very good reasons why MPI_Comm_free is a collective call. I would think a little more about this before pushing for a radical requirement.
>>>
>>> Of course it should still be a collective call. This is only saying that if everything else is broken, you should still have the option to free the memory associated with the handle. What are some of the downsides of this? The pending operations on the communicator were either already going to fail or should be able to complete (collectives fail, pt2pt completes). The implementation probably needs to be careful about reference counting to make sure that the handle isn't being pulled out from under something that's still using it, but that shouldn't be a big problem.
>>
>> I was more intrigued by the management of the comm_id in this case. There are ways to ensure this is correctly handled, but they need some thinking.
>
> Yes. The implementation would probably have to not reuse that comm_id since there'd be no way to ensure that it's available on all processes. That's an implementation detail that is solvable though.
>
>>
>>>> Slide 16: This example is not correct without an explicit agreement at every level up the stack. There are many ways for it to fail, too many to let it out into the wild.
>>>
>>> You're right that this isn't a complete example, but it is there to convey the general idea. If the group thinks it's doing more harm than good by being in the slides, it can go, but library composition is something that we've been asked about many times. Should we trash this and come up with something more extended?
>>
>> Then make it clear in the slide that this is __not__ a correct sample, but just a high-level overview. The slide title says "Example", and I bet most people will take it as one without seeing all the pitfalls.
>
> Done
>
>>
>> George.
>>
>>>
>>> Another version of the corrected slides is attached.
>>>
>>> Thanks,
>>> Wesley
>>>
>>> <2013-09 MPI Forum ULFM.pptx>
>>>
>>>>
>>>> George.
>>>>
>>>>
>>>>
>>>> On Aug 16, 2013, at 23:05 , "Sur, Sayantan" <sayantan.sur at intel.com> wrote:
>>>>
>>>>> Ah, gotcha.
>>>>>
>>>>> Sayantan
>>>>>
>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
>>>>> Sent: Friday, August 16, 2013 1:55 PM
>>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>> Subject: Re: [Mpi3-ft] ULFM Slides for Madrid
>>>>>
>>>>> I think my slide was unclear. The case I meant was if a process failed before the Allreduce. In that case, the Allreduce would always fail. If the failure occurs during the algorithm, as you pointed out, it wouldn't necessarily fail everywhere.
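>>>>>
>>>>> That non-uniformity is exactly where the proposed MPI_Comm_revoke comes in: a rank that does see the error can force everyone off the communicator. A rough sketch (buf and count are placeholders):
>>>>>
>>>>>     int rc = MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_INT, MPI_SUM, comm);
>>>>>     if (rc != MPI_SUCCESS)
>>>>>         MPI_Comm_revoke(comm);  /* ensure every rank eventually sees the failure on comm */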
>>>>>
>>>>> Thanks,
>>>>> Wesley
>>>>> On Friday, August 16, 2013 at 3:51 PM, Sur, Sayantan wrote:
>>>>>
>>>>> Hi Wesley,
>>>>>
>>>>> Thanks for sending the slides around. Does the assertion on Slide 6 and example on Slide 12 that “Allreduce would always fail” (in the case of failure of one of the participants) hold true?
>>>>>
>>>>> For example, an MPI implementation might have a terrible implementation of allreduce, where participating ranks send their buffer to a root, which does the reduction. The root then sends the results back to the participants one after the other. One of these p2p sends then fails. In this case, isn’t it possible that one rank gets MPI_ERR_PROC_FAILED, whereas the others get MPI_SUCCESS?
>>>>>
>>>>> Thanks,
>>>>> Sayantan
>>>>>
>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Wesley Bland
>>>>> Sent: Friday, August 16, 2013 10:17 AM
>>>>> To: MPI3-FT Working Group
>>>>> Subject: [Mpi3-ft] ULFM Slides for Madrid
>>>>>
>>>>> I've put together a first draft of some slides that give an overview of ULFM for the forum meeting in Madrid for Rich to present. I think I captured most of the discussion we had on the last call relating to rationale, but if I missed something, feel free to add that to this deck or send me edits.
>>>>>
>>>>> I think the plan of action, as I understand it from Rich and Geoffroy, is to iterate on these slides until the next call on Tuesday and then we'll go over them as a group to make sure we're all on the same page. Rich, will you be able to attend the call this week (Tuesday, 3:00 PM EST)? If not, we can adjust the time to make sure you can be there.
>>>>>
>>>>> Just to be clear, the goal of this presentation is to provide an overview of ULFM for the European crowd that can't usually attend the forum meetings. This will probably be a review for many of the people who attend regularly, but there is some new rationale that we haven't included in the past when we've been putting these presentations together. I'd imagine that this meeting will have some confusion from attendees who might remember parts of the previous proposals and mix them up, but if we can tell them to do a memory wipe ahead of time, that would help.
>>>>>
>>>>> Let me know what I've missed.
>>>>>
>>>>> Thanks,
>>>>> Wesley
>>>
>>
>