[mpiwg-ft] FTWG Con Call 2013-09-03

Aurélien Bouteiller bouteill at icl.utk.edu
Thu Sep 5 04:58:07 CDT 2013

I cannot read my email regularly. I am in backward Europe now :) 

Anyway, George's statement is correct. I found a convoluted way to use the single call; it works, but it is very ugly and extremely difficult to understand and operate. The split operation is much simpler to use. Additionally, for several use cases (where the user just wants to resume ANY_SOURCE and does not care about the failed group), it can save a significant amount of memory otherwise spent on group storage. I advise we keep it as-is. 
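
To make the difference concrete, here is a rough sketch in plain C of how I read the two semantics. It is a toy model only, not MPI code: toy_comm, NPROCS, and every toy_* function are hypothetical names made up for illustration, and the rule that ANY_SOURCE can resume only once all locally known failures are acknowledged is my reading of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPROCS 8 /* size of the toy communicator (assumption) */

/* Toy stand-in for per-communicator failure state; not a real MPI type. */
typedef struct {
    bool known_failed[NPROCS]; /* failures detected locally so far */
    bool acked[NPROCS];        /* failures the user has acknowledged */
} toy_comm;

/* Record that `rank` is now locally known to have failed. */
void toy_mark_failed(toy_comm *c, int rank) {
    assert(rank >= 0 && rank < NPROCS);
    c->known_failed[rank] = true;
}

/* Models MPI_COMM_FAILURE_ACK: acknowledge every failure known right now.
 * Nothing is returned, so no group has to be built or stored. */
void toy_failure_ack(toy_comm *c) {
    memcpy(c->acked, c->known_failed, sizeof c->acked);
}

/* Models MPI_COMM_FAILURE_GET_ACKED: list only the acknowledged failures.
 * `ranks` must have room for NPROCS entries; returns the count written. */
int toy_failure_get_acked(const toy_comm *c, int *ranks) {
    int n = 0;
    for (int r = 0; r < NPROCS; r++) {
        if (c->acked[r]) {
            ranks[n++] = r;
        }
    }
    return n;
}

/* Models the proposed MPI_COMM_GET_FAILED: acknowledge and list in one step,
 * so the caller always pays for the list even if it is never used. */
int toy_get_failed(toy_comm *c, int *ranks) {
    toy_failure_ack(c);
    return toy_failure_get_acked(c, ranks);
}

/* An ANY_SOURCE receive may proceed only if no known failure is unacknowledged. */
bool toy_any_source_allowed(const toy_comm *c) {
    for (int r = 0; r < NPROCS; r++) {
        if (c->known_failed[r] && !c->acked[r]) {
            return false;
        }
    }
    return true;
}
```

With the split calls, a user who only wants to resume ANY_SOURCE calls toy_failure_ack and never asks for the list. With the single call, acknowledging and listing cannot be separated, which is where the extra care and the extra storage come from.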

I won't be very active in the next 2 weeks. I hope for the best :) 


On Sep 4, 2013, at 21:19, Wesley Bland <wbland at mcs.anl.gov> wrote:

> His email was designed to show that it is possible to use either case reliably. In fact, both cases require this check to ensure that you aren't acknowledging failures that you didn't mean to acknowledge. The only difference is that they're combined into one call. You still need to be careful to ensure that you don't acknowledge failures while another thread is using an ANY_SOURCE. So in the end, they should be equivalent.
> Wesley
> On Sep 3, 2013, at 5:13 PM, George Bosilca <bosilca at icl.utk.edu> wrote:
>> On Sep 3, 2013, at 21:40 , Wesley Bland <wbland at mcs.anl.gov> wrote:
>>> Due to lack of participation (just Manju & I), we cancelled this week's call. We didn't have anything major to discuss today as we have the forum meeting next week and things have mostly been prepped for that meeting. I distributed the most recent version of the slides this morning. I don't think there are any more changes to be made there other than deciding about the failure_ack/get_acked vs. get_failed semantic. Once we make a decision, we need to remove one slide or the other. Which brings me to the other point I was going to discuss today:
>>> Aurelien provided an example use case for the MPI_COMM_GET_FAILED semantic just before the call last week.
>> Reading Aurelien's email I have a different understanding. It seems that he provides an example of how it can be used, not a use case. While the example seems to be correct, it is extremely complex, not something we want to force on users.
>> The current approach with two separate calls ACK + GET_ACK allows for a simpler management of errors while enabling the developers to write different, more flexible, approaches to deal with failures. I fail to see the need to replace such a flexible mechanism with a call that only allows a single usage pattern.
>>   George.
>>> Since we didn't have much time with it, we decided to wait and discuss it this week. I've been looking at it and the idea seems solid to me. It's possible that there might be a semantic issue or two, but the rationale seems good. I propose that we combine MPI_COMM_FAILURE_ACK/GET_ACKED to create a new function MPI_COMM_GET_FAILED. Essentially the new function will do exactly the same thing as the previous functions, just without the separated semantics. The function header would look like this:
>>> int MPI_Comm_get_failed(MPI_Comm comm, MPI_Group *failed_group);
>>> Where ''failed_group'' is the group of processes which is locally known to have failed in ''comm''. Obviously there will need to be some textual changes in the chapter to reflect this change. For the slides, essentially, we just need to take out slide 7 and remove "[Alternative]" from the title of slide 8. I won't bother sending out another version of the slides for this. Rich, can you make these changes in your version?
>>> Other than that, we can talk about future plans on the next call. My participation over the next month or two might be intermittent. My wife and I are expecting our first baby in the next 2-3 weeks so I might miss a couple of calls.
>>> Thanks,
>>> Wesley
>>> On Tue, Sep 3, 2013 at 8:58 AM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>>> Dear WG members,
>>> This is a reminder that according to our planning, we are having our regular phone meeting today at 3pm EDT.
>>> NOTE THE NEW CALL-IN NUMBER. This is a permanent change from the old number.
>>> Date: September 3,
>>> Time: 3pm EDT/New York
>>> Dial-in information: 712-432-0360
>>> Code: 623998#
>>> Agenda:
>>> * Discuss Aurelien's example for MPI_COMM_GET_FAILED
>>> * Plan for Madrid meeting (quick skim of slides, including changes which happened after last call)
>>> * Discuss plan for proposal moving toward December reading
>>> Next Meetings:
>>> * September 17, 2013
>>> * October 1, 2013
>>> * October 15, 2013
>>> * October 29, 2013
>>> _______________________________________________
>>> mpiwg-ft mailing list
>>> mpiwg-ft at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft

* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375
