[mpiwg-ft] A meeting this week

Wesley Bland wbland at mcs.anl.gov
Fri Nov 22 14:15:33 CST 2013


No, it's a correctness thing. You're only required to detect failures for processes you're actively communicating with (in a collective, or when receiving/sending a message). If you're not directly communicating with a process, the implementation isn't required to notify you of its failure (and it would be confusing if it did). In this case, you might need the revoke to notify the other processes manually. An example might be a stencil computation where one process detects a failure and decides to branch to a recovery path: it would need to revoke the communicator before branching so the other processes also find out.
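
For concreteness, here is a rough sketch of that stencil case in C. This is only an illustration: the MPIX_Comm_revoke name and the <mpi-ext.h> header follow the ULFM prototype rather than any ratified text, the MPI_ERR_PROC_FAILED / MPI_ERR_REVOKED classes are from the current proposal, and recover_from_failure() is a placeholder for whatever recovery path the application takes.

    #include <mpi.h>
    #include <mpi-ext.h>   /* MPIX_ extensions as in the ULFM prototype */

    void recover_from_failure(MPI_Comm comm);   /* placeholder, defined elsewhere */

    void stencil_exchange(MPI_Comm comm, double *halo, int left, int right)
    {
        int rc, eclass;

        /* Errors must be returned to the caller, not aborted on. */
        MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

        /* Exchange halos with the two neighbors only. */
        rc = MPI_Sendrecv(&halo[1], 1, MPI_DOUBLE, left,  0,
                          &halo[0], 1, MPI_DOUBLE, right, 0,
                          comm, MPI_STATUS_IGNORE);
        if (rc == MPI_SUCCESS)
            return;

        MPI_Error_class(rc, &eclass);
        if (eclass == MPI_ERR_PROC_FAILED) {
            /* Only the neighbors of the dead process see this error.
             * Revoking the communicator lets every other rank, which is
             * still happily exchanging halos with live neighbors, find
             * out and branch to the recovery path as well. */
            MPIX_Comm_revoke(comm);
            recover_from_failure(comm);
        } else if (eclass == MPI_ERR_REVOKED) {
            /* Another rank detected the failure first and revoked. */
            recover_from_failure(comm);
        }
    }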

Wesley

> On Nov 22, 2013, at 2:02 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> 
> So, what is the argument for having MPI_Comm/win_revoke?  Is it a performance, rather than a correctness argument?
> 
>  ~Jim.
> 
> 
>> On Fri, Nov 22, 2013 at 2:01 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>> It doesn't require an asynchronous failure detector. It does require that you detect failures (in an unspecified way) insofar as they prevent completion. Once you enter the MPI library, you have to use some sort of detector (probably at the runtime level) to keep from getting deadlocked.
>> 
>> Wesley
>> 
>>> On Nov 22, 2013, at 12:53 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>>> 
>>> The latter case requires a failure detector, right?  I had thought the current design would avoid this requirement.
>>> 
>>>  ~Jim.
>>> 
>>> 
>>>> On Thu, Nov 21, 2013 at 12:06 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>>>> No problem about missing the call yesterday; it was a last-minute thing. I think we're in good shape to submit the text on Monday, but we're just doing some final passes over it. There are a few changes that Aurélien will be making, and I'm getting Gail to do an English pass, but overall it's still essentially the same as what we read last year (per the request from the forum).
>>>> 
>>>> There are a couple of ways out of this deadlock. The first is, as you mentioned, to have a function in the library to essentially trigger an error handler manually and let the library figure out what is wrong. This method would work, but it is a bit heavy-handed. The alternative is that the wildcard receive on process X should return an error, because the failure of process Y meets the definition of an "involved process." Process X will get an exception (or an MPI_Errhandler invocation) and can trigger the recovery path.
>>>> 
>>>> Either way should work, but the latter is obviously the preferred and expected solution.
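>>>> 
>>>> For the second path, the library's wildcard receive might look roughly like the sketch below. This is only an illustration: lib_wait_for_work() is a made-up library routine, and the MPI_ERR_PROC_FAILED class name is taken from the current proposal text.
>>>> 
>>>>     #include <mpi.h>
>>>> 
>>>>     /* Inside the library, on its private communicator CL (sketch only). */
>>>>     int lib_wait_for_work(MPI_Comm CL, double *buf, int count, int tag)
>>>>     {
>>>>         int rc, eclass;
>>>> 
>>>>         MPI_Comm_set_errhandler(CL, MPI_ERRORS_RETURN);
>>>>         rc = MPI_Recv(buf, count, MPI_DOUBLE, MPI_ANY_SOURCE, tag, CL,
>>>>                       MPI_STATUS_IGNORE);
>>>>         if (rc == MPI_SUCCESS)
>>>>             return MPI_SUCCESS;
>>>> 
>>>>         MPI_Error_class(rc, &eclass);
>>>>         if (eclass == MPI_ERR_PROC_FAILED) {
>>>>             /* Process Y is a potential sender for this wildcard receive,
>>>>              * so its failure completes the receive in error. Process X is
>>>>              * therefore not stuck waiting for Z to revoke CL, and the
>>>>              * library can run its own recovery or report to the caller. */
>>>>         }
>>>>         return rc;
>>>>     }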
>>>> 
>>>> Thanks,
>>>> Wesley
>>>> 
>>>>> On Nov 21, 2013, at 10:58 AM, Jim Dinan <james.dinan at gmail.com> wrote:
>>>>> 
>>>>> Hi Guys,
>>>>> 
>>>>> Sorry I wasn't able to attend.  I'm back from SC now, if you need me.
>>>>> 
>>>>> I have a concern about the current approach to revoking communicators.  Consider a program that uses a library with a communicator, CL, that is private to the library.  Process X makes a call to this library and performs a wildcard receive on CL.  Process Y fails; Y would have sent a message to X on CL.  Process Z sees that Y failed, but it sees it in the user code, outside of the library.  Process Z cannot call revoke on CL because it does not have any knowledge about how the library is implemented and it does not have a handle to CL.
>>>>> 
>>>>> This seems like a situation that will result in deadlock, unless the library is also extended to include a "respond to process failure" function.  Is this handled in some other way, and I'm just not seeing it?
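>>>>> 
>>>>> For concreteness, the kind of extension I mean would be a purely hypothetical library entry point along these lines, which process Z could call from user code without ever seeing CL:
>>>>> 
>>>>>     /* Hypothetical library API, not part of any proposal: the caller
>>>>>      * reports an observed failure, and the library revokes or otherwise
>>>>>      * cleans up its private communicator CL internally. */
>>>>>     void libfoo_respond_to_failure(void);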
>>>>> 
>>>>> It seems like the revoke(comm) approach requires the programmer to know about all communication and all communicators/windows in use in their entire application, including those contained within libraries.  Is that a correct assessment?
>>>>> 
>>>>>  ~Jim.
>>>>> 
>>>>> 
>>>>>> On Wed, Nov 20, 2013 at 2:39 PM, Aurélien Bouteiller <bouteill at icl.utk.edu> wrote:
>>>>>> Rich, this is a follow-up to the proofreading work done during the regular meeting we had last week, which everybody, including the SC attendees, had a chance to join. I am sorry you couldn’t.
>>>>>> 
>>>>>> Anyway, here is the working document for today: all diffs since the introduction of the new RMA chapter five months ago.
>>>>>> 
>>>>>> Le 19 nov. 2013 à 17:07, Richard Graham <richardg at mellanox.com> a écrit :
>>>>>> 
>>>>>> > With SC this week, this is poor timing
>>>>>> >
>>>>>> > Rich
>>>>>> >
>>>>>> > ------Original Message------
>>>>>> > From: Wesley Bland
>>>>>> > To: MPI WG Fault Tolerance and Dynamic Process Control working Group
>>>>>> > Cc: MPI WG Fault Tolerance and Dynamic Process Control working Group
>>>>>> > ReplyTo: MPI WG Fault Tolerance and Dynamic Process Control working Group
>>>>>> > Subject: Re: [mpiwg-ft] A meeting this week
>>>>>> > Sent: Nov 19, 2013 2:13 PM
>>>>>> >
>>>>>> > Ok. I'll be there. I'll send it off for editing today.
>>>>>> >
>>>>>> > Wesley
>>>>>> >
>>>>>> >> On Nov 19, 2013, at 3:12 PM, Aurélien Bouteiller <bouteill at icl.utk.edu> wrote:
>>>>>> >>
>>>>>> >> Dear WG members,
>>>>>> >>
>>>>>> >> We have been misreading the new forum rules. We have to finalize the text of the proposal this week, not two weeks from now, so time is running short. I would like to invite you to a supplementary meeting tomorrow to review the text together.
>>>>>> >>
>>>>>> >> Jim, I don’t know if you will be able to attend on short notice, but your input would be greatly appreciated.
>>>>>> >>
>>>>>> >> Date: Wednesday, November 20
>>>>>> >> Time: 3pm Eastern (New York)
>>>>>> >> Dial-in information: 712-432-0360
>>>>>> >> Code: 623998#
>>>>>> >>
>>>>>> >> Agenda:
>>>>>> >> Review of ULFM text and final work.
>>>>>> >>
>>>>>> >> Aurelien
>>>>>> >>
>>>>>> >> --
>>>>>> >> * Dr. Aurélien Bouteiller
>>>>>> >> * Researcher at Innovative Computing Laboratory
>>>>>> >> * University of Tennessee
>>>>>> >> * 1122 Volunteer Boulevard, suite 309b
>>>>>> >> * Knoxville, TN 37996
>>>>>> >> * 865 974 9375
>>>>>> 
>>>>>> --
>>>>>> * Dr. Aurélien Bouteiller
>>>>>> * Researcher at Innovative Computing Laboratory
>>>>>> * University of Tennessee
>>>>>> * 1122 Volunteer Boulevard, suite 309b
>>>>>> * Knoxville, TN 37996
>>>>>> * 865 974 9375

