[mpiwg-ft] A meeting this week

Jim Dinan james.dinan at gmail.com
Fri Nov 22 14:51:51 CST 2013


Got it, thanks guys.

 ~Jim.


On Fri, Nov 22, 2013 at 3:15 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:

> No, it's a correctness thing. You're only required to detect failures for
> processes you're actively communicating with (in a collective, or when
> receiving/sending a message). If you're not directly communicating, the
> implementation isn't required to notify you of a failure (and it would be
> confusing if it did). In this case, you might need the revoke to notify the
> other processes manually. An example might be a stencil computation where
> one process detects a failure and decides to branch to a recovery path. You
> might need to revoke the communicator before branching.
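>
> A minimal sketch of that pattern, assuming the ULFM prototype's
> MPIX_-prefixed names from <mpi-ext.h> (the proposal text calls the routine
> MPI_Comm_revoke) and assuming MPI_ERRORS_RETURN is installed on comm so
> errors come back as return codes; the halo_exchange wrapper and its
> arguments are just placeholders for the stencil's neighbor exchange:
>
>     #include <mpi.h>
>     #include <mpi-ext.h>   /* MPIX_Comm_revoke, MPIX_ERR_PROC_FAILED */
>
>     /* Exchange halos with the two neighbors; the rank that sees the
>      * failure revokes the communicator so every other rank is forced
>      * out of its own exchange and can branch to recovery as well. */
>     int halo_exchange(MPI_Comm comm, int left, int right,
>                       double *sbuf, double *rbuf, int n)
>     {
>         int rc = MPI_Sendrecv(sbuf, n, MPI_DOUBLE, right, 0,
>                               rbuf, n, MPI_DOUBLE, left, 0,
>                               comm, MPI_STATUS_IGNORE);
>         if (rc != MPI_SUCCESS) {
>             int eclass;
>             MPI_Error_class(rc, &eclass);
>             if (eclass == MPIX_ERR_PROC_FAILED)
>                 MPIX_Comm_revoke(comm);  /* notify ranks that saw nothing */
>         }
>         return rc;  /* caller branches to recovery on any error */
>     }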
>
> Wesley
>
> On Nov 22, 2013, at 2:02 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>
> So, what is the argument for having MPI_Comm/win_revoke?  Is it a
> performance argument, rather than a correctness one?
>
>  ~Jim.
>
>
> On Fri, Nov 22, 2013 at 2:01 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>
>> It doesn't require an asynchronous failure detector. It does require that
>> you detect failures (in an unspecified way) insofar as a failure would
>> prevent completion. Once you enter the MPI library, you have to use some
>> sort of detector (probably at the runtime level) to keep from getting
>> deadlocked.
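>>
>> A small sketch of how that looks from the application's side, assuming the
>> ULFM prototype's MPIX_ERR_PROC_FAILED error class; comm, buf, partner, and
>> recover() are placeholders. The point is that detection only has to surface
>> at the call that would otherwise never complete:
>>
>>     /* Errors are returned instead of aborting the job. */
>>     MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
>>
>>     /* If 'partner' has failed, this returns an error rather than
>>      * hanging; ranks that never talk to 'partner' hear nothing. */
>>     int rc = MPI_Recv(buf, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
>>     if (rc != MPI_SUCCESS) {
>>         int eclass;
>>         MPI_Error_class(rc, &eclass);
>>         if (eclass == MPIX_ERR_PROC_FAILED)
>>             recover();  /* application's recovery path (placeholder) */
>>     }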
>>
>> Wesley
>>
>> On Nov 22, 2013, at 12:53 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>>
>> The latter case requires a failure detector, right?  I had thought the
>> current design would avoid this requirement.
>>
>>  ~Jim.
>>
>>
>> On Thu, Nov 21, 2013 at 12:06 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>>
>>> No problem about missing the call yesterday. It was a last-minute thing.
>>> I think we're in good shape to submit the text on Monday; we're just
>>> doing some final passes over it. There are a few changes that Aurélien
>>> will be making and I'm getting Gail to do an English pass, but overall
>>> it's still essentially the same as what we read last year (per the
>>> request from the forum).
>>>
>>> There are a couple of ways out of this deadlock. The first is, as you
>>> mentioned, to have a function in the library that essentially triggers an
>>> error handler manually and lets the library figure out what is wrong. This
>>> method would work, but it is a bit heavy-handed. The alternative is that
>>> the wildcard receive on process X should return an error, because the
>>> failure of process Y meets the definition of an "involved process."
>>> Process X will get an error (or have its MPI_Errhandler invoked) and can
>>> trigger the recovery path.
>>>
>>> Either way should work, but the latter is obviously the preferred and
>>> expected solution.
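>>>
>>> A sketch of that second path inside the library from your example, again
>>> using the ULFM prototype's MPIX_ names; CL, buf, count, and
>>> library_recover are placeholders:
>>>
>>>     /* On process X, inside the library. */
>>>     MPI_Status status;
>>>     int rc = MPI_Recv(buf, count, MPI_BYTE, MPI_ANY_SOURCE, MPI_ANY_TAG,
>>>                       CL, &status);
>>>     if (rc != MPI_SUCCESS) {
>>>         int eclass;
>>>         MPI_Error_class(rc, &eclass);
>>>         if (eclass == MPIX_ERR_PROC_FAILED) {
>>>             /* The failure of any rank of CL interrupts the wildcard
>>>              * receive, even if the dead rank was not the intended
>>>              * sender.  The library can now acknowledge the failure
>>>              * (MPIX_Comm_failure_ack) or revoke and rebuild CL;
>>>              * either way, the recovery stays inside the library. */
>>>             library_recover(CL);
>>>         }
>>>     }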
>>>
>>> Thanks,
>>> Wesley
>>>
>>> On Nov 21, 2013, at 10:58 AM, Jim Dinan <james.dinan at gmail.com> wrote:
>>>
>>> Hi Guys,
>>>
>>> Sorry I wasn't able to attend.  I'm back from SC now, if you need me.
>>>
>>> I have a concern about the current approach to revoking communicators.
>>>  Consider a program that uses a library with a communicator, CL, that is
>>> private to the library.  Process X makes a call to this library and
>>> performs a wildcard receive on CL.  Process Y fails; Y would have sent a
>>> message to X on CL.  Process Z sees that Y failed, but it sees it in the
>>> user code, outside of the library.  Process Z cannot call revoke on CL
>>> because it does not have any knowledge about how the library is implemented
>>> and it does not have a handle to CL.
>>>
>>> This seems like a situation that will result in deadlock, unless the
>>> library is also extended to include a "respond to process failure"
>>> function.  Is this handled in some other way, and I'm just not seeing it?
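>>>
>>> For concreteness, the kind of hook I mean would be something like the
>>> purely hypothetical sketch below (libfoo_notify_failure is a made-up name,
>>> and I'm using the prototype's MPIX_Comm_revoke); only the library itself
>>> can revoke CL, since the caller has no handle to it:
>>>
>>>     static MPI_Comm CL;   /* the library's private communicator */
>>>
>>>     /* Exported entry point: lets a caller that observed a failure
>>>      * elsewhere ask the library to interrupt its own communication. */
>>>     void libfoo_notify_failure(void)
>>>     {
>>>         MPIX_Comm_revoke(CL);  /* interrupts the wildcard receive on X */
>>>     }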
>>>
>>> It seems like the revoke(comm) approach requires the programmer to know
>>> about all communication and all communicators/windows in use in their
>>> entire application, including those contained within libraries.  Is that a
>>> correct assessment?
>>>
>>>  ~Jim.
>>>
>>>
>>> On Wed, Nov 20, 2013 at 2:39 PM, Aurélien Bouteiller <bouteill at icl.utk.edu> wrote:
>>>
>>>> Rich, this is a follow-up to the proofreading work done during the
>>>> regular meeting we had last week, which everybody, including SC
>>>> attendees, had a chance to join. I am sorry you couldn’t.
>>>>
>>>> Anyway, here is the working document for today: all diffs since the
>>>> introduction of the new RMA chapter five months ago.
>>>>
>>>> On Nov 19, 2013, at 5:07 PM, Richard Graham <richardg at mellanox.com>
>>>> wrote:
>>>>
>>>> > With SC this week, this is poor timing.
>>>> >
>>>> > Rich
>>>> >
>>>> > ------Original Message------
>>>> > From: Wesley Bland
>>>> > To: MPI WG Fault Tolerance and Dynamic Process Control working Group
>>>> > Cc: MPI WG Fault Tolerance and Dynamic Process Control working Group
>>>> > ReplyTo: MPI WG Fault Tolerance and Dynamic Process Control working Group
>>>> > Subject: Re: [mpiwg-ft] A meeting this week
>>>> > Sent: Nov 19, 2013 2:13 PM
>>>> >
>>>> > OK. I'll be there. I'll send it off for editing today.
>>>> >
>>>> > Wesley
>>>> >
>>>> >> On Nov 19, 2013, at 3:12 PM, Aurélien Bouteiller <bouteill at icl.utk.edu> wrote:
>>>> >>
>>>> >> Dear WG members,
>>>> >>
>>>> >> We have been misreading the new forum rules. We have to finalize the
>>>> >> text of the proposal this week, not two weeks from now, so time is
>>>> >> running short. I would like to invite you to a supplementary meeting
>>>> >> tomorrow to review the text together.
>>>> >>
>>>> >> Jim, I don’t know if you will be able to attend on short notice, but
>>>> >> your input would be greatly appreciated.
>>>> >>
>>>> >> Date: Wednesday, November 20
>>>> >> Time: 3pm Eastern (New York)
>>>> >> Dial-in information: 712-432-0360
>>>> >> Code: 623998#
>>>> >>
>>>> >> Agenda:
>>>> >> Review of ULFM text and final work.
>>>> >>
>>>> >> Aurelien
>>>> >>
>>>> >> --
>>>> >> * Dr. Aurélien Bouteiller
>>>> >> * Researcher at Innovative Computing Laboratory
>>>> >> * University of Tennessee
>>>> >> * 1122 Volunteer Boulevard, suite 309b
>>>> >> * Knoxville, TN 37996
>>>> >> * 865 974 9375
>>>>
>>>> --
>>>> * Dr. Aurélien Bouteiller
>>>> * Researcher at Innovative Computing Laboratory
>>>> * University of Tennessee
>>>> * 1122 Volunteer Boulevard, suite 309b
>>>> * Knoxville, TN 37996
>>>> * 865 974 9375
>>>>
