[mpiwg-ft] A meeting this week

Jim Dinan james.dinan at gmail.com
Fri Nov 22 14:02:08 CST 2013


So, what is the argument for having MPI_Comm/win_revoke?  Is it a
performance, rather than a correctness argument?

 ~Jim.


On Fri, Nov 22, 2013 at 2:01 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:

> It doesn't require an asynchronous failure detector. It does require that
> you detect failures (in an unspecified way) insofar as it prevents
> completion. Once you enter the MPI library, you have to use some sort of
> detector (probably from the runtime level) to keep from getting deadlocked.
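>
> To make that concrete, here is a minimal sketch of how "detection insofar
> as it prevents completion" looks from the application side, assuming the
> ULFM prototype's MPIX_ error class; everything else is illustrative:
>
>     #include <mpi.h>
>     #include <mpi-ext.h>   /* MPIX_ERR_PROC_FAILED in the ULFM prototype */
>
>     void try_recv(MPI_Comm comm, int src)
>     {
>         int buf, rc, eclass;
>
>         /* Return errors instead of aborting the job. */
>         MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
>
>         /* No detector is consulted while things complete normally; only
>          * when a failure prevents this receive from completing must the
>          * implementation notice it and raise an error. */
>         rc = MPI_Recv(&buf, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
>         if (rc != MPI_SUCCESS) {
>             MPI_Error_class(rc, &eclass);
>             if (eclass == MPIX_ERR_PROC_FAILED)
>                 ;  /* enter the recovery path */
>         }
>     }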
>
> Wesley
>
> On Nov 22, 2013, at 12:53 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>
> The latter case requires a failure detector, right?  I had thought the
> current design would avoid this requirement.
>
>  ~Jim.
>
>
> On Thu, Nov 21, 2013 at 12:06 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:
>
>> No problem about missing the call yesterday. It was a last-minute thing.
>> I think we're in good shape to submit the text on Monday, but we're just
>> doing some final passes over the text. There were a few changes that
>> Aurélien will be making and I'm getting Gail to do an English pass, but
>> overall, it's still essentially the same as what we read last year (per the
>> request from the forum).
>>
>> There are a couple of ways out of this deadlock. The first is, as you
>> mentioned, to have a function in the library to essentially manually
>> trigger an error handler and let the library figure out what is wrong. This
>> method would work, but it is a bit heavy-handed. The alternative solution
>> is that the wildcard on process X should return an error because the
>> failure of process Y meets the definition of an "involved process." Process
>> X will get an exception (or have its MPI_Errhandler invoked) and can trigger
>> the recovery path.
>>
>> Either way should work, but the latter is obviously the preferred and
>> expected solution.
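>>
>> A minimal sketch of that second path as it might look inside the library,
>> assuming the ULFM prototype API (MPIX_ prefix) and that CL already has
>> MPI_ERRORS_RETURN set; the function name is made up for illustration:
>>
>>     #include <mpi.h>
>>     #include <mpi-ext.h>   /* MPIX_Comm_revoke in the ULFM prototype */
>>
>>     /* cl is the library-private communicator (CL in the example below). */
>>     int lib_wait_for_work(MPI_Comm cl, int *msg)
>>     {
>>         int rc = MPI_Recv(msg, 1, MPI_INT, MPI_ANY_SOURCE, 0, cl,
>>                           MPI_STATUS_IGNORE);
>>         if (rc != MPI_SUCCESS) {
>>             /* The failure of any process that could have matched this
>>              * wildcard (an "involved process", e.g. Y) raises an error
>>              * here, so X is not left blocked.  X can then start the
>>              * recovery path, e.g. revoke CL so the rest of the library
>>              * bails out of it as well. */
>>             MPIX_Comm_revoke(cl);
>>         }
>>         return rc;   /* caller checks against MPI_SUCCESS */
>>     }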
>>
>> Thanks,
>> Wesley
>>
>> On Nov 21, 2013, at 10:58 AM, Jim Dinan <james.dinan at gmail.com> wrote:
>>
>> Hi Guys,
>>
>> Sorry I wasn't able to attend.  I'm back from SC now, if you need me.
>>
>> I have a concern about the current approach to revoking communicators.
>>  Consider a program that uses a library with a communicator, CL, that is
>> private to the library.  Process X makes a call to this library and
>> performs a wildcard receive on CL.  Process Y fails; Y would have sent a
>> message to X on CL.  Process Z sees that Y failed, but it sees it in the
>> user code, outside of the library.  Process Z cannot call revoke on CL
>> because it does not have any knowledge about how the library is implemented
>> and it does not have a handle to CL.
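>>
>> To make the shape of the problem concrete, here is a sketch with
>> hypothetical library entry points (the libfoo names and structure are
>> made up):
>>
>>     #include <mpi.h>
>>
>>     /* Inside the library: CL is private, created at libfoo_init time. */
>>     static MPI_Comm CL;
>>
>>     int libfoo_get(int *val)
>>     {
>>         /* X blocks here waiting for a message that the failed process Y
>>          * would have sent on CL. */
>>         return MPI_Recv(val, 1, MPI_INT, MPI_ANY_SOURCE, 0, CL,
>>                         MPI_STATUS_IGNORE);
>>     }
>>
>>     /* In the user code on Z: Z learns of Y's failure on a communicator
>>      * it owns, but it has no handle to CL and no libfoo entry point that
>>      * says "a process died, clean up", so it cannot revoke CL and
>>      * release X. */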
>>
>> This seems like a situation that will result in deadlock, unless the
>> library is also extended to include a "respond to process failure"
>> function.  Is this handled in some other way, and I'm just not seeing it?
>>
>> It seems like the revoke(comm) approach requires the programmer to know
>> about all communication and all communicators/windows in use in their
>> entire application, including those contained within libraries.  Is that a
>> correct assessment?
>>
>>  ~Jim.
>>
>>
>> On Wed, Nov 20, 2013 at 2:39 PM, Aurélien Bouteiller <
>> bouteill at icl.utk.edu> wrote:
>>
>>> Rich, this is a follow-up to the proofreading work done during the
>>> regular meeting we had last week, and everybody, including SC attendees,
>>> had a chance to join. I am sorry you couldn’t.
>>>
>>> Anyway, here is the working document for today: all diffs since the
>>> introduction of the new RMA chapter five months ago.
>>>
>>> On Nov 19, 2013, at 17:07, Richard Graham <richardg at mellanox.com> wrote:
>>>
>>> > With SC this week, this is poor timing.
>>> >
>>> > Rich
>>> >
>>> > ------Original Message------
>>> > From: Wesley Bland
>>> > To: MPI WG Fault Tolerance and Dynamic Process Control working Group
>>> > Cc: MPI WG Fault Tolerance and Dynamic Process Control working Group
>>> > ReplyTo: MPI WG Fault Tolerance and Dynamic Process Control working
>>> Group
>>> > Subject: Re: [mpiwg-ft] A meeting this week
>>> > Sent: Nov 19, 2013 2:13 PM
>>> >
>>> > Ok. I'll be there. I'll send it off for editing today.
>>> >
>>> > Wesley
>>> >
>>> >> On Nov 19, 2013, at 3:12 PM, Aurélien Bouteiller <
>>> bouteill at icl.utk.edu> wrote:
>>> >>
>>> >> Dear WG members,
>>> >>
>>> >> We have been misreading the new forum rules. We have to finalize the
>>> text of the proposal this week, not two weeks from now, so time is
>>> running short. I would like to invite you to a supplementary meeting
>>> tomorrow to review the text together.
>>> >>
>>> >> Jim, I don’t know if you will be able to attend on short notice, but
>>> your input would be greatly appreciated.
>>> >>
>>> >> Date: Wed, November 20,
>>> >> Time: 3pm EST/New York
>>> >> Dial-in information: 712-432-0360
>>> >> Code: 623998#
>>> >>
>>> >> Agenda:
>>> >> Review of ULFM text and final work.
>>> >>
>>> >> Aurelien
>>> >>
>>> >> --
>>> >> * Dr. Aurélien Bouteiller
>>> >> * Researcher at Innovative Computing Laboratory
>>> >> * University of Tennessee
>>> >> * 1122 Volunteer Boulevard, suite 309b
>>> >> * Knoxville, TN 37996
>>> >> * 865 974 9375
>>> >>
>>>
>>> --
>>> * Dr. Aurélien Bouteiller
>>> * Researcher at Innovative Computing Laboratory
>>> * University of Tennessee
>>> * 1122 Volunteer Boulevard, suite 309b
>>> * Knoxville, TN 37996
>>> * 865 974 9375
>>>