[mpiwg-ft] A meeting this week

Jim Dinan james.dinan at gmail.com
Fri Nov 22 12:53:12 CST 2013


The latter case requires a failure detector, right?  I had thought the
current design would avoid this requirement.

 ~Jim.
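
As an illustrative aside, the "latter case" from Wesley's message quoted below (a wildcard receive on X returning an error once Y has failed) can be sketched as a toy model in Python. This is not MPI or ULFM code: `ToyComm`, `recv_any`, `ProcFailedError`, and the `failed` set are hypothetical names invented for this sketch. The `failed` set stands in for whatever lets X learn that Y is gone, which is exactly the failure-detector question raised above.

```python
from dataclasses import dataclass, field


class ProcFailedError(Exception):
    """Hypothetical stand-in for the error a fault-tolerant MPI would report."""


@dataclass
class ToyComm:
    """Toy communicator: member ranks, known failures, and per-rank inboxes."""
    members: set
    failed: set = field(default_factory=set)   # assumption: filled in by a failure detector
    inbox: dict = field(default_factory=dict)  # rank -> list of (source, message)

    def send(self, src, dst, msg):
        # Queue a message from src for dst.
        self.inbox.setdefault(dst, []).append((src, msg))

    def recv_any(self, me):
        """Wildcard receive for rank `me`.

        Returns a queued (source, message) pair if one exists. Otherwise, if
        any other member is known to have failed, raises ProcFailedError,
        because the failed rank might have been the sender. Returns None when
        the receive would simply keep waiting.
        """
        queue = self.inbox.get(me, [])
        if queue:
            return queue.pop(0)
        if self.failed & (self.members - {me}):
            raise ProcFailedError(sorted(self.failed))
        return None


# Usage: CL is private to the library; X waits on it, then Y fails.
cl = ToyComm(members={"X", "Y", "Z"})
still_waiting = cl.recv_any("X") is None  # nothing queued, no known failure
cl.failed.add("Y")                        # the detector reports Y's failure to X
try:
    cl.recv_any("X")
except ProcFailedError:
    recovered = True                      # X can now enter its recovery path
```

In this model the receive only escapes the wait because something populated `failed` on X's side. Without that, `recv_any` would keep returning `None`, the toy equivalent of blocking forever.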


On Thu, Nov 21, 2013 at 12:06 PM, Wesley Bland <wbland at mcs.anl.gov> wrote:

> No problem about missing the call yesterday. It was a last minute thing. I
> think we're in good shape to submit the text on Monday, but we're just
> doing some final passes over the text. There were a few changes that
> Aurélien will be making and I'm getting Gail to do an English pass, but
> overall, it's still essentially the same as what we read last year (per the
> request from the forum).
>
> There are a couple of ways out of this deadlock. The first is, as you
> mentioned, to have a function in the library to essentially manually
> trigger an error handler and let the library figure out what is wrong. This
> method would work, but it is a bit heavy handed. The alternative solution
> is that the wildcard on process X should return an error because the
> failure of process Y meets the definition of an "involved process." Process
> X will get an exception (or an MPI_Errhandler) and can trigger the recovery
> path.
>
> Either way should work, but the latter is obviously the preferred and
> expected solution.
>
> Thanks,
> Wesley
>
> On Nov 21, 2013, at 10:58 AM, Jim Dinan <james.dinan at gmail.com> wrote:
>
> Hi Guys,
>
> Sorry I wasn't able to attend.  I'm back from SC now, if you need me.
>
> I have a concern about the current approach to revoking communicators.
>  Consider a program that uses a library with a communicator, CL, that is
> private to the library.  Process X makes a call to this library and
> performs a wildcard receive on CL.  Process Y fails; Y would have sent a
> message to X on CL.  Process Z sees that Y failed, but it sees it in the
> user code, outside of the library.  Process Z cannot call revoke on CL
> because it does not have any knowledge about how the library is implemented
> and it does not have a handle to CL.
>
> This seems like a situation that will result in deadlock, unless the
> library is also extended to include a "respond to process failure"
> function.  Is this handled in some other way, and I'm just not seeing it?
>
> It seems like the revoke(comm) approach requires the programmer to know
> about all communication and all communicators/windows in use in their
> entire application, including those contained within libraries.  Is that a
> correct assessment?
>
>  ~Jim.
>
>
> On Wed, Nov 20, 2013 at 2:39 PM, Aurélien Bouteiller <bouteill at icl.utk.edu
> > wrote:
>
>> Rich, this is a follow-up to the proofreading work done during the regular
>> meeting we had last week, which everybody, including SC attendees, had a
>> chance to join. I am sorry you couldn’t.
>>
>> Anyway, here is the working document for today: all diffs since the
>> introduction of the new RMA chapter 5 months ago.
>>
>> On Nov 19, 2013, at 5:07 PM, Richard Graham <richardg at mellanox.com> wrote:
>>
>> > With SC this week this is poor timing
>> >
>> > Rich
>> >
>> > ------Original Message------
>> > From: Wesley Bland
>> > To: MPI WG Fault Tolerance and Dynamic Process Control working Group
>> > Cc: MPI WG Fault Tolerance and Dynamic Process Control working Group
>> > ReplyTo: MPI WG Fault Tolerance and Dynamic Process Control working
>> Group
>> > Subject: Re: [mpiwg-ft] A meeting this week
>> > Sent: Nov 19, 2013 2:13 PM
>> >
>> > Ok. I'll be there. I'll send it off for editing today.
>> >
>> > Wesley
>> >
>> >> On Nov 19, 2013, at 3:12 PM, Aurélien Bouteiller <bouteill at icl.utk.edu>
>> wrote:
>> >>
>> >> Dear WG members,
>> >>
>> >> We have been misreading the new forum rules. We have to finalize the
>> text of the proposal this week, not two weeks from now, so time is
>> running short. I would like to invite you to a supplementary meeting
>> tomorrow to review the text together.
>> >>
>> >> Jim, I don’t know if you will be able to attend on short notice, but
>> your input would be greatly appreciated.
>> >>
>> >> Date: Wed, November 20
>> >> Time: 3pm EDT/New York
>> >> Dial-in information: 712-432-0360
>> >> Code: 623998#
>> >>
>> >> Agenda:
>> >> Review of ULFM text and final work.
>> >>
>> >> Aurelien
>> >>
>> >> --
>> >> * Dr. Aurélien Bouteiller
>> >> * Researcher at Innovative Computing Laboratory
>> >> * University of Tennessee
>> >> * 1122 Volunteer Boulevard, suite 309b
>> >> * Knoxville, TN 37996
>> >> * 865 974 9375
>> >>
>> >> _______________________________________________
>> >> mpiwg-ft mailing list
>> >> mpiwg-ft at lists.mpi-forum.org
>> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-ft
>>