[Mpi3-ft] Latest version of chapter
Supalov, Alexander
alexander.supalov at intel.com
Fri Oct 21 23:18:41 CDT 2011
PS. Small extension of the summary at the bottom of my message below, to cross all t's: "... should fire the respective _inheritable_ failure handlers ..."
-----Original Message-----
From: Supalov, Alexander
Sent: Saturday, October 22, 2011 6:01 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: RE: [Mpi3-ft] Latest version of chapter
Hi,
I think what emerges is an approach to a good solution. There are some loose ends, though. Going through the discussion:
>>>> [Darius] The handlers are freed when the comm/win/file they're attached to is freed, so you'll never get a handler called with a comm/win/file that's invalid.
Let's clarify this. Imagine I had commA with some process X in it. X failed while commA was being destroyed. I did not get the notification in the respective call on my process Y, because that call had already returned there. Later I, being totally unaware of the process X failure, try to create a new commB with process X in it. Will I notice that it's gone? How will I detect that technically as an MPI implementor? Probably I'll track process state deep inside the library. How will I detect that as an MPI user, when my handlerA is gone? Probably I want handlerB, if available, to be called at this moment. I hope this example is reasonable, as it partially underpins the remaining comments below.
>>> [Josh] A few things that probably should be clarified with the new FailHandler:
>>> - Is it inherited by new communicators like other error handlers?
>>
>> [Darius] I'd say no. Because all it would do is call the same handler once for every communicator for the failure of the same process.
>
>[Josh] I agree.
Why this exception? A handler can trivially be made smart enough to notice it's being called over and over again in reaction to the death of the same process X. Actually, it does not need to be smart at all: when called, it will deal with the respective communicator/file/win. It's the library that needs to be smart, marking the process X failure only once and making sure all affected communicators get notified in due time.
Moreover, if inheritance is disabled in this case, we're more likely to get wrong programs. People will just keep forgetting about this special behavior of the failure handlers, and fail in an uncontrolled fashion despite their best efforts - a few weeks into the run.
By excluding inheritance here, we create an exception to the rule that should, I think, be justified by more than our assumptions about what the handler will do.
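The "smart enough" handler argued for above can be illustrated with a short sketch. This is plain C with a stub type standing in for the MPI communicator handle, and failhandler is a hypothetical callback signature, not standard MPI: an inherited copy of the handler may fire once per communicator for the same dead process, and per-process cleanup simply runs only on the first notification.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 64

/* Stub for an MPI communicator handle; real code would use MPI_Comm. */
typedef int comm_t;

/* Ranks whose failure this handler has already processed. */
static bool seen_failed[MAX_PROCS];

/* Hypothetical failure handler: may be invoked once for every
 * communicator that contains the failed process. Per-process cleanup
 * runs only on the first notification; per-communicator cleanup runs
 * every time the handler fires. */
void failhandler(comm_t comm, int failed_rank)
{
    (void)comm;
    if (!seen_failed[failed_rank]) {
        seen_failed[failed_rank] = true;
        /* ... per-process cleanup, e.g. release buffers for that rank ... */
    }
    /* ... per-communicator cleanup for 'comm', e.g. mark it unusable ... */
}

/* Helper for inspection. */
bool already_handled(int rank) { return seen_failed[rank]; }
```

With this pattern, inheriting the handler onto derived communicators costs nothing but a flag check on repeat invocations.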
>>> [Josh] If we restrict it to only be called from an operation that uses that comm/win/file, then it would only fire one handler. If multiple failures happen since the last time you made a call with a particular comm/win/file, then it should only be called once (not once per failure), because the user can find out about all failed processes at that time.
Again, this is not something I'd do. We may force the user to react to many process failures in one handler call. This is not what they may want to do. It's easier to write a handler that deals with one process failure at a time, and let the library take care of invoking it in the relatively rare case of multiple process failures. Even if we think about multi- and manycore cases, the number of failed processes will normally be limited by the number of cores per node, which is not going to be astronomical for quite a while, I guess. I.e., we don't seem to be creating a scalability bottleneck here.
If we do care about scalability even here, we should rather give the user a way to tell the library: "Yeah, I've dealt with this dead process X you've just reported. By the way, I've also dealt with processes Y1 to Yn you haven't reported yet, but you'll understand, right?" Then, and only then, may the library skip notifying the user of the process Y1 through Yn failures on the same comm/file/win. It still has to report all failures on other comm/file/win, though.
To simplify this, the library might report not one but a list of failures in the failure handler, or at least their total number at the moment of the handler invocation. Then the user will be well equipped to handle them all in one go. If they miss one, by chance, sloppiness, or a race condition, they'll get a shout (or shouts) next time.
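The "report a list of failures" idea could look roughly like the sketch below. This is plain C with stub types; fail_handler_fn, record_failure, and deliver_failures are hypothetical names, not standard MPI. The library accumulates pending failures per communicator and hands the user the whole list in one handler invocation:

```c
#include <assert.h>

/* Hypothetical handler signature: the library passes all failures
 * pending on this communicator at invocation time, so the user can
 * deal with them in one go. None of these names are standard MPI. */
typedef void (*fail_handler_fn)(int comm, const int *failed_ranks, int nfailed);

/* Pending (not yet delivered) failures for one communicator. */
typedef struct {
    int ranks[64];
    int count;
} pending_failures_t;

/* Library side: record a newly detected failure; it stays pending
 * until delivered to the user. */
void record_failure(pending_failures_t *p, int rank)
{
    p->ranks[p->count++] = rank;
}

/* Library side: invoke the handler once with the whole pending list,
 * then clear it, so the user is not shouted at once per failure. */
void deliver_failures(pending_failures_t *p, fail_handler_fn h, int comm)
{
    if (p->count > 0) {
        h(comm, p->ranks, p->count);
        p->count = 0;
    }
}

/* Example user handler that just counts what it was told about. */
static int total_reported = 0;
static void count_handler(int comm, const int *ranks, int n)
{
    (void)comm; (void)ranks;
    total_reported += n;
}
int reported_total(void) { return total_reported; }
```

A user who prefers one-failure-at-a-time handling can simply loop over the array inside the handler.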
I'm not deep enough into this proposal to judge at the moment whether this case is already covered, so take this as a user request. Users will mostly be rather naive about MPI FT at first, but they will have healthy instincts, common sense, and some MPI experience for sure. The above treatise reflects what I think they will expect of us against this backdrop.
> [Josh] Thinking a bit about the implementation it should not be too bad to
> track such things. We could keep a boolean (or do some fun function
> pointer hacking) on the communicator that is flipped whenever a new
> failure is detected, then flip it back after firing the error handler.
> Similar boolean to what we might use to disable collectives.
Looking above and summarizing: we probably need to track process status exactly once somewhere inside the implementation. When a process fails, all respective communicators should be marked as potentially problematic, or another, more scalable mechanism should be used in a reactive, on-demand fashion. Then the respective calls on them should fire the respective failure handlers at appropriate times (a point-to-point operation with the failed process, a collective operation invocation, etc.) without any assumptions as to how much the user will want to do there. The users should in turn have a way to tell the library that they've dealt with more than just the immediately reported failure(s) that caused the handler to be invoked in the first place.
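The summary above can be sketched end to end: once-only failure detection, per-communicator "potentially problematic" marks, lazy firing at MPI boundaries, and user acknowledgement. This is plain C with library-internal stub bookkeeping; every name here is hypothetical, and fixed-size arrays stand in for whatever scalable structures a real implementation would use:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 64
#define MAX_COMMS 16

/* Library-internal process state: set exactly once per failure. */
static bool proc_failed[MAX_PROCS];

/* Per-communicator pending marks, one per process: set when a failure
 * is first detected, cleared once the handler for that communicator
 * has fired (or the user has acknowledged the failure). */
static bool comm_pending[MAX_COMMS][MAX_PROCS];

/* Which processes belong to which communicator (library bookkeeping). */
static bool comm_member[MAX_COMMS][MAX_PROCS];

void add_member(int c, int rank) { comm_member[c][rank] = true; }

/* Detection path: mark the process dead once, and flag every
 * communicator that contains it as potentially problematic. */
void on_process_failure(int rank)
{
    if (proc_failed[rank]) return;      /* tracked exactly once */
    proc_failed[rank] = true;
    for (int c = 0; c < MAX_COMMS; c++)
        if (comm_member[c][rank])
            comm_pending[c][rank] = true;
}

/* Called at an MPI boundary on communicator c: reports whether a
 * failure handler should fire now, and clears the pending marks so
 * the same failure is not reported on this communicator again. */
bool should_fire_handler(int c)
{
    bool fire = false;
    for (int r = 0; r < MAX_PROCS; r++)
        if (comm_pending[c][r]) { fire = true; comm_pending[c][r] = false; }
    return fire;
}

/* User-side acknowledgement: "I've also dealt with rank r", so skip
 * further notification for that rank on this communicator only. */
void ack_failure(int c, int rank) { comm_pending[c][rank] = false; }
```

Other communicators containing the failed process keep their pending marks, so each still gets exactly one notification at its next MPI boundary.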
Best regards.
Alexander
-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
Sent: Saturday, October 22, 2011 12:20 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Latest version of chapter
On Fri, Oct 21, 2011 at 6:01 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
>
> On Oct 21, 2011, at 4:18 PM, Josh Hursey wrote:
>
>> The original problem was that the application wanted uniform
>> notification of a process failure (restricted to the set of processes
>> in the group associated with the communication object). The current
>> error handlers are only fired when interacting with the failed process
>> directly (P2P) or indirectly (ANY_SOURCE, collectives).
>>
>> The requesters (who I believe are on the list and may want to pipe up)
>> were ok with having the callback triggered at an MPI boundary - so not
>> really asynchronous just not associated with the call.
>>
>> Maybe it is enough to restrict the notification to operations on the
>> communicator. So the FailHandler registered on commA is only fired
>> when commA is being used. The application would have to register the
>> FailHandler on all communicators that it is using and wants
>> notification from. But that would preserve some separation between the
>> library and application.
>
> I think that makes sense.
Why don't we start with that restriction, and run it by folks next week.
>
>>
>> A few things that probably should be clarified with the new FailHandler:
>> - Is it inherited by new communicators like other error handlers?
>
> I'd say no. Because all it would do is call the same handler once for every communicator for the failure of the same process.
I agree.
>
>> - Without the communicator scope restriction mentioned above, if a
>> process fails, does it fire all of the FailHanders registered on
>> communication objects containing that process? (I think yes) If so, we
>> should probably state that we do not guarantee any ordering of these
>> calls.
>
> If we restrict it to only be called from an operation that uses that comm/win/file, then it would only fire one handler. If multiple failures happen since the last time you made a call with a particular comm/win/file, then it should only be called once (not once per failure), because the user can find out about all failed processes at that time.
That sounds good to me.
Thinking a bit about the implementation it should not be too bad to
track such things. We could keep a boolean (or do some fun function
pointer hacking) on the communicator that is flipped whenever a new
failure is detected, then flip it back after firing the error handler.
Similar boolean to what we might use to disable collectives.
>
>> - In the function signatures the errhandler is 'int'/'integer' and
>> should probably be MPI_Errhandler or similar handle.
>
> I copied the prototypes from the error handler section.
In section 8.3.1 of MPI 2.2 they are of the type MPI_Errhandler, i.e.,
handles created from pointers to the error handler function prototype.
We can probably use the same function pointer signature and object for
these new functions.
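The handle-wraps-a-function-pointer pattern referred to here can be sketched as follows. This is plain C with stub types, not a real MPI binding: in MPI 2.2 the actual types are MPI_Errhandler and MPI_Comm_errhandler_function, and the creation call is MPI_Comm_create_errhandler; the names below are analogues for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Stub mirroring the MPI pattern: an opaque errhandler object wraps a
 * function pointer. Real MPI hides the struct behind a handle type. */
typedef int comm_t;
typedef void (comm_errhandler_fn)(comm_t *comm, int *error_code);

typedef struct {
    comm_errhandler_fn *fn;
} errhandler_t;

/* Analogue of MPI_Comm_create_errhandler: wrap the user function in a
 * handle object that can later be attached to a communicator. */
int create_errhandler(comm_errhandler_fn *fn, errhandler_t *eh)
{
    eh->fn = fn;
    return 0;
}

/* Library side: fire the handler through the handle. */
void invoke_errhandler(errhandler_t *eh, comm_t comm, int code)
{
    eh->fn(&comm, &code);
}

/* Example user handler recording the reported error code. */
static int last_code = 0;
static void record_code(comm_t *comm, int *error_code)
{
    (void)comm;
    last_code = *error_code;
}
int get_last_code(void) { return last_code; }
```

A new FailHandler could reuse exactly this object shape, differing only in when the library chooses to invoke it.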
>
>> - On the topic of what functions you can use inside, we can probably
>> use the language from the error handlers. I think it allows the user
>> to do pretty much anything they want, though I'd have to double check.
>> It might be that the standard is silent on this point, so no specific
>> restrictions are defined.
>
> I didn't see any restrictions, but then the standard says that all bets are off when you get an error, so calling anything at that point is undefined.
I think staying silent for now is a good idea. But maybe we can think
about it over the weekend and talk more about it next week.
Darius: Do you have some time to make some of these changes to the
chapter and post a new copy of the document to the ticket? We probably
want the whole MPI standard text since some text changed outside of the
chapter for MPI_Finalize stuff. If not, I can probably get to it late
this evening, or tomorrow.
Thanks,
Josh
>
> -d
>
>
>>
>>
>> -- Josh
>>
>>
>> On Fri, Oct 21, 2011 at 4:54 PM, Supalov, Alexander
>> <alexander.supalov at intel.com> wrote:
>>> Imagine I use some data protection scheme inside B. I won't be affected by "wrong" libraries that I call before or after my protection is on. I may be affected by an asynchronous call out of "another world" that is possible if handlerA is called from within my library B. I.e., in the sequence
>>>
>>> A-B-A
>>>
>>> this extension allows B to be "hacked" by A by just killing one process at the right time. Moreover, I can clean up the callbacks by using MPI_Comm_create instead of MPI_Comm_dup. I cannot prevent an asynchronous handler from being called.
>>>
>>> -----Original Message-----
>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
>>> Sent: Friday, October 21, 2011 10:44 PM
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>> Subject: Re: [Mpi3-ft] Latest version of chapter
>>>
>>>
>>> On Oct 21, 2011, at 3:19 PM, Supalov, Alexander wrote:
>>>
>>>> Thanks. See below (prefix "AS>").
>>>>
>>>> -----Original Message-----
>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
>>>> Sent: Friday, October 21, 2011 9:57 PM
>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>> Subject: Re: [Mpi3-ft] Latest version of chapter
>>>>
>>>>
>>>> Let's say you have commA used by library A and commB used by library B.
>>>> Library A has registered the proc failure handler called handlerA on commA.
>>>>
>>>> Now, let's say a process that's in commA but not commB failed, and the thread is executing in library B and calls, e.g., MPI_Send(..., commB).
>>>>
>>>> The MPI implementation performs the MPI_Send operation normally, then calls handlerA(commA, MPI_ERR_PROC_FAIL_STOP), and returns from MPI_Send normally.
>>>>
>>>> While in handlerA, the subject communicator (commA) is passed as a parameter, so it won't be out of scope.
>>>>
>>>> Is it a problem that library A's handler is called from "within" library B?
>>>>
>>>> AS> Sure. This handler may have been written by someone else who does not know me or my B or anything else. I may not even want it to be called from within my library B for security reasons. What if it unwinds the stack, connects to A's HQ, and dumps my confidential memory all over there?
>>>
>>> Yikes! Don't link with libraries you don't trust :-)
>>>
>>> I don't know how to handle this case, but does the current standard prevent a library from snooping memory from other libraries? A library could set an attribute with a copy callback function on comm_world. That would be called from within another library's stack if that library tries to dup comm_world.
>>>
>>>> Moreover, by the time it's called, both A and commA may be long gone, together with the context in which handlerA was supposed to be executed. What will it try to handle then, and under what assumptions? I don't know. You?
>>>
>>> The handlers are freed when the comm/win/file they're attached to is freed, so you'll never get a handler called with a comm/win/file that's invalid.
>>>
>>> -d
>>>
>>>
>>>> -d
>>>>
>>>>
>>>> On Oct 21, 2011, at 2:42 PM, Supalov, Alexander wrote:
>>>>
>>>>> Not really. How do you want the user to make sense of that? E.g., I call A on commA, fail on commA asynchronously while calling a totally unrelated B on commB that has no failures in it, and am kicked out of B into someone else's error handler saying some "A" on "comma" failed? And what now? I may even have A and commA out of scope by then, possibly forever.
>>>>>
>>>>> -----Original Message-----
>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
>>>>> Sent: Friday, October 21, 2011 9:35 PM
>>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>> Subject: Re: [Mpi3-ft] Latest version of chapter
>>>>>
>>>>> With (regular) error handlers, they'll be called from within the function that raises the error. With failure notification, because they're being called as a result of an external event (process failure), you could be called from within any function, even one not related to the comm/file/win that you registered the process failure notification handler on.
>>>>>
>>>>> Does that make sense?
>>>>>
>>>>> -d
>>>>>
>>>>> On Oct 21, 2011, at 1:51 PM, Sur, Sayantan wrote:
>>>>>
>>>>>> 17.5.1:11-12 - "The error handler function will be called by the MPI implementation from within the context of some MPI function that was called by the user."
>>>>>>
>>>>>> Maybe we should say that error handlers are called from MPI functions that are associated with that comm/file/win?
>>>>>>
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>>>>>>> bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
>>>>>>> Sent: Friday, October 21, 2011 10:28 AM
>>>>>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>>>>>> Subject: Re: [Mpi3-ft] Latest version of chapter
>>>>>>>
>>>>>>> I just wanted to note that we want to distribute a copy of this
>>>>>>> chapter to the MPI Forum before the meeting. As such we are planning
>>>>>>> on sending out a copy at COB today (so Friday ~5:00 pm EDT) so that
>>>>>>> people have an opportunity to look at the document before the Monday
>>>>>>> plenary. So please send any edits or comments before COB today, so we
>>>>>>> can work them into the draft.
>>>>>>>
>>>>>>> We will post the draft to the ticket, so that people know where to
>>>>>>> look for the current draft.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Josh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 20, 2011 at 7:06 PM, Darius Buntinas <buntinas at mcs.anl.gov>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> The latest version of the FT chapter is on the wiki (it's in a
>>>>>>> new location on the main FT page under "ticket #276"). Please have a
>>>>>>> look and comment.
>>>>>>>>
>>>>>>>> Here's a direct link to the PDF:
>>>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/raw-
>>>>>>> attachment/wiki/FaultToleranceWikiPage/ft.pdf
>>>>>>>>
>>>>>>>> Here's a summary of the changes Josh and I made:
>>>>>>>>
>>>>>>>> * Minor wording touchups
>>>>>>>> * Added new semantic for MPI_ANY_SOURCE with the
>>>>>>> MPI_ERR_ANY_SOURCE_DISABLED error code
>>>>>>>> * Converted wording for all comm, win, fh creation operations to not
>>>>>>> require collectively active communicators (eliminate requirement for
>>>>>>> synchronization)
>>>>>>>> * Added missing reader_lock to ANY_SOURCE example
>>>>>>>> * Added case for MPI_WIN_TEST
>>>>>>>>
>>>>>>>> and
>>>>>>>>
>>>>>>>> One-sided section
>>>>>>>>  * clarified that window creation need not be blocking
>>>>>>>>  * clarified that RMA ops might not complete correctly even if
>>>>>>>>    synchronization ops complete without error due to process failures
>>>>>>>> Process failure notification
>>>>>>>>  * Added section describing new functions to add callbacks to comms,
>>>>>>>>    wins and files that are called when proc failure is detected
>>>>>>>> Other wordsmithing/cleanup changes
>>>>>>>>
>>>>>>>> -d
>>>>>>>> _______________________________________________
>>>>>>>> mpi3-ft mailing list
>>>>>>>> mpi3-ft at lists.mpi-forum.org
>>>>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Joshua Hursey
>>>>>>> Postdoctoral Research Associate
>>>>>>> Oak Ridge National Laboratory
>>>>>>> http://users.nccs.gov/~jjhursey
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------------------
>>>>> Intel GmbH
>>>>> Dornacher Strasse 1
>>>>> 85622 Feldkirchen/Muenchen, Deutschland
>>>>> Sitz der Gesellschaft: Feldkirchen bei Muenchen
>>>>> Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
>>>>> Registergericht: Muenchen HRB 47456
>>>>> Ust.-IdNr./VAT Registration No.: DE129385895
>>>>> Citibank Frankfurt a.M. (BLZ 502 109 00) 600119052
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
>
>
>
>
--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey