[Mpi3-ft] Latest version of chapter

Darius Buntinas buntinas at mcs.anl.gov
Fri Oct 21 14:57:17 CDT 2011


Let's say you have commA used by library A and commB used by library B.
Library A has registered a process failure handler, handlerA, on commA.

Now, let's say a process that's in commA but not commB has failed, and the thread is executing in library B and calls, e.g., MPI_Send(..., commB).

The MPI implementation performs the MPI_Send operation normally, then calls handlerA(commA, MPI_ERR_PROC_FAIL_STOP), and returns from MPI_Send normally.

Inside handlerA, the subject communicator (commA) is passed as a parameter, so it won't be out of scope there.
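
To make it concrete, here's a minimal sketch. MPIX_Comm_set_failure_callback and the handler signature are placeholders for whatever registration interface the draft chapter ends up defining; MPI_ERR_PROC_FAIL_STOP is the error code from the scenario above, and everything else is standard MPI.

  #include <mpi.h>

  /* Placeholder error code and registration call; the real names come
   * from the process failure notification section of the draft. */
  #ifndef MPI_ERR_PROC_FAIL_STOP
  #define MPI_ERR_PROC_FAIL_STOP (MPI_ERR_LASTCODE + 1)
  #endif
  extern int MPIX_Comm_set_failure_callback(MPI_Comm comm,
                                            void (*fn)(MPI_Comm, int));

  /* Library A's handler, registered on commA. */
  static void handlerA(MPI_Comm comm, int errcode)
  {
      /* comm is commA and errcode is MPI_ERR_PROC_FAIL_STOP, no matter
       * which MPI call the thread was in when the failure was noticed. */
  }

  /* In library A's initialization:
   *     MPIX_Comm_set_failure_callback(commA, handlerA);
   *
   * Later, while the thread is in library B and a process in commA
   * (but not commB) has failed:
   *     MPI_Send(buf, count, MPI_INT, dest, tag, commB);
   * The implementation performs the send on commB normally, calls
   * handlerA(commA, MPI_ERR_PROC_FAIL_STOP), and then returns from
   * MPI_Send with MPI_SUCCESS. */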

Is it a problem that library A's handler is called from "within" library B?

-d


On Oct 21, 2011, at 2:42 PM, Supalov, Alexander wrote:

> Not really. How do you expect the user to make sense of that? E.g., I call A on commA, fail on commA asynchronously while calling a totally unrelated B on commB that has no failures in it, and am kicked out of B into someone else's error handler saying that some "A" on "commA" failed? And what now? I may even have A and commA out of scope by then, possibly forever.
> 
> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Darius Buntinas
> Sent: Friday, October 21, 2011 9:35 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] Latest version of chapter
> 
> With regular error handlers, the handler is called from within the function that raises the error.  With failure notification, because the handler is called as a result of an external event (a process failure), it could be called from within any MPI function, even one not related to the comm/file/win on which you registered the process failure notification handler.
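>
> As a sketch of the difference (MPIX_Comm_set_failure_callback is just a placeholder name for the draft's registration function; MPI_Comm_set_errhandler is existing MPI):
>
>   /* Regular error handler: errhandlerA is an MPI_Errhandler created
>    * with MPI_Comm_create_errhandler; it is invoked only from an MPI
>    * call on commA that raises an error. */
>   MPI_Comm_set_errhandler(commA, errhandlerA);
>
>   /* Failure notification callback (placeholder registration call):
>    * invoked when a failure of a process in commA is detected, possibly
>    * while the thread is inside an MPI call on some other communicator,
>    * window, or file. */
>   MPIX_Comm_set_failure_callback(commA, handlerA);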
> 
> Does that make sense?
> 
> -d
> 
> On Oct 21, 2011, at 1:51 PM, Sur, Sayantan wrote:
> 
>> 17.5.1:11-12 - "The error handler function will be called by the MPI implementation from within the context of some MPI function that was called by the user."
>> 
>> Maybe we should say that error handlers are called from MPI functions that are associated with that comm/file/win?
>> 
>> 
>> 
>>> -----Original Message-----
>>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
>>> bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
>>> Sent: Friday, October 21, 2011 10:28 AM
>>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>>> Subject: Re: [Mpi3-ft] Latest version of chapter
>>> 
>>> I just wanted to note that we want to distribute a copy of this
>>> chapter to the MPI Forum before the meeting. As such we are planning
>>> on sending out a copy at COB today (so Friday ~5:00 pm EDT) so that
>>> people have an opportunity to look at the document before the Monday
>>> plenary. So please send any edits or comments before COB today, so we
>>> can work them into the draft.
>>> 
>>> We will post the draft to the ticket, so that people know where to
>>> look for the current draft.
>>> 
>>> Thanks,
>>> Josh
>>> 
>>> 
>>> 
>>> On Thu, Oct 20, 2011 at 7:06 PM, Darius Buntinas <buntinas at mcs.anl.gov>
>>> wrote:
>>>> 
>>>> The latest version of the FT chapter is on the wiki (it's in a new
>>>> location on the main FT page under "ticket #276"). Please have a
>>>> look and comment.
>>>> 
>>>> Here's a direct link to the PDF:
>>>>   https://svn.mpi-forum.org/trac/mpi-forum-web/raw-attachment/wiki/FaultToleranceWikiPage/ft.pdf
>>>> 
>>>> Here's a summary of the changes Josh and I made:
>>>> 
>>>> * Minor wording touchups
>>>> * Added new semantics for MPI_ANY_SOURCE with the
>>>>   MPI_ERR_ANY_SOURCE_DISABLED error code
>>>> * Converted wording for all comm, win, fh creation operations to not
>>>>   require collectively active communicators (eliminates the
>>>>   requirement for synchronization)
>>>> * Added missing reader_lock to ANY_SOURCE example
>>>> * Added case for MPI_WIN_TEST
>>>> 
>>>> and
>>>> 
>>>> One-sided section
>>>>   clarified that window creation need not be blocking
>>>>   clarified that RMA ops might not complete correctly even if
>>>>     synchronization ops complete without error due to process
>>>>     failures
>>>> Process failure notification
>>>>   Added section describing new functions to add callbacks to comms,
>>>>     wins and files that are called when proc failure is detected
>>>> Other wordsmithing/cleanup changes
>>>> 
>>>> -d
>>> 
>>> 
>>> 
>>> --
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>> 
>> 
> 
> 
> 




