[mpiwg-ft] Ticket 324 June 2015 Reading

Fab Tillier ftillier at microsoft.com
Mon Jun 1 23:36:07 CDT 2015

I think propagating upwards is weird, and only there to work around poorly defined semantics at the cost of backward compatibility.

Take an application that makes a dup of MPI_COMM_WORLD, then changes the dup's errhandler to MPI_ERRORS_RETURN.  If this app then frees the dup communicator handle after issuing an MPI_Irecv, it would currently likely expect MPI_ERRORS_RETURN to be invoked if that receive encounters an error, rather than the MPI_ERRORS_ARE_FATAL handler that is set on MPI_COMM_WORLD.

If an application has a custom error handler that has per-communicator state, it would need to either:
- keep track of outstanding requests on the communicator before freeing the handle
- change the error handler to one of the built-in ones before freeing
- call MPI_COMM_DISCONNECT to ensure all pending communications are complete (with potential ramifications for anything < MPI_THREAD_MULTIPLE)

I think a better-scoped solution is to clarify that an application is erroneous if it requires per-communicator context in its error handler and calls MPI_COMM_FREE before all pending operations are complete. MPI has never provided any mechanism to notify the application when a communicator is actually freed, and it clearly states that the communicator may not yet be freed when MPI_COMM_FREE returns.


-----Original Message-----
From: mpiwg-ft [mailto:mpiwg-ft-bounces at lists.mpi-forum.org] On Behalf Of Schulz Martin
Sent: Tuesday, June 2, 2015 5:06 PM
To: MPI WG Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [mpiwg-ft] Ticket 324 June 2015 Reading

Hi Wesley,

Thanks for the summary - I think this describes it fairly well.

One additional comment: there was some discussion on having to deal with
situations where there is/was an associated communicator, but that
communicator has been freed before the fault happened. The question was
which communicator an Abort should be on in this case. Personally, I think
it should propagate upwards to the next communicator in the hierarchy,
worst case to COMM_WORLD, but other options exist as well. It would be
good, though, to clearly define this case.


Martin Schulz, schulzm at llnl.gov, http://scalability.llnl.gov/
CASC @ Lawrence Livermore National Laboratory, Livermore, USA

On 6/1/15, 10:00 PM, "Bland, Wesley" <wesley.bland at intel.com> wrote:

>Notes from the ticket reading are now posted on the wiki:
>TL;DR - The reading did not "pass", but we got lots of good feedback to
>come back with a new version. We should consider splitting this into two
>or three tickets. One to define new errhandlers that does the new
>definition (abort communicator) and one that's a more well defined old
>definition (abort MPI_COMM_WORLD). Another ticket will deprecate
>MPI_COMM_ERRORS_ARE_FATAL. Another ticket will consolidate the
>definitions of all of the error handling text to a single place.
>The rest of the details can be found in the wiki.
>I'll be working on some drafts over the next few days to try to get new
>versions of this ticket out for discussion. My tentative hope is to get
>this ready for a new plenary in September. There's going to be enough
>changes that this should probably get a plenary before another reading.
>Comments welcome.
>mpiwg-ft mailing list
>mpiwg-ft at lists.mpi-forum.org
