[Mpi3-ft] New Revision of Run-Through Stabilization Proposal (Dec. 9, 2011)

Josh Hursey jjhursey at open-mpi.org
Fri Dec 9 14:28:24 CST 2011

A new version of the document is available on the following wiki page:

Dec. 9, 2011	FTWG-Process-FT-Draft-2011-12-09.pdf

Change Log:
 * Typo fixes.
 * 10.5.4: Replaced "not defined" with "equivalent to that of process
failure as defined in Chapter 17".

 * 17.5.1: Clarify that FailHandlers are not restricted to the
context of the communicator in use. The FailHandler is triggered on an
MPI operation without regard to the communication context used in the
MPI operation.
 * 17.5.1: Note that FailHandlers will be called in the same order at
all processes, but the order is determined by MPI.
 * 17.5.1: Lifted restrictions on what can be called in the
FailHandler. Added an exception list, which just includes MPI_Finalize
at the moment.

 * 17.5.6: Pulled MPI_Init wording into a separate ticket:
 * 17.5.6: Replaced MPI_Init wording with an Advice to Implementors
about process failure.
 * 17.5.6: Pulled the mpiexec wording into a separate ticket:

 * 17.6: Added some clarifications for MPI_ANY_SOURCE: pending
requests can match while any_source is disabled. Once matched, they are
free to return successfully to the application even if any_source is
still disabled.
 * 17.6: Clarified that MPI_Comm_drain must be able to tolerate
process failure, and does not need a collectively active communicator.

 * 17.7.1: Clarified that MPI_Comm_validate must be able to tolerate
process failure, and does not need a collectively active communicator.

Open Discussion Items:
 * New name for MPI_Comm_reenable_any_source() -- See forthcoming email
 * Should we merge MPI_Comm_group_failed() and
MPI_Comm_reenable_any_source(), so that calling
MPI_Comm_group_failed() also re-enables MPI_ANY_SOURCE operations?
 * 3.10 & 17.6.2 : Do these sections conflict? Should the status only
be associated with the 'source' since MPI_Recv would have returned the
status value if the operations were called separately?
 * 17.6.2: The MPI_ANY_SOURCE discussion paragraph is long and
complex. We might want to think about how to simplify the wording.
 * 17.5.1: Should the FailHandler be set collectively, or just locally?
 * 17.5.1: Should the FailHandler be called only with a consistent
group of failed processes? If so, should the internal call to the
FailHandler be required to call a comm_validate to reach consensus
before triggering the callbacks?
 * 17.5.1: Things would be easier if -all- FailHandlers were triggered
when a process fails, regardless of whether the associated
communication object contains a failure.

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
