[Mpi3-ft] A few notes from UTK about the RTS proposal

Josh Hursey jjhursey at open-mpi.org
Fri Dec 2 14:00:16 CST 2011

As I mentioned on Wednesday's call, Rich and I met with the folks at
UTK on Thursday to discuss the run-through stabilization proposal. It
was quite a productive meeting. Since we did not get through the whole
proposal, we are having another meeting on Dec. 12 from which I will
surely have more notes.

Below are a few notes that I gathered from the meeting that we should
discuss on the call, if not on the list beforehand.

So this is a bit long, and I am sure that I am simplifying/missing
details that were discussed in the meeting. UTK folks, if there are
any points that are not appropriately clear, pipe up and let us know.


17.5.6: Startup: MPI_Init
Problem: This section is still unclear, and needs further clarification.

If MPI_Init fails to initialize the library it should return an error.
The user cannot call MPI_Init again, so what do we expect the user to
do in a fault tolerant application?

If MPI_Init encounters a process failure, but can initialize properly
then it should return success. The process failure notification should
be delayed until sometime after MPI_Init.

For process recovery, we may want to clarify "the number of processes
that the user asked for." This way if some number of processes F
failed during startup, then the application has the ability to restart
F processes within MPI_COMM_WORLD. We may not want to address this in
the stabilization proposal, but it is something we should discuss for
the recovery proposal.

10.5.4: Connectivity
Problem: The sentiment was that by using MPI_Abort to abort only the
processes in the specified 'comm', we are doing a dangerous thing
with the notion of 'connected' as defined in this section.

Given the bulleted statement:
 "If a process terminates without calling MPI_FINALIZE, independent
processes are not affected but the effect on connected processes is
not defined."

We might want to clarify the 'is not defined' portion of this
statement. This section also contains a note about MPI_Abort, whereas
in the proposal we say that MPI_Abort only terminates the processes
in the specified communicator (Section 8.3).

So we might need to look at this section and clarify as necessary.

Problem: The model is too complex and can be simplified.

A blocking MPI_Recv(ANY) will return an error when a new failure
occurs. A nonblocking MPI_Irecv(ANY) request, during the
MPI_Wait/Test, will return a 'warning' (MPI_ERR_ANY_SOURCE_DISABLED)
and not complete when a new failure occurs.

There was a debate about whether the user needs to call another
function (like MPI_Comm_reenable_any_source) to reactivate the
ANY_SOURCE receives, or whether the fact that the error was returned
to the user is sufficient to reactivate ANY_SOURCE. The requests that
returned a 'warning' error can still be matched/completed if a
matching send arrives while the user is handling the error. So this
kind of just works like an interrupt.

Additionally, once an error is returned, should we remove the
protection that "Any new MPI_ANY_SOURCE receive operations using a
communicator with MPI_ANY_SOURCE receive operations disabled will not
be posted, and marked with an error code of the class
MPI_ERR_PROC_FAIL_STOP"? Then a new MPI_Recv(ANY) would post
successfully.

The sentiment is that the user is notified of the process failure via
the return code from the MPI_Wait/Test, and if they do another
MPI_Wait/Test or post a new ANY_SOURCE receive they implicitly
acknowledge the failed process. To avoid a race condition where
another process fails between the error return and the MPI_Wait/Test,
the library needs to associate an 'epoch' value with the communicator
to make sure that the next MPI_Wait/Test returns another error.
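
A minimal sketch of the epoch idea, written as a toy Python model
rather than real MPI code. The names Comm, AnySourceRequest,
failure_epoch and acked_epoch are hypothetical and are not part of
the proposal; keeping the acknowledged epoch on the request is also
just one assumed way to do the bookkeeping.

```python
# Toy model only: no real MPI calls. All names here are hypothetical.
WARNING = "MPI_ERR_ANY_SOURCE_DISABLED"
SUCCESS = "MPI_SUCCESS"
PENDING = "PENDING"


class Comm:
    """Stand-in for a communicator that counts observed failures."""

    def __init__(self):
        self.failure_epoch = 0  # bumped once per newly detected failure
        self.inbox = []         # messages available for ANY_SOURCE matching

    def process_failed(self):
        self.failure_epoch += 1


class AnySourceRequest:
    """Stand-in for a nonblocking ANY_SOURCE receive request."""

    def __init__(self, comm):
        self.comm = comm
        # Failures known when the receive is posted count as acknowledged.
        self.acked_epoch = comm.failure_epoch
        self.data = None

    def test(self):
        # A failure newer than the last one reported is reported first,
        # even if it occurred between two test() calls.
        if self.comm.failure_epoch > self.acked_epoch:
            self.acked_epoch = self.comm.failure_epoch
            return WARNING
        if self.comm.inbox:
            self.data = self.comm.inbox.pop(0)
            return SUCCESS
        return PENDING


comm = Comm()
req = AnySourceRequest(comm)
comm.process_failed()
first = req.test()         # reports the first failure
comm.process_failed()      # a second failure before the retry
second = req.test()        # reported as well, not silently absorbed
comm.inbox.append("hello")
third = req.test()         # no new failure, so the receive can complete
```

In this model, calling test() again after a warning acts as the
implicit acknowledgement, and the epoch comparison is what closes the
race window.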

I like the idea of the user doing an explicit second operation, like
MPI_Comm_reenable_any_source, since then we know that the user is
aware of the failed group (and seems more MPI-like to me). However, I
also like the more flexible semantics of the implicit
acknowledgement. So I'm a bit split on this one.

Problem: The name of the function "MPI_COMM_REENABLE_ANY_SOURCE" is
misleading since posted MPI_Irecv's are left in the matching queue.

Posted MPI_Irecv(ANY)'s are not removed from the matching queue when
a new failure occurs and then re-posted when this function is called;
they stay in the queue the whole time. The 'disable' and '(re)enable'
parlance seems to indicate that all of the operations completed, when
they may not have.

So the suggestion was that we talk in terms of
deactivating/reactivating ANY_SOURCE operations. Replacing this word
would put us close to the 30 character limit. I don't know if there is
better language we can use to communicate this idea or not.

Problem: If MPI_Comm_drain is collective and meant to be called in the
FailHandler then we need some ordering of when the FailHandlers are
triggered in order to avoid deadlocks.

This seemed to be less of a problem with MPI_Comm_drain and more of a
problem with FailHandler triggering (see below).

It was mentioned that this operation might be able to be a local
operation by associating an additional matching bit ('epoch') with the
communication channel. Then incrementing the epoch will exclude all
previous messages from matching. We don't say that the collective
MPI_Comm_drain is synchronizing, so I think this would be a valid
implementation of this operation. Something to keep in mind.
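
A minimal sketch of the local-drain idea, again as a toy Python model
with no real MPI calls. The Endpoint class and its methods are
hypothetical; the sketch assumes that each message is stamped with
the sender's epoch and that a receive only matches messages stamped
with the receiver's current epoch.

```python
# Toy model only: no real MPI calls. All names here are hypothetical.
class Endpoint:
    """One side of a communication channel with an epoch matching field."""

    def __init__(self):
        self.epoch = 0
        self.unexpected = []  # queued (epoch, payload) pairs

    def send(self, peer, payload):
        # Stamp the message with the sender's current epoch.
        peer.unexpected.append((self.epoch, payload))

    def drain(self):
        # Purely local: bump the epoch and drop messages that can no
        # longer match. Messages from a peer that drained earlier than
        # we did carry a newer epoch and are kept.
        self.epoch += 1
        self.unexpected = [m for m in self.unexpected if m[0] >= self.epoch]

    def recv(self):
        # Match only messages stamped with our current epoch.
        for i, (epoch, payload) in enumerate(self.unexpected):
            if epoch == self.epoch:
                del self.unexpected[i]
                return payload
        return None


a, b = Endpoint(), Endpoint()
a.send(b, "stale")   # sent in epoch 0
a.drain()            # a moves to epoch 1 without talking to b
a.send(b, "fresh")   # sent in epoch 1
b.drain()            # b moves to epoch 1 and drops "stale"
first = b.recv()
second = b.recv()
```

Neither drain() call waits for the other side here, which is the
point of the suggestion: the operation excludes old messages without
synchronizing.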

We need to also clarify that MPI_Comm_drain is a fault tolerant
collective, so there is no need to call MPI_Comm_validate beforehand.

17.7.1: Validating Communicators
Problem: Would like to be able to fix all of the communication
problems resulting from a newly failed process outside of the critical

The sentiment was that with the FailHandler and
MPI_Comm_dup/MPI_Comm_free, one should be able to eliminate the need
to ever call MPI_Comm_validate. So why have it in the standard?

So in the FailHandler the user would call MPI_Comm_dup/MPI_Comm_split
on the communicator provided, and get a replacement that is
collectively active. Then they would MPI_Comm_free the old
communicator (maybe doing a MPI_Comm_drain to flush messages before
freeing it). Then replace the comm pointer with the new dup'ed

MPI_Comm_dup does not guarantee that the resulting communicator is
collectively active. So this might not work in the current proposal.
But if it did, then you might be able to get away with this trick.

17.8.2: Communicator Management
Problem: We make no statement on whether the newly created
communicator is collectively active or not.

If you pass in a collectively inactive communicator, it is likely
that you get out a collectively inactive new communicator, since we
do not know whether all alive processes in the communicator know of
the same set of failed processes.

If we required that the new communicator be collectively active, then
we would preclude local optimizations such as caching cids for
MPI_Comm_dup.

If you pass in a collectively active communicator, then the user may
expect that the new communicator is also collectively active. If a new
process failure emerges during the creation operation, then the
reporting of it may be delayed until after the creation operation has
completed. In effect, the library pretends that the failure happened
just after the creation operation.

So do we need to explicitly clarify this point? Would adding an advice
to implementors stating that 'the newly created communicator should be
collectively active' help address the problem?

17.5.1: Process Failure Handlers
Problem: FailHandlers are only called if a process fails on the
associated communication object. If the application is using many
different communicators each with a FailHandler then it is likely that
different FailHandlers will be triggered off of different
communicators. If the application uses MPI_Comm_drain in the
operation, this can lead to deadlock and other badness.

So this is a tough problem. What we want is to trigger all
FailHandlers whose communicator includes the failed process (or maybe
all FailHandlers, regardless of whether they include the failed
process). Additionally, we need to trigger them in a consistent order
across all processes so that we avoid deadlock scenarios.

It was proposed that we create something like a global partial
ordering of FailHandler calls. The FailHandler
registration/deregistration calls would be collective (so we can
coordinate the call ordering). The order of the calls is determined by
the MPI library, but the user is guaranteed that FailHandlers will be
called in the same order at all processes.
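
A minimal sketch of the ordering guarantee, as a toy Python model with
no real MPI calls. FailHandlerRegistry and its methods are
hypothetical. The sketch assumes that collective registration lets
every process assign the same sequence number to the same
communicator, and it uses registration order as the call order, which
is only one order a library could pick.

```python
# Toy model only: no real MPI calls. All names here are hypothetical.
class FailHandlerRegistry:
    """Per-process table of FailHandlers with a shared call order."""

    def __init__(self):
        self._next_seq = 0
        self._entries = {}  # comm name -> (seq, members, handler)

    def register(self, comm_name, members, handler):
        # Assumed collective: all processes register in the same order,
        # so the sequence numbers agree everywhere.
        self._entries[comm_name] = (self._next_seq, frozenset(members), handler)
        self._next_seq += 1

    def trigger(self, failed_rank):
        # Call every handler whose communicator contains the failed
        # rank, in sequence order rather than local detection order.
        affected = [
            (seq, name, handler)
            for name, (seq, members, handler) in self._entries.items()
            if failed_rank in members
        ]
        affected.sort(key=lambda entry: entry[0])
        for _, name, handler in affected:
            handler(name, failed_rank)


def run_process(log):
    """Simulate one surviving process registering and reacting."""
    registry = FailHandlerRegistry()

    def handler(name, rank):
        log.append(name)

    registry.register("world", {0, 1, 2, 3}, handler)
    registry.register("row", {0, 1}, handler)
    registry.register("col", {0, 2}, handler)
    registry.trigger(failed_rank=1)
    return log


log_rank0 = run_process([])
log_rank2 = run_process([])
```

Both simulated processes call the handlers for "world" and then
"row", and skip "col" because rank 1 is not a member. The harder part
discussed in the meeting, agreeing on the order and on the failed
group in a real distributed setting, is not modeled here.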

There is some pretty complex hand waving associated with doing this,
and I would do it a disservice trying to capture it here since it was
not my idea. But that is the general problem with FailHandlers that we
need to find a way to work around.

Additionally, they wanted clarification that the FailHandler is
triggered with a globally consistent group of failed processes. So
this implies that the FailHandler is calling an agreement protocol
(like MPI_Comm_validate) behind the scenes to generate this level of
consistency. I would think that the MPI library would have to delay
reporting of additional process failures that may occur during the
FailHandler operation until after the FailHandler completes.

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
