[Mpi3-ft] MPI_Comm_drain_all operation

Tue Nov 1 09:13:42 CDT 2011

At the MPI Forum it was mentioned that once a process failure occurs
what an application may really want to do is cancel all
outstanding/inflight communication on a communicator regardless of if
the message is to an alive peer or not. This is because the
application is entering a recovery period and rolling back to some
previous state. The suggestion was to make this the required behavior
of the MPI implementation.

By forcing the cancelation of the outstanding/inflight messages, we
force the application into a rollback style of recovery. There are
other techniques that do not desire this type of behavior (e.g.,
manager/worker style computation, replication, other roll-forward
recovery techniques).

So I think that making this the default behavior of MPI will be overly
restrictive on the application, but it is a scenario that we can
better support. Currently if an application wants this type of
flushing behavior it must cancel/complete point-to-point messages
explicitly when a failure occurs. This can be cumbersome, and maybe we
can find a way to make it easier.

FT-MPI recognized the need for both styles of communication channel
handling. They supported two modes: CONTINUE which is basically what
we have specified so far, and RESET which cancels all inflight
messages. See Section 2.1.3 in [1] and Section 2.2 in [2] for further
explanation and examples of applications that prefer one over the
other.

I am proposing a new operation MPI_COMM_DRAIN_ALL.

MPI_COMM_DRAIN_ALL(comm)
 IN comm communicator to drain (handle)

MPI_COMM_DRAIN_ALL is a fault tolerant collective call that flushes
all point-to-point messages on a communicator, and cancels outstanding
operations that have not completed at the end of the call. This solves
the problem with canceling sends, since both sides are explicitly
participating in the operation and can agree upon completion.
Outstanding requests are either completed successfully, completed with
MPI_ERR_PROC_FAIL_STOP (if interacting with a failed process), or
canceled.

We could also think about combining MPI_COMM_DRAIN_ALL with the
MPI_COMM_VALIDATE through either an argument to the function or new
function call (MPI_COMM_VALIDATE_AND_DRAIN). This might be implemented
in similar way as the MPI_COMM_VALIDATE or using a technique like in
[3]. Or an implementation may choose to drop all messages that have
not been completed yet at either the request level and/or network
level.

I don't know if a 2-sided drain will be useful, but it might be.
MPI_COMM_DRAIN(comm, peer) which drains only the messages between the
caller and the specified peer. I see a strong usecase for
MPI_COMM_DRAIN_ALL, but wanted to mention this alternative in case
others have a good usecase for it.

There is a slight issue with unblocking outstanding communication when
a new failure happens. Consider:
Proc 0     |   Proc 1
-----------+-------------
Recv(1)    |
/******** Proc 2 fails ****/
           | Recv(2)
           | -> Error
           | MPI_Comm_drain_all()
Proc 0 is still waiting in Recv(1), and we need to interrupt it so
that it can join the MPI_Comm_drain_all() collective. I see two
solutions to this. I have a slight preference for the first since it
has the least number of questions associated with it.

First, we have the FailHandler which is triggered whenever a process
learns of a new failure. From within the FailHandler function we could
allow the user to use MPI_Comm_drain_all(). This would be the one
exception to the 'use only local operations' rule, and would allow the
user to complete the Recv(1) in the example above from inside the
FailHandler.

Secondly, we could generalize the MPI_ERR_ANY_SOURCE_DISABLED into a
general interrupt error condition (MPI_ERR_INTERRUPT). So similar to
how select() will return an EINTR when a signal occurs, so will MPI
when a process failure occurs. Users can then continue waiting on the
receive or cancel it - similar to what we currently specify for
MPI_ANY_SOURCE. There are issues with making this the default behavior
since if an application does not want this capability the interrupt
can distract their computation, and cause problems with matching
messages if they use blocking point-to-point operations.

What do you think?

-- Josh

[1] http://hpc.sagepub.com/cgi/content/abstract/19/4/465
[2] http://dl.acm.org/citation.cfm?id=1065944.1065973
[3] http://portal.acm.org/citation.cfm?id=214451.214456

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey