[Mpi3-ft] A few (more) notes from UTK about the RTS proposal

Josh Hursey jjhursey at open-mpi.org
Tue Dec 13 15:40:22 CST 2011


Darius and I met again with the folks at UTK to discuss the proposal.
It was another productive meeting, and below are the notes for that
meeting. As you might expect, there are a few more items for
discussion.

Some of my notes are fuzzy. Maybe Darius and the folks from UTK can
help fill them out.

-- Josh



17.6: MPI_ANY_SOURCE
-------------------------
Problem: We need to decide and commit to a solution.

There are three options before us to deal with MPI_ANY_SOURCE (we
currently specify 3.b):
1) Do nothing.
2) Complete all outstanding MPI_ANY_SOURCE receives in error
3) Complete blocking MPI_ANY_SOURCE receives in error, and return a
warning from nonblocking MPI_ANY_SOURCE receives leaving them in the
matching queue.

And two variations on (2) and (3):
a) The reporting of the error is an implicit acknowledgement of the
process failure. Therefore, once the error (or warning) is returned,
the user can post and wait on MPI_ANY_SOURCE receives without calling
a separate function.
b) After the error is reported, the MPI library does not permit the
user to post any new MPI_ANY_SOURCE operations until the user calls
MPI_Comm_reenable_any_source(); that is, explicit acknowledgement of
the error.

(1)
By doing nothing, the user must detect failures themselves and cancel
or complete the receive if they decide it should no longer remain
posted. The user can use a FailHandler as the detection mechanism if
they want, and then either Drain all messages or send themselves a
message to complete the outstanding receive.
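
The self-send variant can be written with only standard MPI calls. A
minimal sketch, assuming the failure itself is detected out of band
(e.g., by a FailHandler) and with the tag and communicator chosen
arbitrarily for illustration:

  /* Complete an outstanding MPI_ANY_SOURCE receive by matching it
   * ourselves once we learn the expected sender has failed. */
  MPI_Comm    comm = MPI_COMM_WORLD;
  MPI_Request req;
  MPI_Status  status;
  int         buf, dummy = -1, rank, tag = 42;

  MPI_Comm_rank(comm, &rank);
  MPI_Irecv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, tag, comm, &req);

  /* ... the application learns (out of band) that the sender failed
   *     and decides the receive should not stay posted ... */

  MPI_Send(&dummy, 1, MPI_INT, rank, tag, comm);  /* match it ourselves */
  MPI_Wait(&req, &status);
  if (status.MPI_SOURCE == rank) {
    /* Completed by our own message rather than by a real peer. */
  }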

(2)
In this solution all outstanding MPI_ANY_SOURCE receives are completed
in error. The problem with this solution was its impact on the
matching order. If the user has:
Proc 1              Proc 2
---------------------------
Irecv(ANY_SRC, ANY_TAG)
Irecv(2, ANY_TAG)
Waitall()
  ----- Process 3 fails -----
                    Send(dest=1, tag=1)
                    Send(dest=1, tag=2)
Then the messages could match incorrectly. The discussion was that
this is a bad program (though not an erroneous one), and that a user
who wants to manage process failure should simply avoid this
situation. One nice thing about this solution is that it fits the
[MPI_Recv = (MPI_Irecv + MPI_Wait)] semantic constraint in the
standard. The bad thing is that it perturbs the matching order. Note
that the second Irecv(2, ANY_TAG) is -not- removed from the matching
queue.
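
Written as a sketch on process 1 in standard MPI (comm is the
communicator from the example above); the comments describe what
option (2) would do to the matching:

  MPI_Request reqs[2];
  MPI_Status  stats[2];
  int a, b;

  MPI_Irecv(&a, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &reqs[0]);
  MPI_Irecv(&b, 1, MPI_INT, 2,              MPI_ANY_TAG, comm, &reqs[1]);
  MPI_Waitall(2, reqs, stats);

  /* Without a failure, process 2's tag=1 message matches reqs[0] and
   * its tag=2 message matches reqs[1]. Under option (2), if process 3
   * fails first, reqs[0] completes in error and leaves the matching
   * queue, so the tag=1 message matches reqs[1] instead and the tag=2
   * message is left unmatched -- a different matching than the program
   * intended. */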

(3)
In this solution we complete blocking MPI_ANY_SOURCE receives in
error, and return a warning for nonblocking MPI_ANY_SOURCE receives
and keep them in the matching queue. The nice thing about this
solution is that it does not mess with the matching order. The bad
thing is that it introduces a new concept (a warning) and breaks the
[MPI_Recv = (MPI_Irecv + MPI_Wait)] semantic constraint in the
standard.

(a)
With the implicit acknowledgement of the error, we do not need to have
a MPI_Comm_reenable_any_source() operation. This option would make
example in section 17.6.3 incorrect for a multithreaded application.
The application would have to either lock around the MPI_Recv to
prevent the posting or more that one MPI_Recv at a time, or use
MPI_Irecv operations.

(b)
With explicit acknowledgement of the error, the user is protected
until they can assess the group of failed processes and determine if
waiting is the correct thing to do (per the example in section
17.6.3). Note that if the user wants the behavior of (a) then they can
call the MPI_Comm_reenable_any_source() operation in the error
handler.
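
A very rough sketch of that last point, using a standard MPI error
handler to host the call; MPI_Comm_reenable_any_source is the proposed
operation, and its argument list here is my assumption rather than the
proposal text:

  /* Standard MPI error handler; re-enabling MPI_ANY_SOURCE here
   * recovers the implicit-acknowledgement behavior of option (a). */
  void reenable_on_failure(MPI_Comm *comm, int *errcode, ...)
  {
      MPI_Comm_reenable_any_source(*comm);   /* proposed call; args assumed */
  }

  MPI_Errhandler eh;
  MPI_Comm_create_errhandler(reenable_on_failure, &eh);
  MPI_Comm_set_errhandler(comm, eh);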


My inclination is to have either (2) or (3) with option (b).



17.6: MPI_COMM_REENABLE_ANY_SOURCE
----------------------------------
Problem: The name of the function "MPI_COMM_REENABLE_ANY_SOURCE" is
misleading since posted MPI_Irecv's are left in the matching queue.

Rename it to MPI_COMM_VALIDATE_ANY_SOURCE(), 'validate' indicating
that the user is confirming that the library, excluding this set of
failures, is acceptable and will meet their needs going forward if
they post a new MPI_ANY_SOURCE receive operation.

Per the mailing list, also clarify that if it is used over an
intercommunicator then it returns the process failures in the remote
group.



17.5.1: Process Failure Handlers
----------------------------------
Problem: Need consistent calling behavior to allow the user to use
collective operations in the handler without exposing them to
deadlock. Additionally, we need to make sure that the FailHandlers are
called the same number of times, and in the same order, at all
processes.


When reading the solutions below, keep in mind the following scenario:
You have 4 communicators each with their own FailHandlers registered:
A = {0, 1, 2, 3, 4}
B = {0, 1, 2, 3}
C = {0, 1, 2,    4}
D = {0, 1}

Processes 3 and 4 fail at nearly the same time. Because of how the
failure detector/notifier is implemented some processes find out about
the process failures in different orders.

Process 0 discovers that process 3 fails and enters the FailHandler stack: A->B
Process 1 discovers that process 4 fails and enters the FailHandler stack: A->C
Process 2 discovers that process 3 fails and enters the FailHandler stack: A->B

In the FailHandlers of B and C the process calls a collective
operation (e.g., MPI_Comm_validate or MPI_Comm_drain). At this point
the processes are deadlocked: processes 0 and 2 are blocked in B's
collective while process 1 is blocked in C's collective.
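
A hypothetical sketch of how this situation is set up.
MPI_Comm_set_failhandler and MPI_Comm_validate are names from the
proposal, but the callback shape and argument lists below are my
assumptions, not the proposal text:

  /* Assumed FailHandler callback shape; the proposal's may differ. */
  void failhandler_B(MPI_Comm comm)
  {
      MPI_Comm_validate(comm);   /* collective over B = {0,1,2,3}; args assumed */
  }
  void failhandler_C(MPI_Comm comm)
  {
      MPI_Comm_validate(comm);   /* collective over C = {0,1,2,4}; args assumed */
  }
  ...
  MPI_Comm_set_failhandler(commB, failhandler_B);   /* proposed call; args assumed */
  MPI_Comm_set_failhandler(commC, failhandler_C);

  /* If processes 0 and 2 enter failhandler_B (failure of 3) while
   * process 1 enters failhandler_C (failure of 4), process 1 never
   * joins B's collective and processes 0 and 2 never join C's
   * collective: deadlock. */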


We came up with two solutions to the FailHandler problem:

Both solutions have the following clause:
No FailHandlers can be called during a call to a FailHandler. However,
if the consistent failure set is not complete, FailHandlers will be
called after completion of the current FailHandler. Each FailHandler
will be called the same number of times. FailHandlers are -not-
inherited during communicator creation.

(1)
FailHandler is called on -all- communicators with FailHandlers
registered in a consistent order at all processes. No globally
consistent failure information is known before the end of
MPI_Comm_validate(), if the user chooses to call it.

In the scenario above without using MPI_Comm_validate in the FailHandler:
Process 0 discovers that process 3 fails and enters the FailHandler
stack: A->B->C->D
Process 1 discovers that process 4 fails and enters the FailHandler
stack: A->B->C->D
Process 2 discovers that process 3 fails and enters the FailHandler
stack: A->B->C->D
 then
Process 0 discovers that process 4 fails and enters the FailHandler
stack: A->B->C->D
Process 1 discovers that process 3 fails and enters the FailHandler
stack: A->B->C->D
Process 2 discovers that process 4 fails and enters the FailHandler
stack: A->B->C->D

It is possible that some processes find out about both failures at
the same time, so the following is also possible:
Process 0 discovers that process 3,4 fails and enters the FailHandler
stack: A->B->C->D
Process 1 discovers that process 4 fails and enters the FailHandler
stack: A->B->C->D
Process 2 discovers that process 3 fails and enters the FailHandler
stack: A->B->C->D
 then
Process 1 discovers that process 3 fails and enters the FailHandler
stack: A->B->C->D
Process 2 discovers that process 4 fails and enters the FailHandler
stack: A->B->C->D

In the scenario at the top, with MPI_Comm_validate used in the FailHandler:
Process 0 discovers that process 3 fails and enters the FailHandler
stack: A->B->C->D
  Process 0 discovers the failure of 4 during the MPI_Comm_validate()
Process 1 discovers that process 4 fails and enters the FailHandler
stack: A->B->C->D
  Process 1 discovers the failure of 3 during the MPI_Comm_validate()
Process 2 discovers that process 3 fails and enters the FailHandler
stack: A->B->C->D
  Process 2 discovers the failure of 4 during the MPI_Comm_validate()

So the user is in control of synchronizing the failure set, if they
need it to be synchronized. All FailHandlers must be called in order
to avoid the deadlock condition described in the example at the top.

So even though communicator D contains no failed processes, its
FailHandler is called. This is not always a bad thing: if, for
example, a failure elsewhere makes the communication on D unnecessary,
then D may want notification of process failures outside of its
communication context so that it can cancel the communication.


(2)
The FailHandler is called only on those communicators that have
FailHandlers registered and that include the failed process, in a
consistent order at all processes. The FailHandler stack is called
over the same globally consistent set of process failures known at
the time of the call.

So the MPI library is effectively calling MPI_Comm_validate internally
to determine the set of failures and the subset of FailHandlers to
call, which forces a synchronization at every process failure.

In the scenario above the following would occur:
Process 0 discovers that process 3,4 fails and enters the FailHandler
stack: A->B->C
Process 1 discovers that process 3,4 fails and enters the FailHandler
stack: A->B->C
Process 2 discovers that process 3,4 fails and enters the FailHandler
stack: A->B->C


(1) is what was agreed upon in the meeting on Monday since it provides
the maximum flexibility to the user. The downside, if you see it that
way, is that -all- communicators are notified.



17.8.2: Communicator Management
----------------------------------
Problem: Communicator creation operations should be synchronizing in
the presence of process failure. Additionally, should the communicator
creation operation fail if some process fails during the operation?

(Sorry if this is a bit muddled, my notes are fuzzy here)

Consider the scenario below:
for(i = 0; i < 10; ++i) {
  MPI_Comm_dup(comm[i], &comm[i+1]);
  MPI_Comm_set_failhandler(comm[i+1], ...);  /* proposed call; arguments elided */

  MPI_Send(..., comm[i+1]);                  /* communication on the new communicator */
  ...
}

If a process failure happens, some processes may be in MPI_Comm_dup()
in iteration (i=2) while others are still in MPI_Send() in iteration
(i=1). The FailHandler is therefore triggered for some processes while
they are in MPI_Comm_dup() and for others while they are in
MPI_Send(). In the FailHandler, the user decides to break out of the
loop. So the user wants the MPI_Comm_dup() to fail, and the processes
in MPI_Send() to not have to call MPI_Comm_dup() themselves just to
unblock it.

So MPI_Comm_dup() should not work around emerging process failures,
but return an error when it first encounters a new failure. All
processes must synchronize to decide to return the error consistently.
So the processes that decided, while in MPI_Send(), that the operation
should fail are still required to call MPI_Comm_dup() to 'match' the
communicator creation collective at the other processes so that it can
decide on the error. (We called this 'going forward to go backward' in
the meeting.) What the user wants is for the MPI_Comm_dup() to cancel
without having to be called from the processes that are in MPI_Send().
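
A sketch of what the user would like to be able to write, using
standard error-return checking (MPI_ERRORS_RETURN assumed to be in
effect on every comm[i]); whether MPI_Comm_dup reports an emerging
failure this way, and whether it also unblocks the processes still in
MPI_Send() without their calling MPI_Comm_dup(), is exactly the open
question here:

  for (i = 0; i < 10; ++i) {
      if (MPI_Comm_dup(comm[i], &comm[i+1]) != MPI_SUCCESS)
          break;                                   /* stop building the chain */
      MPI_Comm_set_errhandler(comm[i+1], MPI_ERRORS_RETURN);
      /* register the FailHandler and communicate on comm[i+1] as before */
  }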

To further complicate things, consider the case where MPI_Comm_dup()
is not synchronizing, for example because it chooses the cid from a
pre-decided pool of cids. Then it is possible that some processes see
the failure in iterations that are more than one step apart. If we
require MPI_Comm_dup() to synchronize, then we cannot have this type
of optimization.


17.6: MPI_COMM_DRAIN
----------------------------------
I forget what the issue was here. I have the note that 'drain should
be a collective local operation'. But that was before the FailHandler
discussion during the meeting. Maybe the UTK folks can fill in here.


-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey


