[Mpi3-ft] Communicator Creation Behavior

Thu Dec 15 16:09:24 CST 2011

For the past week or so we have been struggling with the question of
what behavior we want from a communicator creation operation in the
presence of failures.

Consider the following code sketch:
------------------------
MPI_Comm_dup(MPI_COMM_WORLD, &comm[0]);
for(i=1; i < max; ++i) {
  /* Duplicate the previous communicator into comm[i] */
  r = MPI_Comm_dup(comm[i-1], &comm[i]);
  if( r != MPI_SUCCESS ) {
    MPI_Comm_validate(comm[i-1]);
    break;
  }

  /* Use the new communicator right away */
  r = MPI_Allreduce(..., comm[i]);
  if( r != MPI_SUCCESS ) {
    MPI_Comm_validate(comm[i]);
    break;
  }
}
------------------------
Is this an incorrect sketch because of the 'breaks'? (Are the
validates necessary?)

Assume for the moment that MPI_Comm_dup() is collective -and-
synchronizing. This allows the communicator creation operation to
guarantee that the communicator is either created everywhere or
nowhere.

If a process failure occurs, some processes may be in the
MPI_Comm_dup() at iteration X while others are still in the
MPI_Allreduce() at iteration (X-1). The Allreduce will complete with
an error on the processes still inside it, since the communicator
becomes collectively inactive, but some processes (like the leader)
could have left the Allreduce early, after the decision was made, and
already moved on to the next Dup.

So the process stack looks like:
P0                        P1
---------------        ----------------
Dup(comm[X-1])         Dup(comm[X-1])
MPI_Allreduce()        MPI_Allreduce()
Dup(comm[X])           -> Error
 -> Error

So should P1 be required to call Dup(comm[X])?

Option (No): Behave like a collective
 - The operation will fail (similar to the behavior of collectives),
and continue to fail until the next MPI_Comm_validate() on the input
communicator.
 - This means that communicator creation operations are required to
have as input a collectively active communicator.
 - If a process failure occurs during the operation, then the
operation will complete in error.
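
To make the Option (No) rule concrete, here is a small single-process
C model (no MPI involved). Everything in it is hypothetical: the
sim_comm type, the sim_* functions and the error code are names made
up for illustration, and a process failure is modeled as simply
clearing a flag. It only shows the rule that creation fails on a
collectively inactive input communicator and keeps failing until a
validate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for this model (not MPI constants). */
enum { SIM_SUCCESS = 0, SIM_ERR_INACTIVE = 1 };

/* Hypothetical stand-in for a communicator: we only track whether
 * it is collectively active. */
typedef struct {
    bool collectively_active;
} sim_comm;

/* Model a process failure: the communicator becomes collectively
 * inactive. */
void sim_process_failure(sim_comm *c)
{
    assert(c != NULL);
    c->collectively_active = false;
}

/* Model the validate call: the communicator becomes collectively
 * active again. */
void sim_comm_validate(sim_comm *c)
{
    assert(c != NULL);
    c->collectively_active = true;
}

/* Model Option (No): creation requires a collectively active input
 * communicator and otherwise fails without creating anything. */
int sim_comm_dup(const sim_comm *in, sim_comm *out)
{
    assert(in != NULL && out != NULL);
    if (!in->collectively_active) {
        return SIM_ERR_INACTIVE;
    }
    out->collectively_active = true;
    return SIM_SUCCESS;
}
```

In this model, once sim_process_failure() has been called, every
sim_comm_dup() on that communicator returns SIM_ERR_INACTIVE until
sim_comm_validate() is called, which is the pattern the loop sketch
above relies on when it validates and breaks.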

Option (Yes): Behave like MPI_Comm_validate()
 - The operation will block until all processes have called it
(similar to MPI_Comm_validate() collective).
 - The input communicator need not be collectively active, since we
know all alive processes will eventually call it.
 - So if your application's intention is to jump out of the loop, P1
must first 'match' the Dup(comm[X]) even if it does not intend to use
the communicator.
 - In one variation of this: If an implementation returns an error
from Dup(comm[X]) in P0 (instead of blocking), then (after the
validate) it must return an error to the matching Dup(comm[X]) in P1.
As George mentioned on the call, an implementation could do this by
keeping a counter of the number of calls to this creation function.
If a process's counter does not match that of the other processes in
the communicator, then return an error until it does. This loosens
the blocking restriction a bit.
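
A rough single-process C model of that counter idea follows (no MPI
involved). This is only my reading of the suggestion, not a worked-out
design: sim_proc, sim_agreed, sim_dup_attempt() and the error code are
hypothetical names, and the model assumes the highest failed call
number somehow becomes known to every process (for example during the
validate), which is simply stored in a shared struct here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes for this model (not MPI constants). */
enum { SIM_SUCCESS = 0, SIM_ERR_FAILED = 1 };

/* Per-process state: number of creation calls this process has made
 * on the communicator. */
typedef struct {
    int creation_calls;
} sim_proc;

/* State assumed to be agreed on by all alive processes: the highest
 * creation call number that returned an error on some process. */
typedef struct {
    int last_failed_call;
} sim_agreed;

/* Model one creation call. failure_seen != 0 means this process
 * observed the failure during this call and returns an error instead
 * of blocking. A process whose counter is still at or below the last
 * failed call number is "matching" a failed call, so it also gets an
 * error; once its counter passes that number, calls succeed. */
int sim_dup_attempt(sim_proc *p, sim_agreed *a, int failure_seen)
{
    assert(p != NULL && a != NULL);
    p->creation_calls++;
    if (failure_seen) {
        if (p->creation_calls > a->last_failed_call) {
            a->last_failed_call = p->creation_calls;
        }
        return SIM_ERR_FAILED;
    }
    if (p->creation_calls <= a->last_failed_call) {
        return SIM_ERR_FAILED;
    }
    return SIM_SUCCESS;
}
```

Mapped onto the scenario above: P0 gets an error from its call number
X, P1 (which skipped that call) gets an error from its own call number
X after the validate, and from call X+1 on both counters agree and
the calls can succeed again.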

I have a slight preference for Option (No). I think it better fits how
a developer would expect to react to a failure condition, and it is
consistent with our behavior for other collective operations. However,
it is not without its complexities.

What do folks think?

-- Josh

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey