[Mpi3-ft] Communicator Creation Behavior

Fri Dec 16 08:13:53 CST 2011

The example in my previous message (quoted below) is slightly
incorrect. The MPI_Comm_validate() calls should be removed, since a
failure in the last iteration would make matching them difficult.

So consider the following code sketch instead:
------------------------
MPI_Comm_dup(MPI_COMM_WORLD, &comm[0]);
for(i=1; i < max; ++i) {
 r = MPI_Comm_dup(comm[i-1], &comm[i]);
 if( r != MPI_SUCCESS ) {
   break;
 }

 r = MPI_Allreduce(..., comm[i], ...);
 if( r != MPI_SUCCESS ) {
   break;
 }
}
------------------------

After thinking about this a bit more last night, I would be OK with
'Option (Yes)', where the user is required to call the communicator
creation operation so that it matches across a new process failure. It
would then behave like MPI_Comm_validate() with regard to matching. It
still seems a bit unnatural, since the user is forced to do something
they may not want to do because the library is not going to help by
providing a consistent cut.
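
To make the matching requirement concrete, here is a toy, MPI-free C
model of the counter variation George suggested (described in the
quoted message below). Everything in it (sim_comm, sim_comm_create(),
the saw_failure flag) is hypothetical bookkeeping of mine, not part of
MPI or of the proposal, and it does not model blocking. It only shows
how matching the k-th creation call by call count lets an error seen by
P0 also be returned to P1's matching call:

```c
#include <assert.h>

#define NPROCS    2   /* P0 and P1 from the quoted scenario */
#define MAX_CALLS 8   /* capacity of this toy model only */

enum { SIM_UNKNOWN = -1, SIM_SUCCESS = 0, SIM_ERR_FAILED = 1 };

/* Hypothetical per-communicator bookkeeping; not part of any MPI API. */
typedef struct {
    int calls[NPROCS];       /* creation calls issued so far by each rank */
    int outcome[MAX_CALLS];  /* result already returned for the k-th call */
} sim_comm;

void sim_comm_init(sim_comm *c)
{
    for (int p = 0; p < NPROCS; ++p) c->calls[p] = 0;
    for (int k = 0; k < MAX_CALLS; ++k) c->outcome[k] = SIM_UNKNOWN;
}

/* Model of one creation call by `rank`. The first rank to issue call
 * number k fixes its result (an error if it observed a failure). Any
 * rank issuing call k later gets that same result, so an error seen by
 * P0 is also returned to P1's matching call. */
int sim_comm_create(sim_comm *c, int rank, int saw_failure)
{
    assert(rank >= 0 && rank < NPROCS);
    int k = c->calls[rank]++;
    assert(k < MAX_CALLS);
    if (c->outcome[k] == SIM_UNKNOWN)
        c->outcome[k] = saw_failure ? SIM_ERR_FAILED : SIM_SUCCESS;
    return c->outcome[k];
}

/* Returns 1 when every rank has issued the same number of creation calls. */
int sim_counters_match(const sim_comm *c)
{
    for (int p = 1; p < NPROCS; ++p)
        if (c->calls[p] != c->calls[0]) return 0;
    return 1;
}
```

Replaying the scenario: sim_comm_create(&c, 0, 1) for P0 returns
SIM_ERR_FAILED and leaves the counters mismatched. P1's first call then
returns SIM_ERR_FAILED as well, even with saw_failure == 0, and only
after that call do the counters match again.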

One side question to consider: if we apply the synchronizing rule to
MPI_Win_create(), does that hurt the semantics desired for those
operations? I know we have discussed this in the past.

-- Josh

On Thu, Dec 15, 2011 at 5:09 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
> For the past week or so we have been struggling with a question of
> desired behavior in the presence of failures for a communicator
> creation operation.
>
> Consider the following code sketch:
> ------------------------
> MPI_Comm_dup(MPI_COMM_WORLD, &comm[0]);
> for(i=1; i < max; ++i) {
>  r = MPI_Comm_dup(comm[i-1], &comm[i]);
>  if( r != MPI_SUCCESS ) {
>    MPI_Comm_validate(comm[i-1]);
>    break;
>  }
>
>  r = MPI_Allreduce(..., comm[i], ...);
>  if( r != MPI_SUCCESS ) {
>    MPI_Comm_validate(comm[i]);
>    break;
>  }
> }
> ------------------------
> Is this an incorrect sketch because of the 'breaks'? (Are the
> validates necessary?)
>
>
> Assume for the moment that MPI_Comm_dup() is collective -and-
> synchronizing. This allows the communicator creation operation to
> guarantee that the communicator is either created everywhere or
> nowhere.
>
> If a process failure occurs, some processes may be in the
> MPI_Comm_dup() at iteration X while others are still in the
> MPI_Allreduce() at iteration (X-1). The Allreduce will complete with
> an error there since the communicator becomes collectively inactive,
> but some processes (like the leader) could have left it early, after
> the decision was made.
>
> So the process stack looks like:
> P0                        P1
> ---------------        ----------------
> Dup(comm[X-1])         Dup(comm[X-1])
> MPI_Allreduce()        MPI_Allreduce()
> Dup(comm[X])           -> Error
>  -> Error
>
> So should P1 be required to call Dup(comm[X])?
>
>
> Option (No): Behave like a collective
>  - The operation will fail (similar to the behavior of collectives),
> and continue to fail until the next MPI_Comm_validate() on the input
> communicator.
>  - This means that communicator creation operations are required to
> have as input a collectively active communicator.
>  - If a process failure occurs during the operation, then the
> operation will complete in error.
>
> Option (Yes): Behave like MPI_Comm_validate()
>  - The operation will block until all processes have called it
> (similar to MPI_Comm_validate() collective).
>  - The input communicator need not be collectively active, since we
> know all alive processes will eventually call it.
>  - So if your application's intention is to jump out of the loop, P1
> must first 'match' the Dup(comm[X]) even if it does not intend to use
> the communicator.
>  - In one variation of this: If an implementation returns an error
> from Dup(comm[X]) in P0 (instead of blocking), then (after the
> validate) it must return an error to the matching Dup(comm[X]) in P1.
> As George mentioned on the call, an implementation could do this by
> associating a counter with the number of calls to this creation
> function. If the counter does not match the other processes in the
> communicator then return an error until it does. This loosens the
> blocking restriction a bit.
>
>
> I have a slight preference for Option (No). I think it better fits how
> a developer would expect to react to a failure condition, and it is
> consistent with our behavior for other collective operations. However,
> it is not without its complexities.
>
> What do folks think?
>
> -- Josh
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
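
For comparison, Option (No) from the quoted message above can be
pictured as a small state machine. This is only a sketch under my own
assumptions: ca_comm, ca_notice_failure(), ca_validate() and
ca_comm_dup() are hypothetical stand-ins, not MPI calls, and the
validate stand-in just flips a flag rather than doing any agreement. It
shows the one property that matters here: a creation call needs a
collectively active input and keeps failing until that input has been
validated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CA_OK = 0, CA_ERR = 1 };

/* Hypothetical stand-in for a communicator; it only tracks whether the
 * communicator is collectively active. Not an MPI type. */
typedef struct {
    bool collectively_active;
} ca_comm;

/* A process failure makes the communicator collectively inactive. */
void ca_notice_failure(ca_comm *c)
{
    assert(c != NULL);
    c->collectively_active = false;
}

/* Stand-in for the proposed MPI_Comm_validate(): re-enables collectives. */
void ca_validate(ca_comm *c)
{
    assert(c != NULL);
    c->collectively_active = true;
}

/* Stand-in for a creation call under Option (No): it fails on a
 * collectively inactive input and leaves `out` untouched in that case. */
int ca_comm_dup(const ca_comm *in, ca_comm *out)
{
    assert(in != NULL && out != NULL);
    if (!in->collectively_active)
        return CA_ERR;
    out->collectively_active = true;
    return CA_OK;
}
```

In this model, repeating ca_comm_dup() after ca_notice_failure() keeps
returning CA_ERR, and the first call after ca_validate() on the input
returns CA_OK again.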
