[Mpi3-ft] Matching MPI communication object creation across process failure

Josh Hursey jjhursey at open-mpi.org
Wed Feb 1 16:26:07 CST 2012

Some notes regarding this thread from the teleconf:

There are three components to the discussion of how communicator creation
operations should behave:
 - (1) Uniform creation of the object (created everywhere or nowhere)
 - (2) Requires the input communicator to be collectively active
 - (3) Matches across emerging process failure

(1) seems to be something we all agree on.

(2) we discussed a bit more on the call. Consider an MPI_Comm_split
operation where the application wants to divide the communicator in half.
 - (A) If the input communicator is required to be collectively active, then
MPI_Comm_split will return an error if a process fails (making the input
communicator collectively inactive) and the object cannot be uniformly
created. So MPI_Comm_split does not 'work around' emerging failure, but
errors out at the first sign of failure before the 'decision point' in
the protocol. The 'decision point' is the point in time at which the
communicator is designated as created.
 - (B) If the input communicator is -not- required to be collectively active,
then MPI_Comm_split must be able to work around existing and emerging
failure to agree upon the group membership of the output communicator(s).
The resulting communicators may be of unintended size since
MPI_Comm_split is working around the errors. So the user is forced to call
MPI_Comm_validate after the MPI_Comm_split on the input communicator to
make sure that all processes exited MPI_Comm_split with acceptable
communicators (if such a distinction is important).

So in (A) the following program is correct:

if( comm_rank < (comm_size/2) ) {
  color = 0;
} else {
  color = 1;
}

do {
  ret = MPI_Comm_split( comm, color, key, &new_comm );
  if( ret == MPI_ERR_PROC_FAIL_STOP ) {
    /* Re-enable collectives over this communicator */
    MPI_Comm_validate( comm, &failed_grp );
    MPI_Group_free( &failed_grp );
    /* Adjust colors to account for failed processes */
  }
  /* If there was a process failure, then try again */
} while( ret == MPI_ERR_PROC_FAIL_STOP );
In this example MPI_Comm_split will either return successfully with the
exact groups asked for, or will return an error if a new process failure is
encountered before the communicators are created. One nice thing to point
out is that MPI_Comm_validate() is only called in the error path, and not
in the failure free path.

In (B) it is less likely that MPI_Comm_split will return in error, since it
is working around existing and emerging process failure. Because it works
around emerging failure, it is possible that the resulting communicators
are of a size that is not acceptable to the application. So the application
will need to do some collective operation, and call MPI_Comm_validate to
ensure that the collective completed everywhere, before deciding to use the
new communicators.
do {
  ret = MPI_Comm_split( comm, color, key, &new_comm );
} while( ret == MPI_ERR_PROC_FAIL_STOP );
/* Collective operation to check if communicators are of acceptable size */
/* MPI_Comm_validate() to make sure the collective above completed
   successfully everywhere */
/* If acceptable, then continue */
/* If unacceptable, then destroy them and call MPI_Comm_split again with
   new colors */
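Put together, the (B) usage pattern looks roughly like the pseudocode
below. Note that MPI_Comm_validate and MPI_ERR_PROC_FAIL_STOP are from the
proposal, not from any released MPI, and the helpers check_acceptable and
choose_new_color are hypothetical, application-defined stand-ins:

```
do {
    do {
        ret = MPI_Comm_split( comm, color, key, &new_comm );
    } while( ret == MPI_ERR_PROC_FAIL_STOP );

    /* Collective check, e.g., an MPI_Allreduce over new_comm of a local
     * "size is acceptable" flag -- application defined */
    ok = check_acceptable( new_comm );

    /* Make sure the collective above completed everywhere */
    MPI_Comm_validate( comm, &failed_grp );
    MPI_Group_free( &failed_grp );

    if( !ok ) {
        /* Unacceptable: destroy and retry with new colors */
        MPI_Comm_free( &new_comm );
        color = choose_new_color( comm );
    }
} while( !ok );
```

The point to notice is that MPI_Comm_validate sits on the failure-free
path here, unlike in the (A) example where it appears only on the error
path.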

So (B) pushes the user to call MPI_Comm_validate in the failure-free code
path, and to do some additional checking. Option (A) keeps
MPI_Comm_validate in the failure/error path.

So it seems that requiring the input communicator to be collectively active
(option A) is somewhat better.

Regarding matching across failure (3), this seems to be implied by (1),
since the uniformity of the return is provided by an agreement protocol.
The question is: how difficult would it be to not require matching across
failure (3) if we know that the creation call is allowed to exit in error
when it encounters an error before agreement (Option 2.A above)? I am
fairly sure that we can implement this safely, but I would like to dwell on
it a bit more to be sure.

What do others think about these points?

-- Josh

On Tue, Jan 31, 2012 at 12:35 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:

> Actually the following thread might be more useful for this discussion:
>   http://lists.mpi-forum.org/mpi3-ft/2011/12/0940.php
> The example did not come out well in the archives, so below is the diagram
> again (hopefully that will work):
> So the process stack looks like:
> P0                        P1
> ---------------        ----------------
> Dup(comm[X-1])         Dup(comm[X-1])
> MPI_Allreduce()        MPI_Allreduce()
> Dup(comm[X])           -> Error
>  -> Error
> So should P1 be required to call Dup(comm[X])?
> -- Josh
> On Wed, Jan 25, 2012 at 5:08 PM, Josh Hursey <jjhursey at open-mpi.org>wrote:
>> The current proposal states the following about MPI object creation
>> functions (e.g., MPI_Comm_create, MPI_Win_create, MPI_File_open):
>> -------------------
>> All participating communicator(s) must be collectively active before
>> calling any communicator creation operation. Otherwise, the communicator
>> creation operation will uniformly raise an error code of the class
>> MPI_ERR_PROC_FAIL_STOP.
>> If a process failure prevents the uniform creation of the communicator
>> then the communicator construction operation must ensure that the
>> communicator is not created, and all alive participating processes will
>> raise an error code of the class MPI_ERR_PROC_FAIL_STOP. Communicator
>> construction operations will match across the notification of a process
>> failure. As such, all alive processes must call the communicator
>> construction operations the same number of times regardless of whether the
>> emergent process failure makes the call irrelevant to the application.
>> -------------------
>> So there are three points here:
>>  (1) That the communicator must be 'collectively active' before calling
>> the operation,
>>  (2) Uniform creation of the communication object, and
>>  (3) Creation operations match across process failure.
>> Point (2) seems to be necessary so that all processes only ever see a
>> communication object that is consistent across all processes. This implies
>> a fault tolerant agreement protocol (group membership).
>> There was a question about why point (3) is necessary. We (Darius, UTK,
>> and I) discussed this on 12/12/2011, and I posted my notes on this to the
>> list:
>>   http://lists.mpi-forum.org/mpi3-ft/2011/12/0938.php
>> Looking back at my written notes, they don't have much more than that to
>> add to the discussion.
>> So the problem with (3) seemed to arise out of the failure handlers,
>> though I am not convinced that they are strictly to blame in this
>> circumstance. It seems that the agreement protocol might be factoring into
>> the discussion as well, since it is strongly synchronizing: if not all
>> processes call the operation, how does it know when to bail out? The peer
>> processes are (a) calling that operation, (b) going to call it but have not
>> yet, or (c) will never call it because they decided independently not to,
>> based on a locally reported process failure.
>> It seems that the core problem has to do with when to break out of the
>> collective creation operation, and when to restore matching.
>> So should we reconsider the restriction on (3)? More to the point, is it
>> safe to not require (3)?
>> -- Josh
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
