[Mpi3-ft] Matching MPI communication object creation across process failure

Sur, Sayantan sayantan.sur at intel.com
Thu Feb 2 16:28:22 CST 2012

Hi Josh,

I agree on (1).

Regarding (2), I'm wondering if another case C is feasible. What do you think?

(C) If the input communicator is not required to be collectively active: MPI_Comm_split returns an error everywhere (at all live processes) if it discovers any existing or emerging failure. That is, it does not try to 'work around' either type of failure, existing or emerging.

In other words, MPI_Comm_split automatically returns "valid" communicators if the original communicator had no failures; in case of failure, it simply returns an error. Then MPI_Comm_validate() needs to be called only in the error path, not in the error-free path.


Sayantan Sur, Ph.D.
Intel Corp.

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
Sent: Wednesday, February 01, 2012 2:26 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Matching MPI communication object creation across process failure

Some notes regarding this thread from the teleconf:

There are three components to the discussion of how communicator creation operations should behave:
 - (1) Uniform creation of the object (created everywhere or nowhere)
 - (2) Requires the input communicator to be collectively active
 - (3) Matches across emerging process failure

(1) seems to be something we all agree on.

(2) we discussed a bit more on the call. Consider an MPI_Comm_split operation where the application wants to divide the communicator in half.
 - (A) If the input communicator is required to be collectively active, then MPI_Comm_split will return an error if a process fails (making the input communicator collectively inactive) and the object cannot be uniformly created. So MPI_Comm_split does not 'work around' emerging failure, but errors out at the first sign of failure before the 'decision point' in the protocol. The 'decision point' is the point in time at which the communicator is designated as created.
 - (B) If the input communicator is -not- required to be collectively active, then MPI_Comm_split must be able to work around existing and emerging failure to agree upon the group membership of the output communicator(s). The resulting communicators may be of unintended size since MPI_Comm_split is working around the errors. So the user is forced to call MPI_Comm_validate on the input communicator after the MPI_Comm_split to make sure that all processes exited the MPI_Comm_split with acceptable communicators (if such a distinction is important).

So in (A) the following program is correct:

if( comm_rank < (comm_size/2) ) {
  color = 0;
} else {
  color = 1;
}

do {
  ret = MPI_Comm_split( comm, color, key, &new_comm );
  if( ret == MPI_ERR_PROC_FAIL_STOP ) {
    /* Re-enable collectives over this communicator */
    MPI_Comm_validate( comm, &failed_grp );
    MPI_Group_free( &failed_grp );
    /* Adjust colors to account for failed processes */
  }
  /* If there was a process failure, then try again */
} while( ret == MPI_ERR_PROC_FAIL_STOP );
In this example MPI_Comm_split will either return successfully with the exact groups asked for, or will return an error if a new process failure is encountered before the communicators are created. One nice thing to point out is that MPI_Comm_validate() is only called in the error path, and not in the failure free path.

In (B) it is less likely that MPI_Comm_split will return in error, since it is working around existing and emerging process failure. Because it works around emerging failure, it is possible that the resulting communicators are of a size that is not acceptable to the application. So the application will need to do some collective operation, and call MPI_Comm_validate to ensure that the collective completed everywhere, before deciding to use the new communicators:

do {
  ret = MPI_Comm_split( comm, color, key, &new_comm );
} while( ret == MPI_ERR_PROC_FAIL_STOP );
/* Collective operation to check if the communicators are of acceptable size */
/* MPI_Comm_validate() to make sure the collective above completed successfully everywhere */
/* If acceptable, then continue */
/* If unacceptable, then destroy them and call MPI_Comm_split again with new colors */

So (B) pushes the user to call MPI_Comm_validate in the failure-free code path, and to do some additional checking. Option (A) keeps MPI_Comm_validate in the error path.

So it seems that requiring the input communicator to be collectively active (option A) is somewhat better.

Regarding matching across failure (3), this seems to be implied by (1), since the uniformity of the return is provided by an agreement protocol. The question is how difficult it would be to not require matching across failure (3) if we know that the creation call is allowed to exit in error when it encounters a failure before agreement (Option 2.A above). I am fairly sure that we can implement this safely, but I would like to dwell on it a bit more to be sure.

What do others think about these points?

-- Josh

On Tue, Jan 31, 2012 at 12:35 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
Actually the following thread might be more useful for this discussion:

The example did not come out well in the archives, so below is the diagram again (hopefully that will work):
So the process stack looks like:
P0                        P1
---------------        ----------------
Dup(comm[X-1])         Dup(comm[X-1])
MPI_Allreduce()        MPI_Allreduce()
Dup(comm[X])           -> Error
 -> Error

So should P1 be required to call Dup(comm[X])?

-- Josh

On Wed, Jan 25, 2012 at 5:08 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
The current proposal states that MPI object creation functions (e.g., MPI_Comm_create, MPI_Win_create, MPI_File_open):
All participating communicator(s) must be collectively active before calling any communicator creation operation. Otherwise, the communicator creation operation will uniformly raise an error code of the class MPI_ERR_PROC_FAIL_STOP.

If a process failure prevents the uniform creation of the communicator then the communicator construction operation must ensure that the communicator is not created, and all alive participating processes will raise an error code of the class MPI_ERR_PROC_FAIL_STOP. Communicator construction operations will match across the notification of a process failure. As such, all alive processes must call the communicator construction operations the same number of times regardless of whether the emergent process failure makes the call irrelevant to the application.

So there are three points here:
 (1) That the communicator must be 'collectively active' before calling the operation,
 (2) Uniform creation of the communication object, and
 (3) Creation operations match across process failure.

Point (2) seems to be necessary so that all processes only ever see a communication object that is consistent across all processes. This implies a fault tolerant agreement protocol (group membership).

There was a question about why point (3) is necessary. We (Darius, UTK, and I) discussed this on 12/12/2011, and I posted my notes on this to the list:
Looking back at my written notes, they don't have much more than that to add to the discussion.

So the problem with (3) seemed to arise out of the failure handlers, though I am not convinced that they are strictly to blame in this circumstance. It seems that the agreement protocol might be factoring into the discussion as well, since it is strongly synchronizing: if not all processes call the operation, how does it know when to bail out? The peer processes are (a) calling that operation, (b) going to call it but have not yet, or (c) will never call it because they decided independently not to, based on a locally reported process failure.

It seems that the core problem has to do with when to break out of the collective creation operation, and when to restore matching.

So should we reconsider the restriction in (3)? More to the point, is it safe to not require (3)?

-- Josh

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
