[Mpi3-ft] Matching MPI communication object creation across process failure

Josh Hursey jjhursey at open-mpi.org
Fri Feb 3 08:04:13 CST 2012


Yeah, I think that is more of a clarification on (A) - if I understand your
email correctly.
- If the input communicator is -not- collectively active
  - Communicator creation operations will return an error at all alive
calling processes
  - So the creation operations never work around emerging process failures
- If the input communicator is collectively active
  - Communicator creation operations will return successfully, unless
  - There is an emerging 'new' failure that makes the communicator
collectively inactive, in which case the operation will either
    - Return successfully everywhere if it has passed the decision point in
the protocol (at least one alive process has already returned 'success')
    - Return an error everywhere if it is before the decision point in the
protocol. Then the decision is 'return an error' and everyone will decide
to return an error.
- Once the communicator is collectively inactive, subsequent communicator
creation operations will immediately return an error
  - MPI_Comm_validate is used to re-enable the collectives, including
communicator creation operations.
  - When MPI_Comm_validate switches a communicator from collectively
inactive to collectively active, it creates a 'cut' in the message
matching. Meaning that collective operations (including communicator
creation operations) initiated before the MPI_Comm_validate (under these
circumstances) do not need to match across the logical 'cut'. So the user
does -not- need to post a matching MPI_Bcast or MPI_Comm_dup after the
MPI_Comm_validate if a new failure caused the cut to occur (see the sketch
below). -- But this gets into more of the issues of (3).
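
To make that 'cut' concrete, here is a minimal sketch (using the proposed
MPI_Comm_validate / MPI_ERR_PROC_FAIL_STOP interfaces; declarations of comm,
new_comm, failed_grp, and ret are assumed):
-------------------------
ret = MPI_Comm_dup( comm, &new_comm );
if( ret == MPI_ERR_PROC_FAIL_STOP ) {
  /* A new failure made 'comm' collectively inactive, so the dup was not
     created anywhere. Re-enable collectives over 'comm'. */
  MPI_Comm_validate( comm, &failed_grp );
  MPI_Group_free( &failed_grp );
  /* The validate 'cuts' the matching, so no extra MPI_Comm_dup is needed
     just to match the aborted one -- we only call it again because the
     application still wants the duplicate. */
  ret = MPI_Comm_dup( comm, &new_comm );
}
-------------------------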

How do those semantics sound to folks?

-- Josh

On Thu, Feb 2, 2012 at 5:28 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:

>  Hi Josh,
>
> I agree on (1).
>
> Regarding (2), I’m wondering if another case C is feasible. What do you
> think?
>
> ( C ) If the input communicator is not required to be collectively active:
> MPI_Comm_split returns with an error everywhere (at all live processes) if
> it discovers any existing or emerging failure, i.e., it does not try to
> ‘work around’ either type of failure, existing or emerging.
>
> i.e., MPI_Comm_split automatically returns “valid” communicators if the
> original communicator had no failures. In case of failure, it simply
> returns an error. In this case, MPI_Comm_validate() needs to be called in
> the error path, not in the error-free path.
>
> Thanks.
>
> ===
>
> Sayantan Sur, Ph.D.
>
> Intel Corp.
>
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:
> mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
> Sent: Wednesday, February 01, 2012 2:26 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] Matching MPI communication object creation
> across process failure
>
> Some notes regarding this thread from the teleconf:
>
> There are three components to the discussion of how communicator creation
> operations should behave:
>
>  - (1) Uniform creation of the object (created everywhere or nowhere)
>
>  - (2) Requires the input communicator to be collectively active
>
>  - (3) Matches across emerging process failure
>
> (1) seems to be something we all agree on.
>
> (2) we discussed a bit more on the call. Consider an MPI_Comm_split
> operation where the application wants to divide the communicator in half.
>
>  - (A) If the input communicator is required to be collectively active,
> then the MPI_Comm_split will return an error if a process fails (making
> the input communicator collectively inactive) and the object cannot be
> uniformly created. So MPI_Comm_split does not 'work around' emerging
> failure, but errors out at the first sign of the failure before the
> 'decision point' in the protocol. The 'decision point' is the point in
> time where the communicator is designated as created.
>
>  - (B) If the input communicator is -not- required to be collectively
> active, then the MPI_Comm_split must be able to work around existing and
> emerging failure to agree upon the group membership of the output
> communicator(s). The resulting communicators may be of unintended size
> since the MPI_Comm_split is working around the errors. So the user is
> forced to call MPI_Comm_validate after the MPI_Comm_split on the input
> communicator to make sure that all processes exited the MPI_Comm_split
> with acceptable communicators (if such a distinction is important).
>
> So in (A) the following program is correct:
>
> -------------------------
> if( comm_rank < (comm_size/2) ) {
>   color = 0;
> } else {
>   color = 1;
> }
>
> do {
>   ret = MPI_Comm_split( comm, color, key, &new_comm );
>   if( ret == MPI_ERR_PROC_FAIL_STOP ) {
>     /* Re-enable collectives over this communicator */
>     MPI_Comm_validate( comm, &failed_grp );
>     MPI_Group_free( &failed_grp );
>     /* Adjust colors to account for failed processes */
>   }
>   /* If there was a process failure, then try again */
> } while( ret == MPI_ERR_PROC_FAIL_STOP );
> -------------------------
>
> In this example MPI_Comm_split will either return successfully with the
> exact groups asked for, or will return an error if a new process failure
> is encountered before the communicators are created. One nice thing to
> point out is that MPI_Comm_validate() is only called in the error path,
> and not in the failure-free path.
>
> In (B) it is less likely that MPI_Comm_split will return in error, since
> it is working around existing and emerging process failure. Because it is
> working around emerging failure, it is possible that the resulting
> communicators are of a size that is not acceptable to the application. So
> the application will need to do some collective, and call
> MPI_Comm_validate to ensure that the collective completed everywhere,
> before deciding to use the new communicators.
>
> -------------------------
> ...
> } while( ret == MPI_ERR_PROC_FAIL_STOP );
>
> /* Collective operation to check if communicators are of acceptable size */
>
> /* MPI_Comm_validate() to make sure the collective above completed
>    successfully everywhere */
>
> /* If acceptable, then continue */
>
> /* If unacceptable, then destroy them and call MPI_Comm_split again with
>    new colors */
> -------------------------
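>
> A sketch of how that outline might be filled in (a sketch only, assuming
> the proposed interfaces; the names new_size, min_size, acceptable, and
> all_acceptable are hypothetical):
> -------------------------
> /* Collective operation to check if the communicators are of acceptable
>    size */
> MPI_Comm_size( new_comm, &new_size );
> acceptable = (new_size >= min_size) ? 1 : 0;
> MPI_Allreduce( &acceptable, &all_acceptable, 1, MPI_INT, MPI_MIN, comm );
>
> /* MPI_Comm_validate() to make sure the collective above completed
>    successfully everywhere */
> MPI_Comm_validate( comm, &failed_grp );
> MPI_Group_free( &failed_grp );
>
> if( !all_acceptable ) {
>   /* If unacceptable, then destroy them and call MPI_Comm_split again
>      with new colors */
>   MPI_Comm_free( &new_comm );
> }
> -------------------------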
>
> So (B) pushes the user to call MPI_Comm_validate in the failure-free code
> path, and do some additional checking. Option (A) keeps the
> MPI_Comm_validate in the failure-full/error path.
>
> So it seems that requiring the input communicator to be collectively
> active (option A) is somewhat better.
>
> Regarding matching across failure (3), this seems to be implied by (1),
> since the uniformity of returning is provided by an agreement protocol.
> The question is how difficult would it be to not require matching across
> failure (3) if we know that the creation call is allowed to exit in error
> if it encounters an error before agreement (Option 2.A above)? I am
> fairly sure that we can implement this safely, but I would like to dwell
> on it a bit more to be sure.
>
> What do others think about these points?
>
> -- Josh
>
> On Tue, Jan 31, 2012 at 12:35 PM, Josh Hursey <jjhursey at open-mpi.org>
> wrote:
>
> Actually the following thread might be more useful for this discussion:
>
>   http://lists.mpi-forum.org/mpi3-ft/2011/12/0940.php
>
> The example did not come out well in the archives, so below is the diagram
> again (hopefully that will work):
>
> So the process stack looks like:
>
> P0                        P1
> ---------------        ----------------
> Dup(comm[X-1])         Dup(comm[X-1])
> MPI_Allreduce()        MPI_Allreduce()
> Dup(comm[X])           -> Error
>  -> Error
>
> So should P1 be required to call Dup(comm[X])?
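>
> In code, the same scenario looks roughly like this (a sketch only; the
> names prev_comm, comm, next_comm, in, and out are hypothetical, with
> Dup(comm[X-1]) producing comm[X]):
> -------------------------
> MPI_Comm_dup( prev_comm, &comm );                  /* Dup(comm[X-1]) */
> ret = MPI_Allreduce( &in, &out, 1, MPI_INT, MPI_SUM, comm );
> /* P1: the Allreduce raises MPI_ERR_PROC_FAIL_STOP, so P1 may decide it
>    no longer needs the next communicator and skip the dup below.      */
> /* P0: the Allreduce succeeds, so P0 calls the dup below, which then
>    raises the error instead.                                          */
> ret = MPI_Comm_dup( comm, &next_comm );            /* Dup(comm[X])   */
> -------------------------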
>
> -- Josh
>
> On Wed, Jan 25, 2012 at 5:08 PM, Josh Hursey <jjhursey at open-mpi.org>
> wrote:
>
> The current proposal states that MPI object creation functions (e.g.,
> MPI_Comm_create, MPI_Win_create, MPI_File_open):
>
> -------------------
> All participating communicator(s) must be collectively active before
> calling any communicator creation operation. Otherwise, the communicator
> creation operation will uniformly raise an error code of the class
> MPI_ERR_PROC_FAIL_STOP.
>
> If a process failure prevents the uniform creation of the communicator
> then the communicator construction operation must ensure that the
> communicator is not created, and all alive participating processes will
> raise an error code of the class MPI_ERR_PROC_FAIL_STOP. Communicator
> construction operations will match across the notification of a process
> failure. As such, all alive processes must call the communicator
> construction operations the same number of times regardless of whether the
> emergent process failure makes the call irrelevant to the application.
> -------------------
>
> So there are three points here:
>
>  (1) That the communicator must be 'collectively active' before calling
> the operation,
>
>  (2) Uniform creation of the communication object, and
>
>  (3) Creation operations match across process failure.
>
> Point (2) seems to be necessary so that all processes only ever see a
> communication object that is consistent across all processes. This implies
> a fault tolerant agreement protocol (group membership).
>
> There was a question about why point (3) is necessary. We (Darius, UTK,
> and I) discussed this on 12/12/2011, and I posted my notes on this to the
> list:
>
>   http://lists.mpi-forum.org/mpi3-ft/2011/12/0938.php
>
> Looking back at my written notes, they don't have much more than that to
> add to the discussion.
>
> So the problem with (3) seemed to arise out of the failure handlers,
> though I am not convinced that they are strictly to blame in this
> circumstance. It seems that the agreement protocol might be factoring into
> the discussion as well since it is strongly synchronizing; if not all
> processes call the operation, how does it know when to bail out? The peer
> processes are (a) calling that operation, (b) going to call it but have
> not yet, or (c) never going to call it because they decided independently
> not to, based on a locally reported process failure.
>
> It seems that the core problem has to do with when to break out of the
> collective creation operation, and when to restore matching.
>
> So should we reconsider the restriction on (3)? More to the point, is it
> safe to not require (3)?
>
> -- Josh
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey