[Mpi3-ft] Stabilization Proposal Updated & MPI_COMM_VALIDATE_ALL_SYNC

Joshua Hursey jjhursey at open-mpi.org
Wed Sep 29 13:04:28 CDT 2010

On Sep 28, 2010, at 9:35 AM, Joshua Hursey wrote:

> I updated the Run-Through Stabilization proposal:
> * Cross reference the MPI_ERR_CANNOT_CONTINUE proposal
> * Fix the change for MPI_COMM_SPLIT per a conversation with Jeff Squyres
> * Add a MPI_COMM_VALIDATE_ALL_SYNC function (more below)
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
> ---------------------------
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization#CollectiveValidationOperations
> This function will likely replace the existing MPI_COMM_VALIDATE_ALL and MPI_COMM_VALIDATE_ALL_CLEAR as the only collective validation function (and probably be renamed to just MPI_COMM_VALIDATE_ALL). This call is collective over the group/communicator and clears all known failures at the top of the call. It returns a count of the total number of failures (previously recognized and unrecognized) in the group/communicator. The user can use the local MPI_COMM_VALIDATE function to access a list of failures, if needed.
> This function is similar to the original proposal's validate function, but adds the count argument as an agreed-upon value. After the discussion in Germany, and after experimenting with a few application kernels, it became apparent that the MPI_COMM_VALIDATE_ALL and MPI_COMM_VALIDATE_ALL_CLEAR functions are almost always called together when a failure happens (incurring two collective calls for each communicator). So, since there is a local accessor to the list, we can create a single collective function that 'fixes' the group/communicator in a single operation. Removing the list of known failures from this collective also reduces the memory footprint needed to call this function.
> A few questions for the group:

Per the teleconf discussion:

> 1) So are there any objections to removing the MPI_COMM_VALIDATE_ALL and MPI_COMM_VALIDATE_ALL_CLEAR functions and replacing them with the MPI_COMM_VALIDATE_ALL_SYNC function (and renaming it MPI_COMM_VALIDATE_ALL)?

This modification has been made in the proposal.
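
For anyone who wants the shape of the merged call spelled out, here is a rough single-process sketch in C. It is not the proposal text: the mock_comm type, the mock_* function names, and their argument lists are all made up for illustration, "clearing" is modeled as simply setting a recognized flag, and the real call would be collective and would agree on the count across all participating processes.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_RANKS 64

/* Hypothetical stand-in for a communicator: failure bookkeeping only. */
typedef struct {
    int size;                        /* number of ranks in the communicator */
    int failed[MOCK_MAX_RANKS];      /* 1 if the rank is known to have failed */
    int recognized[MOCK_MAX_RANKS];  /* 1 if that failure has been cleared */
} mock_comm;

/* Sketch of the merged collective: clear (recognize) every known failure
 * and report the total number of failures, whether or not they had been
 * recognized before the call. No list is returned. */
int mock_comm_validate_all(mock_comm *comm, int *failed_count)
{
    assert(comm != NULL && failed_count != NULL);
    int count = 0;
    for (int r = 0; r < comm->size; r++) {
        if (comm->failed[r]) {
            comm->recognized[r] = 1;
            count++;
        }
    }
    *failed_count = count;
    return 0;
}

/* Sketch of the local accessor: copy up to incount failed ranks into
 * ranks[] and report how many were written. */
int mock_comm_validate(const mock_comm *comm, int incount,
                       int *outcount, int ranks[])
{
    assert(comm != NULL && outcount != NULL && ranks != NULL);
    int n = 0;
    for (int r = 0; r < comm->size && n < incount; r++) {
        if (comm->failed[r]) {
            ranks[n++] = r;
        }
    }
    *outcount = n;
    return 0;
}
```

The point of the split is that an application pays for one collective per communicator after a failure, and only asks for the list locally if it actually needs it.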

> 2) Group management operations are local and do not require interprocess communication (Section 6.3). In light of this, is there any objection to removing the collective validate functions from the group construction section (they will still be defined for communicators)?

The collective group interface has now been removed.

Additionally, I renamed the MPI_Rank_state object to MPI_Rank_info, since this object represents a bit more than just the state (it also includes the 'generation' number, which can be seen as part of the rank's name).


> As always, thanks for the feedback.
> -- Josh
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://www.cs.indiana.edu/~jjhursey
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
