[Mpi3-ft] Multiple Communicator Version of MPI_Comm_validate

Wed Sep 14 15:17:13 CDT 2011

Hi Josh,

> 
> Workaround:
> --------------------
> Call MPI_Comm_validate over all of the communicators individually.
> This would involve 'num_comms' collective operations and likely impede
> scalability.
> 
> for(i=0; i < num_comms; ++i) {
>   MPI_Comm_validate(comm[i], failed_grps[i]);
> }
>

Would it not be possible for the application to create a communicator that is the union of all the processes, and then call validate only on that 'super' communicator? I hope I am not missing something in your example.
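
To make the suggestion concrete, here is a minimal sketch in plain C. It models each communicator only by the set of MPI_COMM_WORLD ranks it contains. The fixed WORLD_SIZE, the group_model type and both helper names are hypothetical and chosen for illustration. The sketch first computes the membership of the 'super' communicator as the union of all the groups. It then recovers each original communicator's failed group by intersecting the single agreed failed set with that communicator's membership. With real MPI, the same steps would presumably use MPI_Comm_group, MPI_Group_union, MPI_Comm_create and MPI_Group_intersection, with validate called once on the resulting communicator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a communicator is represented only by the set of
 * MPI_COMM_WORLD ranks it contains, stored as a membership bitmap. */
#define WORLD_SIZE 8

typedef struct {
    bool member[WORLD_SIZE]; /* member[r] is true if world rank r belongs */
} group_model;

/* Union of all communicators' groups: the membership of the 'super'
 * communicator on which a single validate call would run. */
static group_model group_union_all(const group_model *groups, size_t n)
{
    group_model u = {{false}};
    for (size_t i = 0; i < n; ++i)
        for (int r = 0; r < WORLD_SIZE; ++r)
            u.member[r] = u.member[r] || groups[i].member[r];
    return u;
}

/* After one validate on the super communicator has produced an agreed set
 * of failed world ranks, each original communicator's failed group is the
 * intersection of that set with its own membership. */
static group_model group_intersection(const group_model *a,
                                      const group_model *b)
{
    group_model out = {{false}};
    for (int r = 0; r < WORLD_SIZE; ++r)
        out.member[r] = a->member[r] && b->member[r];
    return out;
}
```

One cost of this approach is that the union can be much larger than any single communicator. Every member of the super communicator must also take part in the validate call, including processes that share only one of the original communicators with the caller.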

I like your Option B as such; however, as you point out, it has significant problems for applications built from several layers of libraries.

Sayantan.

> 
> Option A:
> Array of communicators
> --------------------
> MPI_Comm_validate_many(comm[], num_comms, failed_grp)
> Validate 'num_comms' communicators, and return a failed group.
>   - or -
> MPI_Comm_validate_many(comm[], num_comms, failed_grps[])
> Validate 'num_comms' communicators, and return a failed group for each
> communicator.
> ----
> 
> In this version of the operation the user passes in an array of
> pointers to communicators. Since communicators are not often created
> in a contiguous array, pointers to communicators should probably be
> used. The failed_grps argument is an array holding the group of
> failed processes for each of those communicators.
> 
> Some questions:
>  * Should all processes pass in the same set of communicators?
>  * Should all communicators be duplicates or subsets of one another?
>  * Does this operation run the risk of a circular dependency if the
> user does not pass in the same set of communicators at all
> participating processes? Is that something the MPI library should
> protect the application from?
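
On the circular-dependency question, one common way to avoid it is to impose a single global order on the communicators. This assumes the MPI library walks the communicators one at a time internally. The sketch below models the idea in plain C. The comm_model type, its id field and validate_many_model are hypothetical. The id is assumed to be a value every member process agrees on. MPI does not expose such an identifier portably, so the application would have to assign one. The loop body only counts validations and stands in for one per-communicator validate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a communicator handle. 'id' is assumed to be
 * agreed on by every member process (not something MPI provides). */
typedef struct {
    int id;
    int validated; /* number of times this communicator was validated */
} comm_model;

/* qsort comparator for an array of comm_model pointers, ordered by id. */
static int cmp_comm_ptr(const void *a, const void *b)
{
    const comm_model *x = *(const comm_model *const *)a;
    const comm_model *y = *(const comm_model *const *)b;
    return (x->id > y->id) - (x->id < y->id);
}

/* Sketch of an Option A style call. The array holds pointers because the
 * communicators need not be stored contiguously. Sorting by the agreed id
 * first means every process walks the communicators in the same order.
 * Note: the caller's pointer array is reordered in place. */
static void validate_many_model(comm_model **comms, size_t num_comms)
{
    qsort(comms, num_comms, sizeof comms[0], cmp_comm_ptr);
    for (size_t i = 0; i < num_comms; ++i)
        comms[i]->validated++; /* stands in for one per-communicator validate */
}
```

Because every process sorts by the same key before starting, no two processes can end up blocked, each waiting in a communicator the other has not reached yet. This is the same argument as acquiring locks in a fixed order.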
> 
> 
> Option B:
> Implicit inherited validation
> --------------------
> MPI_Comm_validate_many(comm, failed_grp)
> ----
> 
> The idea is to add an additional semantic (or maybe a new API) that
> allows the validation of a communicator to automatically validate all
> communicators created from it (only dups and subsets of it?).
> 
> The problem with this is that if an application calls
> MPI_Comm_validate on MPI_COMM_WORLD, it changes the semantics of
> communicators that libraries might be using internally, without
> notifying those libraries. This breaks the abstraction barrier
> between the application and its libraries, possibly in a dangerous way.
> 
> Some questions:
>  * Are there some other semantics that we can add to help protect
> libraries? (e.g., after implicit validation the first use of the
> communicator will return a special error code indicating that the
> communicator has been adjusted).
>  * Are there thread safety issues involved with this? (e.g., the
> library operates in a concurrent thread with its own duplicate of the
> communicator. The application does not know about or control the
> concurrent thread but calls MPI_Comm_validate on its own communicator
> and implicitly changes the semantics of the duplicate communicator.)
>  * It is only through the call to MPI_Comm_validate that we can
> provide a uniform, globally known group of failed processes. For
> communicators that were implicitly validated, do we need to provide a
> way to access this group after the call? Does this have implications
> for the amount of storage this semantic requires?
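
A minimal plain-C model of the Option B semantics might look like the following. It also includes the "special error code on first use" idea from the first question. The comm_node type, the MODEL_* codes and the helper functions are all hypothetical. Each node records only the communicator it was derived from. Validating a node marks every dup or subset beneath it, directly or transitively, as adjusted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes for this model only; neither is an MPI constant. */
enum { MODEL_SUCCESS = 0, MODEL_ERR_ADJUSTED = 1 };

/* Each communicator records the one it was derived from (dup or subset). */
typedef struct comm_node {
    const struct comm_node *parent; /* NULL for the root, e.g. MPI_COMM_WORLD */
    int adjusted; /* set by an implicit validation, cleared on first use */
} comm_node;

/* Returns 1 if 'c' was derived, directly or transitively, from 'ancestor'. */
static int derives_from(const comm_node *c, const comm_node *ancestor)
{
    for (const comm_node *p = c->parent; p != NULL; p = p->parent)
        if (p == ancestor)
            return 1;
    return 0;
}

/* Option B sketch: validating 'root' implicitly validates every
 * communicator derived from it. The root itself is not flagged, since it
 * was validated explicitly. */
static void validate_inherited(comm_node *all, size_t n, const comm_node *root)
{
    for (size_t i = 0; i < n; ++i)
        if (derives_from(&all[i], root))
            all[i].adjusted = 1;
}

/* Models the protection floated in the first question: the first use
 * after an implicit validation reports that the communicator changed. */
static int use_comm(comm_node *c)
{
    if (c->adjusted) {
        c->adjusted = 0;
        return MODEL_ERR_ADJUSTED;
    }
    return MODEL_SUCCESS;
}
```

The adjusted flag here is a plain int with no synchronization. If a library thread calls use_comm while the application thread runs validate_inherited, the two race on that flag. This is one concrete form of the thread-safety question above.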
> 
> 
> Option C:
> Explicit inherited validation
> --------------------
> MPI_Comm_validate_link(commA, commB);
> MPI_Comm_validate_many(commA, failed_grp)
> /* Implies MPI_Comm_validate(commB, NULL) */
> 
> MPI_Comm_validate(commA, failed_grp)
> /* Does not imply MPI_Comm_validate(commB, NULL) */
> ----
> 
> In this version the application explicitly links communicators. This
> prevents an application from implicitly altering derived communicators
> outside of their scope (e.g., those in use by other libraries).
> 
> Some questions:
>  * It is only through the call to MPI_Comm_validate that we can
> provide a uniform, globally known group of failed processes. For
> communicators that were implicitly validated (e.g., commB), do we need
> to provide a way to access this group after the call? Does this have
> implications for the amount of storage this semantic requires?
>  * Do we need a mechanism to 'unlink' communicators? Or determine
> which communicators are linked?
>  * Can a communicator be linked to multiple other communicators?
>  * Is linking unidirectional? (That is, in the example above,
> validating commB would not validate commA unless there is a separate
> MPI_Comm_validate_link(commB, commA) call.)
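
The explicit linking in Option C can be modeled the same way. In this sketch, link_comm, MAX_LINKS and the three functions are hypothetical stand-ins for the proposed calls. Links are one-directional and are followed only one level deep. This is just one possible answer to the questions above, not a settled one.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINKS 4 /* hypothetical fixed bound on links per communicator */

/* Hypothetical model of Option C: each communicator keeps an explicit,
 * one-directional list of communicators linked to it. */
typedef struct link_comm {
    struct link_comm *links[MAX_LINKS];
    size_t num_links;
    int validated; /* count of (explicit or linked) validations */
} link_comm;

/* Stands in for MPI_Comm_validate_link(a, b): after this, a linked
 * validation of 'a' also validates 'b', but not the reverse.
 * Returns 0 on success, -1 if the fixed-size link table is full. */
static int validate_link(link_comm *a, link_comm *b)
{
    if (a->num_links == MAX_LINKS)
        return -1;
    a->links[a->num_links++] = b;
    return 0;
}

/* Stands in for the linked (validate_many) form of Option C: validates 'a'
 * and the communicators directly linked to it, one level only. */
static void validate_many_linked(link_comm *a)
{
    a->validated++;
    for (size_t i = 0; i < a->num_links; ++i)
        a->links[i]->validated++;
}

/* Plain validation ignores links entirely. */
static void validate_single(link_comm *a)
{
    a->validated++;
}
```

With this choice, validating commB never touches commA unless validate_link(commB, commA) is also called. A fixed-size link table is one way to bound the per-communicator storage that the questions ask about.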
> 
> 
> Option D:
> Other
> --------------------
> Something else...
> 
> 
> Thoughts?
> 
> -- Josh
> 
> 
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft