[Mpi3-ft] Multiple Communicator Version of MPI_Comm_validate

Josh Hursey jjhursey at open-mpi.org
Wed Sep 14 12:51:43 CDT 2011


(I decided to split this into its own thread of discussion from what
was developing on [1])

Recently it was mentioned on the list [1] that users would like the
ability to validate multiple communicators (or windows or file
handles) in a single collective call. We have discussed this a few
times in the past. The difficulty of defining the interface and
semantics appropriately has prevented us from specifying such an
operation so far. So although we saw the desire for such an operation,
we pushed it to a separate follow-on ticket until we could get it right.

[1] http://lists.mpi-forum.org/mpi3-ft/2011/09/0836.php


Problem:
--------------------
MPI_Comm_validate(comm, failed_grp)

The current operation validates a single communicator and returns the
group of globally agreed-upon failed processes associated with that
communicator. Users would like the ability to validate multiple
communicators in a single collective call. This would likely perform
better than calling MPI_Comm_validate on each individual
communicator.
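
For reference, a minimal sketch of the current single-communicator
usage (this assumes the proposed C binding
MPI_Comm_validate(MPI_Comm, MPI_Group *); the returned group is
queried with the standard group calls):

MPI_Group failed_grp;
int num_failed;

/* Collectively agree on the set of failed processes in 'comm' */
MPI_Comm_validate(comm, &failed_grp);

/* Inspect the globally agreed-upon failed group */
MPI_Group_size(failed_grp, &num_failed);

MPI_Group_free(&failed_grp);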


Workaround:
--------------------
Call MPI_Comm_validate on each of the communicators individually.
This involves 'num_comms' collective operations and would likely
impede scalability.

int i;
for (i = 0; i < num_comms; ++i) {
  /* One collective validation per communicator
     (assumes failed_grps is an array of MPI_Group) */
  MPI_Comm_validate(comm[i], &failed_grps[i]);
}


Option A:
Array of communicators
--------------------
MPI_Comm_validate_many(comm[], num_comms, failed_grp)
Validate 'num_comms' communicators, and return a failed group.
  - or -
MPI_Comm_validate_many(comm[], num_comms, failed_grps[])
Validate 'num_comms' communicators, and return a failed group for each
communicator.
----

In this version of the operation the user passes in an array of
communicators. Since communicators are not often created in a
contiguous array, an array of pointers to communicators should
probably be used. The failed_grps argument is an array containing the
failed group for each of those communicators.
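
As a rough sketch of how the second variant might look from the
application side (the binding is illustrative only; 'comms' and
'failed_grps' are plain arrays of handles here):

MPI_Comm  comms[3];   /* e.g., a dup, a row comm, and a column comm */
MPI_Group failed_grps[3];
int i;

/* One collective call validates all three communicators and returns
   the agreed-upon failed group for each of them */
MPI_Comm_validate_many(comms, 3, failed_grps);

for (i = 0; i < 3; ++i) {
  MPI_Group_free(&failed_grps[i]);
}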

Some questions:
 * Must all participating processes pass in the same set of communicators?
 * Should all communicators be duplicates or subsets of one another?
 * Does this operation run the risk of a circular dependency if the
user does not pass in the same set of communicators at all
participating processes? Is that something the MPI library should
protect the application from?


Option B:
Implicit inherited validation
--------------------
MPI_Comm_validate_many(comm, failed_grp)
----

The idea is to add an additional semantic (or maybe a new API) so
that validating a communicator automatically validates all
communicators created from it (or only dups and subsets of it?).

The problem with this is that if an application calls
MPI_Comm_validate on MPI_COMM_WORLD, it changes the semantics of
communicators that libraries might be using internally without
notifying those libraries. So this breaks the abstraction barrier
between the application and the libraries in a possibly dangerous way.
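
To illustrate the hazard (an illustrative sketch only; the use of
MPI_Comm_validate_many here follows the Option B signature above):

/* Inside a library: keep a private duplicate of the caller's communicator */
MPI_Comm lib_comm;
MPI_Comm_dup(MPI_COMM_WORLD, &lib_comm);

/* Later, in the application (which knows nothing about lib_comm): */
MPI_Group failed_grp;
MPI_Comm_validate_many(MPI_COMM_WORLD, &failed_grp);
/* Under Option B this also validates lib_comm, changing its behavior
   without the library being notified */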

Some questions:
 * Are there some other semantics that we can add to help protect
libraries? (e.g., after implicit validation the first use of the
communicator will return a special error code indicating that the
communicator has been adjusted).
 * Are there thread safety issues involved with this? (e.g., the
library operates in a concurrent thread with its own duplicate of the
communicator. The application does not know about or control the
concurrent thread but calls MPI_Comm_validate on its own communicator
and implicitly changes the semantics of the duplicate communicator.)
 * It is only through the call to MPI_Comm_validate that we can
provide a uniform, globally known group of failed processes. For the
communicators that were implicitly validated, do we need to provide a
way to access this group after the call? Does this have implications
for the amount of storage required to support this semantic?


Option C:
Explicit inherited validation
--------------------
MPI_Comm_validate_link(commA, commB);
MPI_Comm_validate_many(commA, failed_grp)
/* Implies MPI_Comm_validate(commB, NULL) */

MPI_Comm_validate(commA, failed_grp)
/* Does not imply MPI_Comm_validate(commB, NULL) */
----

In this version the application explicitly links communicators. This
prevents an application from implicitly altering derived communicators
outside of its scope (e.g., those in use by other libraries).
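
A rough usage sketch (illustrative only; it assumes commB was derived
from commA and that the Option C bindings above are used):

MPI_Comm  commA, commB;
MPI_Group failed_grp;

/* ... commB was derived from commA (e.g., via MPI_Comm_dup) ... */

/* Explicitly opt in: validating commA will also validate commB */
MPI_Comm_validate_link(commA, commB);

/* One collective call; commB is validated as a side effect, but its
   failed group is not returned here (see the questions below) */
MPI_Comm_validate_many(commA, &failed_grp);

MPI_Group_free(&failed_grp);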

Some questions:
 * It is only through the call to MPI_Comm_validate that we can
provide a uniform, globally known group of failed processes. For the
communicators that were implicitly validated (e.g., commB), do we need
to provide a way to access this group after the call? Does this have
implications for the amount of storage required to support this
semantic?
 * Do we need a mechanism to 'unlink' communicators? Or determine
which communicators are linked?
 * Can a communicator be linked to multiple other communicators?
 * Is the linking a unidirectional operation? (so in the example above
validating commB does not validate commA unless there is a separate
MPI_Comm_validate_link(commB, commA) call)


Option D:
Other
--------------------
Something else...


Thoughts?

-- Josh


-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey


