[Mpi3-ft] Multiple Communicator Version of MPI_Comm_validate

Wed Oct 5 13:49:50 CDT 2011

During the Sept. MPI Forum meeting the concept of a multi-validate was
discussed. From the notes it seems that people want a validation
operation that is restricted to communicators derived from a
'master' communicator.

Something like:
 MPI_Comm_validate_many(parent_comm, num_descendants,
descendant_comms[], fail_group);

The parent_comm is used to validate the group, then all of the
descendant communicators are also validated in semantically the same
step. 'fail_group' would be the group of failures in relation to
parent_comm. Since all other communicators are descendants of
parent_comm, they can use the fail_group to determine which processes
failed with respect to their individual groups. The call would
synchronize over the parent_comm, and all processes are required to
pass in a complementary set of descendant communicators.
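
To make the fail_group bookkeeping a bit more concrete, here is a rough
sketch in plain C (no MPI calls, so it runs anywhere). It assumes a
group can be modeled as an array of ranks numbered in parent_comm. The
names group_model, in_group and failed_in_descendant are made up for
illustration and are not part of any proposal. In a real implementation
I would expect something like MPI_Group_intersection on the descendant's
group and fail_group to do this job.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a group is a list of ranks numbered in parent_comm. */
typedef struct {
    const int *parent_ranks; /* member i has rank parent_ranks[i] in parent_comm */
    size_t size;
} group_model;

/* Returns 1 if parent rank r is a member of g, 0 otherwise. */
int in_group(const group_model *g, int r) {
    for (size_t i = 0; i < g->size; ++i)
        if (g->parent_ranks[i] == r)
            return 1;
    return 0;
}

/*
 * Finds the members of a descendant that appear in fail_group.
 * Writes their local ranks (positions within the descendant) to out,
 * which must have room for desc->size entries, and returns the count.
 */
size_t failed_in_descendant(const group_model *desc,
                            const group_model *fail_group,
                            int *out) {
    assert(out != NULL);
    size_t n = 0;
    for (size_t i = 0; i < desc->size; ++i)
        if (in_group(fail_group, desc->parent_ranks[i]))
            out[n++] = (int)i;
    return n;
}
```

For example, with a descendant made of parent ranks {1, 3, 5, 7} and a
fail_group of {2, 5}, only local rank 2 of the descendant is reported,
since parent rank 2 is not a member of that descendant.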

I say a 'complementary set of descendant communicators' (there is
likely a better way to phrase it) because we probably want to allow a
user to do something like:
-----------------------
MPI_Comm_dup(MPI_COMM_WORLD, &dup);
color = rank%2;
key = rank;
MPI_Comm_split(MPI_COMM_WORLD, color, key, &split);

MPI_Comm descendants[2] = {dup, split};
MPI_Comm_validate_many(MPI_COMM_WORLD, 2, descendants, fail_grp);
-----------------------

So even though the 'split' communicator is different for half of the
processes, it should be allowed to be passed into the operation since
it is synchronizing over MPI_COMM_WORLD. But it would be incorrect for
some members of 'split' or 'dup' to not include that communicator in
the descendant list. Something like: all members of a specified
descendant communicator must pass in that same descendant communicator,
and parent_comm must be the same at all processes. Otherwise we might
accidentally re-enable collectives on a descendant communicator in a
nonuniform way, since a peer did not supply it to the call.
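
As a sanity check on that rule, here is a small sketch in plain C (no
MPI calls) of what 'complementary' could mean. It assumes each
communicator can be given a small integer id and that we can see every
process's arguments at once, which a real library could not do without
communication. The names call_args, MAX_COMMS and is_complementary are
hypothetical and only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMMS 8 /* hypothetical upper bound on communicator ids */

/* Hypothetical record of what one process supplied to the call. */
typedef struct {
    int member_of[MAX_COMMS]; /* 1 if this process belongs to communicator c */
    int passed[MAX_COMMS];    /* 1 if this process listed communicator c */
} call_args;

/*
 * Returns 1 if, for every communicator listed by at least one process,
 * all of that communicator's members also listed it; returns 0 otherwise.
 */
int is_complementary(const call_args *procs, size_t nprocs) {
    assert(procs != NULL);
    for (int c = 0; c < MAX_COMMS; ++c) {
        int listed = 0;
        for (size_t p = 0; p < nprocs; ++p)
            if (procs[p].passed[c])
                listed = 1;
        if (!listed)
            continue; /* nobody passed c, so it is simply not validated */
        for (size_t p = 0; p < nprocs; ++p)
            if (procs[p].member_of[c] && !procs[p].passed[c])
                return 0; /* a member left c out of its descendant list */
    }
    return 1;
}
```

In the dup/split example above, take id 0 for 'dup' and ids 1 and 2 for
the two halves of 'split'. If every rank passes 'dup' and its own
'split' the check succeeds, and it fails as soon as one rank drops
'split' from its list.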

What do folks think of this option?

-- Josh

On Mon, Sep 19, 2011 at 2:54 AM, Graham, Richard L. <rlgraham at ornl.gov> wrote:
> Bronis,
>  I have seen you raise the memory issue several times.  From my perspective the memory issue is an implementation issue, and is no different than that of the communication implementation itself.  The current proposal is such that an efficient implementation can be developed - I am not aware of any interface issues that require non-scalable memory usage.
>  From my perspective, the reasons to validate multiple communicators are
>   - user convenience (weak)
>   - reduce global communications.  Unfortunately, since one is querying for the state of a global object (the communicator, file handle, window), local information is not sufficient to infer its state.
>
> Rich
>
> On Sep 15, 2011, at 6:08 PM, Bronis R. de Supinski wrote:
>
>>
>> From a scalability standpoint and from an MPI implementation
>> memory usage standpoint, you want validating any communicator
>> that includes a dead endpoint to eliminate ever having to
>> use collective communication to validate other communicators
>> that include that endpoint. If you can achieve that by the
>> current interface then you do not need to add anything (the
>> validate becomes a local call so no big deal, it just ensures
>> the code that uses that communicator has current information).
>>
>> Otherwise, you want validating a communicator that includes
>> that endpoint to validate all other communicators that use
>> that endpoint, regardless of how the communicators have been
>> derived. You do not want an inherited interface, either explicit
>> or implicit (explicit is too hard to use and does not solve the
>> real problem of excessive cost being forced on the user by
>> the interface; similarly implicit but only inherited does not
>> solve the problem). The key is to provide some mechanism to
>> alert the user that the communicator has been backdoor validated.
>>
>>
>>
>> On Thu, 15 Sep 2011, Josh Hursey wrote:
>>
>>> On Wed, Sep 14, 2011 at 4:17 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:
>>>> Hi Josh,
>>>>
>>>>>
>>>>> Workaround:
>>>>> --------------------
>>>>> Call MPI_Comm_validate over all of the communicators individually.
>>>>> This would involve 'num_comms' collective operations and likely impede
>>>>> scalability.
>>>>>
>>>>> for(i=0; i < num_comms; ++i) {
>>>>>  MPI_Comm_validate(comm[i], failed_grps[i]);
>>>>> }
>>>>>
>>>>
>>>> Would it not be possible for the app to create a communicator that is the union of all the processes, and subsequently call validate only on that 'super' communicator? I hope I am not missing something from your example.
>>>
>>> In the current spec, no. MPI_Comm_validate only changes the state of
>>> the communicator passed. We probably want to create a new API like
>>> MPI_Comm_validate_many() to host these new semantics.
>>>
>>> It is important to remember that the validate operation changes the
>>> communicator (primarily just the 'are_collectives_enabled' flag on the
>>> communicator), and not anything to do with elements of the group that
>>> form it.
>>>
>>> Currently after creation, a communicator does not need to track from
>>> which communicators it was created (at least that's the way I
>>> understand it). So creating a super communicator and calling
>>> MPI_Comm_validate_many() on that would require such tracking to have
>>> the validation propagate to all of the communicators that built it. So
>>> we could do it, but the additional state tracking would force
>>> additional memory consumption even if the operation is never used,
>>> which is slightly problematic.
>>>
>>>
>>>>
>>>> I liked your Option B as such, however, as you point out, it has significant problems in case of applications consisting of several layers of libraries.
>>>
>>>
>>> Thinking through your question above, I think Option B would require
>>> that we track the heritage of communicators after creation, which
>>> would increase memory consumption. It would also require us to
>>> maintain that linkage across communicator destruction. For example,
>>> -----------
>>> MPI_Comm_dup(MPI_COMM_WORLD, &commA);
>>> // MCW   is linked to commA
>>> MPI_Comm_dup(commA, &commB);
>>> // MCW   is linked to commA
>>> // commA is linked to commB
>>> MPI_Comm_dup(commB, &commC);
>>> // MCW   is linked to commA
>>> // commA is linked to commB
>>> // commB is linked to commC
>>> MPI_Comm_free(&commB);
>>> // MCW   is linked to commA
>>> // commA is linked to commC (since commB is now gone)
>>> ------------
>>>
>>> In the discussion so far it seems that the inheritance is only one
>>> way. Meaning that in the example above calling
>>> MPI_Comm_validate_many() on commA would validate commC (and commB if
>>> it is still around), but not MPI_COMM_WORLD. Is that what we are
>>> looking for, or do we want it to be more complete?
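
The linkage and the one-way inheritance described above can be sketched
in plain C (no MPI calls) roughly as follows. This is only a model under
some assumptions: communicators are small integer ids, id 0 stands in
for MPI_COMM_WORLD, and each communicator remembers a single parent.
comm_registry and the reg_* functions are hypothetical names, not a
proposed interface.

```c
#include <assert.h>

#define MAX_COMMS 16   /* hypothetical upper bound on communicator ids */
#define NO_PARENT (-1)

/* Hypothetical per-process table of communicator heritage. */
typedef struct {
    int parent[MAX_COMMS];              /* communicator c was derived from parent[c] */
    int alive[MAX_COMMS];               /* 1 while the communicator exists */
    int collectives_enabled[MAX_COMMS]; /* the flag that validation sets */
} comm_registry;

/* Starts with only id 0 alive and collectives disabled everywhere. */
void reg_init(comm_registry *r) {
    for (int c = 0; c < MAX_COMMS; ++c) {
        r->parent[c] = NO_PARENT;
        r->alive[c] = 0;
        r->collectives_enabled[c] = 0;
    }
    r->alive[0] = 1;
}

/* Records that child was created from parent. */
void reg_dup(comm_registry *r, int parent, int child) {
    assert(r->alive[parent]);
    r->parent[child] = parent;
    r->alive[child] = 1;
}

/* Frees id and relinks its children to its parent so the chain survives. */
void reg_free(comm_registry *r, int id) {
    for (int c = 0; c < MAX_COMMS; ++c)
        if (r->alive[c] && r->parent[c] == id)
            r->parent[c] = r->parent[id];
    r->alive[id] = 0;
    r->parent[id] = NO_PARENT;
}

/* Enables collectives on root and on everything derived from it, not on ancestors. */
void reg_validate_many(comm_registry *r, int root) {
    for (int c = 0; c < MAX_COMMS; ++c) {
        if (!r->alive[c])
            continue;
        for (int a = c; a != NO_PARENT; a = r->parent[a]) {
            if (a == root) {
                r->collectives_enabled[c] = 1;
                break;
            }
        }
    }
}
```

With ids 0 to 3 standing for MCW, commA, commB and commC, freeing commB
leaves commC linked to commA, and validating commA enables collectives
on commA and commC but not on MCW. The three arrays are the extra
per-communicator state that has to be carried even if the operation is
never called.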
>>>
>>>
>>> The explicit linking in Option C puts the user in more control over
>>> the overhead of tracking connections between communicators, but has
>>> other issues. :/
>>>
>>> -- Josh
>>>
>>>
>>>>
>>>> Sayantan.
>>>>
>>>>>
>>>>> Option A:
>>>>> Array of communicators
>>>>> --------------------
>>>>> MPI_Comm_validate_many(comm[], num_comms, failed_grp)
>>>>> Validate 'num_comms' communicators, and return a failed group.
>>>>>  - or -
>>>>> MPI_Comm_validate_many(comm[], num_comms, failed_grps[])
>>>>> Validate 'num_comms' communicators, and return a failed group for each
>>>>> communicator.
>>>>> ----
>>>>>
>>>>> In this version of the operation the user passes in an array of
>>>>> pointers to communicators. Since communicators are not often created
>>>>> in a contiguous array, pointers to communicators should probably be
>>>>> used. The failed_grps is an array of failures in each of those
>>>>> communicators.
>>>>>
>>>>> Some questions:
>>>>> * Should all processes pass in the same set of communicators?
>>>>> * Should all communicators be duplicates or subsets of one another?
>>>>> * Does this operation run the risk of a circular dependency if the
>>>>> user does not pass in the same set of communicators at all
>>>>> participating processes? Is that something the MPI library should
>>>>> protect the application from?
>>>>>
>>>>>
>>>>> Option B:
>>>>> Implicit inherited validation
>>>>> --------------------
>>>>> MPI_Comm_validate_many(comm, failed_grp)
>>>>> ----
>>>>>
>>>>> The idea is to add an additional semantic (or maybe a new API) that
>>>>> allows the validation of a communicator to automatically validate all
>>>>> communicators created from it (only dups and subsets of it?).
>>>>>
>>>>> The problem with this is that if an application calls
>>>>> MPI_Comm_validate on MPI_COMM_WORLD, it changes the semantics of
>>>>> communicators that libraries might be using internally without
>>>>> notification in those libraries. So this breaks the abstraction
>>>>> barrier between the two in possibly a dangerous way.
>>>>>
>>>>> Some questions:
>>>>> * Are there some other semantics that we can add to help protect
>>>>> libraries? (e.g., after implicit validation the first use of the
>>>>> communicator will return a special error code indicating that the
>>>>> communicator has been adjusted).
>>>>> * Are there thread safety issues involved with this? (e.g., the
>>>>> library operates in a concurrent thread with its own duplicate of the
>>>>> communicator. The application does not know about or control the
>>>>> concurrent thread but calls MPI_Comm_validate on its own communicator
>>>>> and implicitly changes the semantics of the duplicate communicator.)
>>>>> * It is only through the call to MPI_Comm_validate that we can
>>>>> provide a uniform group of failed processes globally known. For those
>>>>> that were implicitly validated, do we need to provide a way to access
>>>>> this group after the call? Does this have implications on the amount
>>>>> of storage required for this semantic?
>>>>>
>>>>>
>>>>> Option C:
>>>>> Explicit inherited validation
>>>>> --------------------
>>>>> MPI_Comm_validate_link(commA, commB);
>>>>> MPI_Comm_validate_many(commA, failed_grp)
>>>>> /* Implies MPI_Comm_validate(commB, NULL) */
>>>>>
>>>>> MPI_Comm_validate(commA, failed_grp)
>>>>> /* Does not imply MPI_Comm_validate(commB, NULL) */
>>>>> ----
>>>>>
>>>>> In this version the application explicitly links communicators. This
>>>>> prevents an application from implicitly altering derived communicators
>>>>> out of their scope (e.g., in use by other libraries).
>>>>>
>>>>> Some questions:
>>>>> * It is only through the call to MPI_Comm_validate that we can
>>>>> provide a uniform group of failed processes globally known. For those
>>>>> that were implicitly validated, do we need to provide a way to access
>>>>> this group after the call (e.g., for commB)? Does this have
>>>>> implications on the amount of storage required for this semantic?
>>>>> * Do we need a mechanism to 'unlink' communicators? Or determine
>>>>> which communicators are linked?
>>>>> * Can a communicator be linked to multiple other communicators?
>>>>> * Is the linking a unidirectional operation? (so in the example above
>>>>> validating commB does not validate commA unless there is a separate
>>>>> MPI_Comm_validate_link(commB, commA) call)
>>>>>
>>>>>
>>>>> Option D:
>>>>> Other
>>>>> --------------------
>>>>> Something else...
>>>>>
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> -- Josh
>>>>>
>>>>>
>>>>> --
>>>>> Joshua Hursey
>>>>> Postdoctoral Research Associate
>>>>> Oak Ridge National Laboratory
>>>>> http://users.nccs.gov/~jjhursey
>>>>> _______________________________________________
>>>>> mpi3-ft mailing list
>>>>> mpi3-ft at lists.mpi-forum.org
>>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>
>
>
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey