[Mpi3-ft] Multiple Communicator Version of MPI_Comm_validate
Josh Hursey
jjhursey at open-mpi.org
Fri Sep 16 08:50:08 CDT 2011
On Thu, Sep 15, 2011 at 6:08 PM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>
> From a scalability standpoint and from an MPI implementation
> memory usage standpoint, you want validating any communicator
> that includes a dead endpoint to eliminate ever having to
> use collective communication to validate other communicators
> that include that endpoint. If you can achieve that by the
> current interface then you do not need to add anything (the
> validate becomes a local call so no big deal, it just ensures
> the code that uses that communicator has current information).
Note that if you do not want to use collectives, only point-to-point,
then you may never have to call MPI_Comm_validate at all - it is
really just for re-enabling collectives and providing a consistent
global view of the failure set.
For collective communication, I don't think we can achieve the local
semantic with the current version of MPI_Comm_validate() so we are
definitely talking about a new interface. The synchronization and
consistency aspects of MPI_Comm_validate() are pretty important as
building blocks for many fault tolerant applications. Termination
protocols often require that at some point each process knows that
everyone else agrees upon the same state, so that they can take
consistent action. Without faults you can infer this behavior based on
messages received and a single context of execution. But with faults
occurring nondeterministically, and notification at each process
happening at different points in the execution, it can be difficult to
coordinate recovery with other processes.
We could think about what an MPI_Comm_validate_local() version might
look like: something that validates a communicator locally with
respect to, say, another communicator.
--------------
MPI_Comm_validate_many(MPI_COMM_WORLD, &grp);
// This implicitly validates all other communicators containing
// any of the failed processes in &grp. Collective across MCW
MPI_Comm_validate_local(commA, MPI_COMM_WORLD, &grp);
// Locally validate commA based on the last validated state of MCW
--------------
So it would re-enable collectives locally and return a list of failed
processes with respect to the last MPI_Comm_validate (hmm, we might
have to store this information on the communicator, which we have
avoided so far).
Would we run into cross-process consistency issues if one process
validated_local and started using collectives on commA before another
process called validate_local? This might be something interesting to
think through as another alternative. The problem with the
MPI_Comm_validate_local call is that you need to know which
communicator was validated to know which to validate locally against
(gets into the user complexity trap).
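
To make the validate_local idea a bit more concrete, here is a minimal
sketch in plain C. It is not MPI code and not a proposed binding: the
toy_comm type, the bitmask representation of ranks, and the toy_*
function names are all assumptions made up for illustration, and the
collective agreement itself is not modeled (its result is simply passed
in). The point it shows is that a local validate only works if the last
validated failure set is stored on the communicator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, NOT MPI: a communicator is a bitmask of world ranks. */
typedef struct {
    uint32_t members;             /* bit r set => world rank r is a member  */
    uint32_t failed_at_validate;  /* failed members as of the last validate */
    bool     collectives_enabled; /* models the per-communicator flag       */
} toy_comm;

/* Hypothetical stand-in for MPI_Comm_validate_many() on one communicator.
 * 'agreed_failed' models the failure set the collective agreement would
 * produce; the agreement protocol is deliberately left out. */
static uint32_t toy_validate_many(toy_comm *comm, uint32_t agreed_failed)
{
    assert(comm != 0);
    comm->failed_at_validate = agreed_failed & comm->members;
    comm->collectives_enabled = true;
    return comm->failed_at_validate;
}

/* Hypothetical stand-in for MPI_Comm_validate_local(): no communication,
 * it only reuses the failure set stored by the last validate on
 * 'validated'. Assumes 'comm' is a subset of 'validated'. */
static uint32_t toy_validate_local(toy_comm *comm, const toy_comm *validated)
{
    assert(comm != 0 && validated != 0);
    assert((comm->members & ~validated->members) == 0);
    comm->failed_at_validate = validated->failed_at_validate & comm->members;
    comm->collectives_enabled = true;
    return comm->failed_at_validate;
}
```

In this model the caller still has to name the communicator that was
validated collectively, which is exactly the user complexity issue
mentioned above. Nothing here addresses the cross-process consistency
question either, since each process would run toy_validate_local at a
different point in its execution.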
>
> Otherwise, you want validating a communicator that includes
> that endpoint to validate all other communicators that use
> that endpoint, regardless of how the communicators have been
> derived. You do not want an inherited interface, either explicit
> or implicit (explicit is too hard to use and does not solve the
> real problem of excessive cost being forced on the user by
> the interface; similarly implicit but only inherited does not
> solve the problem). The key is to provide some mechanism to
> alert the user that the communicator has been backdoor validated.
>
So if the application calls MPI_Comm_validate_many(MPI_COMM_WORLD)
then all other communicators containing the identified failed ranks
would automatically be validated in the same step. We could return a
special error code MPI_ERR_YOU_HAVE_BEEN_VALIDATED on the first use of
the derived communicator after the validate call completes (maybe with
some protection for point-to-point operations to non-failed
processes). This would warn the application that after this point
collectives are re-enabled for this derived communicator, and provide
the notification of being backdoor validated. The application/library
would be forced to check the return code from all MPI calls for this
special error code if it uses collective operations.
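
As a rough sketch of that first-use notification, again in plain C
rather than MPI: the toy_derived_comm type, the TOY_* codes, and the
function names below are hypothetical and only model the bookkeeping. I
am also assuming that the call which returns the special code does not
perform the collective, so the caller has to reissue it; the thread
does not settle that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes for this toy model (not MPI error classes). */
enum {
    TOY_SUCCESS = 0,
    TOY_ERR_YOU_HAVE_BEEN_VALIDATED = 1,
    TOY_ERR_COLLECTIVES_DISABLED = 2
};

/* Toy model of a derived communicator's fault-tolerance state. */
typedef struct {
    bool collectives_enabled;        /* re-enabled by a validate            */
    bool validation_notice_pending;  /* set when validated behind our back  */
} toy_derived_comm;

/* Models the side effect of someone else calling validate_many on a
 * communicator that shares failed processes with this one. */
static void toy_backdoor_validate(toy_derived_comm *c)
{
    assert(c != NULL);
    c->collectives_enabled = true;
    c->validation_notice_pending = true;
}

/* Models the entry check of any collective on the derived communicator.
 * The first call after a backdoor validate reports the notice once;
 * later calls proceed normally. */
static int toy_collective(toy_derived_comm *c)
{
    assert(c != NULL);
    if (c->validation_notice_pending) {
        c->validation_notice_pending = false;
        return TOY_ERR_YOU_HAVE_BEEN_VALIDATED;
    }
    if (!c->collectives_enabled) {
        return TOY_ERR_COLLECTIVES_DISABLED;
    }
    return TOY_SUCCESS;
}
```

The notice is a single flag per communicator, so the storage cost is
small, but it does not carry the failed group; a library that needs the
group would still have to get it some other way.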
I'm not sure what we might need to say about the state of operations
on other communicators at the time the MPI_Comm_validate_many() call
is active or the concurrent use of MPI_Comm_validate_many() in
multiple communicators. Are there any threading issues between a
library working in a thread calling collectives and the main
application periodically calling MPI_Comm_validate_many()?
What do folks think?
-- Josh
>
>
> On Thu, 15 Sep 2011, Josh Hursey wrote:
>
>> On Wed, Sep 14, 2011 at 4:17 PM, Sur, Sayantan <sayantan.sur at intel.com>
>> wrote:
>>>
>>> Hi Josh,
>>>
>>>>
>>>> Workaround:
>>>> --------------------
>>>> Call MPI_Comm_validate over all of the communicators individually.
>>>> This would involve 'num_comms' collective operations and likely impede
>>>> scalability.
>>>>
>>>> for(i=0; i < num_comms; ++i) {
>>>>   MPI_Comm_validate(comm[i], failed_grps[i]);
>>>> }
>>>>
>>>
>>> Would it not be possible for the app to create a communicator that is the
>>> union of all the processes, and subsequently call validate only on that
>>> 'super' communicator? I hope I am not missing something from your example.
>>
>> In the current spec, no. MPI_Comm_validate only changes the state of
>> the communicator passed. We probably want to create a new API like
>> MPI_Comm_validate_many() to host these new semantics.
>>
>> It is important to remember that the validate operation changes the
>> communicator (primarily just the 'are_collectives_enabled' flag on the
>> communicator), and not anything to do with elements of the group that
>> form it.
>>
>> Currently after creation, a communicator does not need to track from
>> which communicators it was created (at least that's the way I
>> understand it). So creating a super communicator and calling
>> MPI_Comm_validate_many() on that would require such tracking to have
>> the validation propagate to all of the communicators that built it. So
>> we could do it, but the additional state tracking would force
>> additional memory consumption even if the operation is never used,
>> which is slightly problematic.
>>
>>
>>>
>>> I liked your Option B as such, however, as you point out, it has
>>> significant problems in case of applications consisting of several layers of
>>> libraries.
>>
>>
>> Thinking through your question above, I think Option B would require
>> that we track the heritage of communicators after creation, which
>> would increase memory consumption. It would also require us to
>> maintain that linkage across communicator destruction. For example,
>> -----------
>> MPI_Comm_dup(MPI_COMM_WORLD, &commA);
>> // MCW is linked to commA
>> MPI_Comm_dup(commA, &commB);
>> // MCW is linked to commA
>> // commA is linked to commB
>> MPI_Comm_dup(commB, &commC);
>> // MCW is linked to commA
>> // commA is linked to commB
>> // commB is linked to commC
>> MPI_Comm_free(&commB);
>> // MCW is linked to commA
>> // commA is linked to commC (since commB is now gone)
>> ------------
>>
>> In the discussion so far it seems that the inheritance is only one
>> way. Meaning that in the example above calling
>> MPI_Comm_validate_many() on commA would validate commC (and commB if
>> it is still around), but not MPI_COMM_WORLD. Is that what we are
>> looking for, or do we want it to be more complete?
>>
>>
>> The explicit linking in Option C puts the user in more control over
>> the overhead of tracking connections between communicators, but has
>> other issues. :/
>>
>> -- Josh
>>
>>
>>>
>>> Sayantan.
>>>
>>>>
>>>> Option A:
>>>> Array of communicators
>>>> --------------------
>>>> MPI_Comm_validate_many(comm[], num_comms, failed_grp)
>>>> Validate 'num_comms' communicators, and return a failed group.
>>>> - or -
>>>> MPI_Comm_validate_many(comm[], num_comms, failed_grps[])
>>>> Validate 'num_comms' communicators, and return a failed group for each
>>>> communicator.
>>>> ----
>>>>
>>>> In this version of the operation the user passes in an array of
>>>> pointers to communicators. Since communicators are not often created
>>>> in a contiguous array, pointers to communicators should probably be
>>>> used. The failed_grps is an array of failures in each of those
>>>> communicators.
>>>>
>>>> Some questions:
>>>> * Should all processes pass in the same set of communicators?
>>>> * Should all communicators be duplicates or subsets of one another?
>>>> * Does this operation run the risk of a circular dependency if the
>>>> user does not pass in the same set of communicators at all
>>>> participating processes? Is that something the MPI library should
>>>> protect the application from?
>>>>
>>>>
>>>> Option B:
>>>> Implicit inherited validation
>>>> --------------------
>>>> MPI_Comm_validate_many(comm, failed_grp)
>>>> ----
>>>>
>>>> The idea is to add an additional semantic (or maybe a new API) that
>>>> allows the validation of a communicator to automatically validate all
>>>> communicators created from it (only dups and subsets of it?).
>>>>
>>>> The problem with this is that if an application calls
>>>> MPI_Comm_validate on MPI_COMM_WORLD, it changes the semantics of
>>>> communicators that libraries might be using internally without
>>>> notification in those libraries. So this breaks the abstraction
>>>> barrier between the two in possibly a dangerous way.
>>>>
>>>> Some questions:
>>>> * Are there some other semantics that we can add to help protect
>>>> libraries? (e.g., after implicit validation the first use of the
>>>> communicator will return a special error code indicating that the
>>>> communicator has been adjusted).
>>>> * Are there thread safety issues involved with this? (e.g., the
>>>> library operates in a concurrent thread with its own duplicate of the
>>>> communicator. The application does not know about or control the
>>>> concurrent thread but calls MPI_Comm_validate on its own communicator
>>>> and implicitly changes the semantics of the duplicate communicator.)
>>>> * It is only through the call to MPI_Comm_validate that we can
>>>> provide a uniform group of failed processes globally known. For those
>>>> that were implicitly validated, do we need to provide a way to access
>>>> this group after the call? Does this have implications on the amount
>>>> of storage required for this semantic?
>>>>
>>>>
>>>> Option C:
>>>> Explicit inherited validation
>>>> --------------------
>>>> MPI_Comm_validate_link(commA, commB);
>>>> MPI_Comm_validate_many(commA, failed_grp)
>>>> /* Implies MPI_Comm_validate(commB, NULL) */
>>>>
>>>> MPI_Comm_validate(commA, failed_grp)
>>>> /* Does not imply MPI_Comm_validate(commB, NULL) */
>>>> ----
>>>>
>>>> In this version the application explicitly links communicators. This
>>>> prevents an application from implicitly altering derived communicators
>>>> out of their scope (e.g., in use by other libraries).
>>>>
>>>> Some questions:
>>>> * It is only through the call to MPI_Comm_validate that we can
>>>> provide a uniform group of failed processes globally known. For those
>>>> that were implicitly validated, do we need to provide a way to access
>>>> this group after the call (e.g., for commB)? Does this have
>>>> implications on the amount of storage required for this semantic?
>>>> * Do we need a mechanism to 'unlink' communicators? Or determine
>>>> which communicators are linked?
>>>> * Can a communicator be linked to multiple other communicators?
>>>> * Is the linking a unidirectional operation? (so in the example above
>>>> validating commB does not validate commA unless there is a separate
>>>> MPI_Comm_validate_link(commB, commA) call)
>>>>
>>>>
>>>> Option D:
>>>> Other
>>>> --------------------
>>>> Something else...
>>>>
>>>>
>>>> Thoughts?
>>>>
>>>> -- Josh
>>>>
>>>>
>>>> --
>>>> Joshua Hursey
>>>> Postdoctoral Research Associate
>>>> Oak Ridge National Laboratory
>>>> http://users.nccs.gov/~jjhursey
>>>> _______________________________________________
>>>> mpi3-ft mailing list
>>>> mpi3-ft at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey