[Mpi3-ft] MPI_Comm_validate parameters

Darius Buntinas buntinas at mcs.anl.gov
Fri Feb 25 15:27:51 CST 2011


Well, in an asynchronous system, you can't guarantee that every process has the complete picture of failed processes, so you can only count on a partial answer. And this is what the set L is: a process's approximation of F.  Recall that G is a SUBSET of F, not the other way around.

So the question is really why do we need G?  Comm_validate_global, which updates G (and is collective and potentially expensive), is a way of synchronizing with the other processes in the communicator so everyone knows that everyone has the same version of G.  This way the processes can adjust their collective communication patterns taking into account the failed processes.

One thing I should have perhaps made clear was that while G is always the same for every process, each process may have a different L.

The Comm_validate_local function really just functions to update L so that the user can query for the set.  This prevents the problem where the set might change between when the user got the size of the set and allocated the array, and when the user queried for the set.

-d



On Feb 25, 2011, at 2:47 PM, Graham, Richard L. wrote:

> Why would an application want to distinguish between what you have called
> local processes and global processes ?  Seems that an app would just want
> to ask for the set of failed procs, and then decide if to have local or
> global response.  I can't think of a use case that would deliberately want
> only a partial answer to that question.
> 
> Rich
> 
> On 2/25/11 2:58 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
> 
>> 
>> On the last concall, we discussed the semantics of the comm validate
>> functions, and how to query for failed processes.  Josh asked me to
>> capture these ideas and send it out to the list.  Below is my attempt at
>> this.  Please let me know if I missed anything from the discussion.
>> 
>> Comments are encouraged.
>> 
>> -d
>> 
>> 
>> Processes' Views of Failed Processes
>> ------------------------------------
>> 
>> For each communicator, with the set of processes C = {0, ...,
>> size_of_communicator-1}, consider three sets of failed processes:
>> 
>>   F, the set of actually failed processes in the communicator;
>> 
>>   L, the set of processes of the communicator known to be failed by
>>   the local process; and
>> 
>>   G, the set of processes of the communicator that the local process
>>   knows that every process in the communicator knows to be failed.
>> 
>> Note that C >= F >= L >= G.  Because the process may have an imperfect
>> view of the state of the system, a process may never know F, so we
>> allow the process to query only for L or G.
>> 
>> Determining the Sets of Failed Processes
>> ----------------------------------------
>> 
>> MPI provides the following functions to determine L and G.  [These
>> names can, and probably should, change.]
>> 
>>   MPI_Comm_validate_local(comm, &num_failed, &num_new) -- This is a
>>       local function that checks for failed processes and updates
>>       the set L.  Note that this check for failed processes may be
>>       imperfect, in that even if no additional processes fail L
>>       might not be the same as F.  When the function returns,
>>       num_failed = |L| and num_new will be the number of newly
>>       discovered failed processes, i.e., if L' = L before
>>       MPI_Comm_validate_local wass called, then num_new = |L|-|L'|.
>> 
>>   MPI_Comm_validate_global(comm, &num_failed, &num_new) -- This is
>>       collective.  Each process in the communicator checks for
>>       failed processes, performs an all-reduce to take the union of
>>       all the processes known to have failed by each process, then
>>       updates the sets G and L.  After this function returns L = G.
>>       The parameters num_failed and num_new are set the same as in
>>       MPI_Comm_validate_local.
>> 
>> Note that the set L will not change outside of a call to
>> MPI_Comm_validate_local or MPI_Comm_validate_global, and the set G
>> will not change outside of a call to MPI_Comm_validate_global.
>> 
>> Querying the Sets of Failed Processes
>> -------------------------------------
>> 
>> MPI also provides functions to allow the user to query L and G.
>> 
>>   MPI_Comm_get_failed_local(comm, failed_type, failed_array) -- This
>>       is a local function.  The parameter failed_type is one of
>>       MPI_FAILED_ALL, MPI_FAILED_NEW.
>> 
>>       When failed_type = MPI_FAILED_ALL, the set L is returned in
>>       failed_array.  The size of the array returned is the value
>>       returned in num_failed in the last call to
>>       MPI_Comm_validate_local or MPI_Comm_validate_global.  The
>>       caller of the function must ensure that failed_array is large
>>       enough to hold num_failed integers.
>> 
>>   When failed_type = MPI_FAILED_NEW, the set of newly discovered
>>   failed processes is returned in failed_array.  I.e., if L' = L
>>   before MPI_Comm_validate_local or MPI_Comm_validate_global was
>>   called, then failed_array contains the list of processes in
>>   the set L - L'.  It is the caller's responsibility to ensure
>>   that the failed_array is large enough to hold num_new
>>   integers, where num_new is the value returned the last time
>>   MPI_Comm_validate_local or MPI_Comm_validate_global was
>>   called.
>> 
>>   MPI_Comm_get_failed_global(comm, failed_type, failed_array) -- This
>>       is a local function.  The parameter failed_type is one of
>>       MPI_FAILED_ALL, MPI_FAILED_NEW.
>> 
>>       When failed_type = MPI_FAILED_ALL the set G is retured in
>>       failed_array.  The size of the array is the value returned in
>>       num_failed in the last call to MPI_Comm_validate_global.  The
>>       caller of the function must ensure that the failed_array is
>>       large enough to hold num_failed integers.
>> 
>>   When failed_type = MPI_FAILED_NEW, the set of newly discovered
>>   failed processes is returned in failed_array.  I.e., if G' = G
>>   before MPI_Comm_validate_global was called, then failed_array
>>   contains the list of processes in the set G - G'.  It is the
>>   caller's responsibility to ensure that the failed_array is
>>   large enough to hold num_new integers, where num_new is the
>>   value returned the last time MPI_Comm_validate_global was
>>   called.
>> 
>> [Initially, I included a num_uncleared variable in
>> MPI_Comm_validate_local to return the number of uncleared processes,
>> but since the number of cleared processes does not change in that
>> function, it seemed redundant.  Instead, I'm thinking we should have a
>> separate mechansim to query the number and set of cleared/uncleared
>> processes to go along with the functions for clearing a failed
>> process.]
>> 
>> 
>> 
>> 
>> Implementation Notes: Memory Requirements for Maintaining the Sets of
>> Failed Processes
>> ---------------------------------------------------------------------
>> 
>> One set G is needed per communicator.  The set G can be stored in a
>> distributed fashion across the processes of the communicator to reduce
>> the storage needed by each individual process.  [What happens if some
>> of the processes storing the distributed array fail?  We can store
>> them with some redundancy and tolerate K failures.  We can also
>> recompute the set, but we'd need the old value to be able to return
>> the FAILED_NEW set.]
>> 
>> L cannot be stored in a distributed way, because each process may have
>> a different set L.  However, since L >= G, L can be represented as the
>> union of G and L', so only L' needs to be stored locally.  Presumably
>> L' would be small.  Note that the implementation is free to let L' be
>> a subset of the processes it has connected to or tried to connect to.
>> So the set L' would be about as large as the number of connected
>> processes, and can be represented as a state variable (e.g., UP, or
>> DOWN) in the data structure of connected processes.  Of course
>> pathological cases can be found, but these would require (1) that
>> there are a very large number of failed processes, and (2) that the
>> process has tried to connect to a very large number of them.
>> 
>> The storage requirements to keep track of the set of cleared processes
>> is similar to that of storing L.  Since G is a subset of the set of
>> cleared processes which itself is a subset of L, the set of cleared
>> processes can be represented as an additional state in the connected
>> processes data structure (e.g., UP, DOWN or CLEARED).  Note that after
>> a global validate, L = G = the set of cleared processes, so calling
>> the global validate function could potentially free up storage.  A
>> note about this and the memory requirements of clearing processes
>> locally can be included in advice to users.
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> mpi3-ft mailing list
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft





More information about the mpiwg-ft mailing list