[Mpi3-ft] MPI_Comm_validate parameters
Darius Buntinas
buntinas at mcs.anl.gov
Sat Feb 26 11:09:58 CST 2011
I think we're in agreement here, well, with everything but the part of your first sentence up to the comma :-).
If it seemed to imply that the state of the system can be known reliably by all, then I need to adjust the wording. We never assume that F is known everywhere, rather we use L as an approximation of that. Doing a validate_global just takes the union of the L sets of each process to ensure that everyone has the same view before trying to do a collective.
The rest is what I was trying to convey :-).
-d
On Feb 25, 2011, at 11:13 PM, Graham, Richard L. wrote:
> There seems to be a notion that the state of the system can be known by all
> at some point in the run, and I would argue that it is not possible to
> guarantee this, just because of the time it takes information to
> propagate. So, it is incumbent on us to assume that the information we
> have is outdated and incomplete, and set up the support to reflect this.
>
> So the collective validate will return the state of the communicator just
> before the call, which is sufficient for users to validate the integrity
> of their data. Any subsequent errors will be caught by subsequent MPI
> calls.
>
> Having said that, if the intent of the "local" call is to avoid global
> synchronization so that communication that is local in nature can
> proceed, that seems like a good idea.
>
> Rich
>
> On 2/25/11 4:27 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
>
>>
>> Well, in an asynchronous system, you can't guarantee that every process
>> has the complete picture of failed processes, so you can only count on a
>> partial answer. And this is what the set L is: a process's approximation
>> of F. Recall that G is a SUBSET of F, not the other way around.
>>
>> So the question is really why do we need G? Comm_validate_global, which
>> updates G (and is collective and potentially expensive), is a way of
>> synchronizing with the other processes in the communicator so everyone
>> knows that everyone has the same version of G. This way the processes
>> can adjust their collective communication patterns taking into account
>> the failed processes.
>>
>> One thing I should have perhaps made clear was that while G is always the
>> same for every process, each process may have a different L.
>>
>> The Comm_validate_local function really just serves to update L so
>> that the user can query for the set. This prevents the race where the
>> set changes between the time the user got the size of the set and
>> allocated the array, and the time the user queried for the set.
>>
>> -d
>>
>>
>>
>> On Feb 25, 2011, at 2:47 PM, Graham, Richard L. wrote:
>>
>>> Why would an application want to distinguish between what you have
>>> called local processes and global processes? Seems that an app would
>>> just want to ask for the set of failed procs, and then decide whether
>>> to have a local or global response. I can't think of a use case that
>>> would deliberately want only a partial answer to that question.
>>>
>>> Rich
>>>
>>> On 2/25/11 2:58 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
>>>
>>>>
>>>> On the last concall, we discussed the semantics of the comm validate
>>>> functions, and how to query for failed processes. Josh asked me to
>>>> capture these ideas and send it out to the list. Below is my attempt
>>>> at
>>>> this. Please let me know if I missed anything from the discussion.
>>>>
>>>> Comments are encouraged.
>>>>
>>>> -d
>>>>
>>>>
>>>> Processes' Views of Failed Processes
>>>> ------------------------------------
>>>>
>>>> For each communicator, with the set of processes C = {0, ...,
>>>> size_of_communicator-1}, consider three sets of failed processes:
>>>>
>>>> F, the set of actually failed processes in the communicator;
>>>>
>>>> L, the set of processes of the communicator known to be failed by
>>>> the local process; and
>>>>
>>>> G, the set of processes of the communicator that the local process
>>>> knows that every process in the communicator knows to be failed.
>>>>
>>>> Note that C >= F >= L >= G, where X >= Y denotes that X is a
>>>> superset of Y. Because a process may have an imperfect view of the
>>>> state of the system, a process may never know F, so we allow the
>>>> process to query only for L or G.
>>>>
>>>> Determining the Sets of Failed Processes
>>>> ----------------------------------------
>>>>
>>>> MPI provides the following functions to determine L and G. [These
>>>> names can, and probably should, change.]
>>>>
>>>> MPI_Comm_validate_local(comm, &num_failed, &num_new) -- This is a
>>>> local function that checks for failed processes and updates
>>>> the set L. Note that this check for failed processes may be
>>>> imperfect, in that even if no additional processes fail L
>>>> might not be the same as F. When the function returns,
>>>> num_failed = |L| and num_new will be the number of newly
>>>> discovered failed processes, i.e., if L' = L before
>>>> MPI_Comm_validate_local was called, then num_new = |L|-|L'|.
>>>>
>>>> MPI_Comm_validate_global(comm, &num_failed, &num_new) -- This is
>>>> collective. Each process in the communicator checks for
>>>> failed processes, performs an all-reduce to take the union of
>>>> all the processes known to have failed by each process, then
>>>> updates the sets G and L. After this function returns L = G.
>>>> The parameters num_failed and num_new are set the same as in
>>>> MPI_Comm_validate_local.
>>>>
>>>> Note that the set L will not change outside of a call to
>>>> MPI_Comm_validate_local or MPI_Comm_validate_global, and the set G
>>>> will not change outside of a call to MPI_Comm_validate_global.
>>>>
>>>> Querying the Sets of Failed Processes
>>>> -------------------------------------
>>>>
>>>> MPI also provides functions to allow the user to query L and G.
>>>>
>>>> MPI_Comm_get_failed_local(comm, failed_type, failed_array) -- This
>>>> is a local function. The parameter failed_type is one of
>>>> MPI_FAILED_ALL, MPI_FAILED_NEW.
>>>>
>>>> When failed_type = MPI_FAILED_ALL, the set L is returned in
>>>> failed_array. The size of the array returned is the value
>>>> returned in num_failed in the last call to
>>>> MPI_Comm_validate_local or MPI_Comm_validate_global. The
>>>> caller of the function must ensure that failed_array is large
>>>> enough to hold num_failed integers.
>>>>
>>>> When failed_type = MPI_FAILED_NEW, the set of newly discovered
>>>> failed processes is returned in failed_array. I.e., if L' = L
>>>> before MPI_Comm_validate_local or MPI_Comm_validate_global was
>>>> called, then failed_array contains the list of processes in
>>>> the set L - L'. It is the caller's responsibility to ensure
>>>> that the failed_array is large enough to hold num_new
>>>> integers, where num_new is the value returned the last time
>>>> MPI_Comm_validate_local or MPI_Comm_validate_global was
>>>> called.
>>>>
>>>> MPI_Comm_get_failed_global(comm, failed_type, failed_array) -- This
>>>> is a local function. The parameter failed_type is one of
>>>> MPI_FAILED_ALL, MPI_FAILED_NEW.
>>>>
>>>> When failed_type = MPI_FAILED_ALL, the set G is returned in
>>>> failed_array. The size of the array is the value returned in
>>>> num_failed in the last call to MPI_Comm_validate_global. The
>>>> caller of the function must ensure that the failed_array is
>>>> large enough to hold num_failed integers.
>>>>
>>>> When failed_type = MPI_FAILED_NEW, the set of newly discovered
>>>> failed processes is returned in failed_array. I.e., if G' = G
>>>> before MPI_Comm_validate_global was called, then failed_array
>>>> contains the list of processes in the set G - G'. It is the
>>>> caller's responsibility to ensure that the failed_array is
>>>> large enough to hold num_new integers, where num_new is the
>>>> value returned the last time MPI_Comm_validate_global was
>>>> called.
>>>>
>>>> [Initially, I included a num_uncleared variable in
>>>> MPI_Comm_validate_local to return the number of uncleared processes,
>>>> but since the number of cleared processes does not change in that
>>>> function, it seemed redundant. Instead, I'm thinking we should have a
>>>> separate mechanism to query the number and set of cleared/uncleared
>>>> processes to go along with the functions for clearing a failed
>>>> process.]
>>>>
>>>>
>>>>
>>>>
>>>> Implementation Notes: Memory Requirements for Maintaining the Sets of
>>>> Failed Processes
>>>> ---------------------------------------------------------------------
>>>>
>>>> One set G is needed per communicator. The set G can be stored in a
>>>> distributed fashion across the processes of the communicator to reduce
>>>> the storage needed by each individual process. [What happens if some
>>>> of the processes storing the distributed array fail? We can store
>>>> them with some redundancy and tolerate K failures. We can also
>>>> recompute the set, but we'd need the old value to be able to return
>>>> the FAILED_NEW set.]
>>>>
>>>> L cannot be stored in a distributed way, because each process may have
>>>> a different set L. However, since L >= G, L can be represented as the
>>>> union of G and L', so only L' needs to be stored locally. Presumably
>>>> L' would be small. Note that the implementation is free to let L' be
>>>> a subset of the processes it has connected to or tried to connect to.
>>>> So the set L' would be about as large as the number of connected
>>>> processes, and can be represented as a state variable (e.g., UP, or
>>>> DOWN) in the data structure of connected processes. Of course
>>>> pathological cases can be found, but these would require (1) that
>>>> there are a very large number of failed processes, and (2) that the
>>>> process has tried to connect to a very large number of them.
>>>>
>>>> The storage requirements to keep track of the set of cleared processes
>>>> are similar to those of storing L. Since G is a subset of the set of
>>>> cleared processes which itself is a subset of L, the set of cleared
>>>> processes can be represented as an additional state in the connected
>>>> processes data structure (e.g., UP, DOWN or CLEARED). Note that after
>>>> a global validate, L = G = the set of cleared processes, so calling
>>>> the global validate function could potentially free up storage. A
>>>> note about this and the memory requirements of clearing processes
>>>> locally can be included in advice to users.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> mpi3-ft mailing list
>>>> mpi3-ft at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>
>>>
>>
>>
>
>