[Mpi3-ft] MPI_Comm_validate parameters
Darius Buntinas
buntinas at mcs.anl.gov
Sat Feb 26 11:09:58 CST 2011
I think we're in agreement here, well, with everything but the part of your first sentence up to the comma :-).
If it seemed to imply that the state of the system can be known reliably by all, then I need to adjust the wording. We never assume that F is known everywhere, rather we use L as an approximation of that. Doing a validate_global just takes the union of the L sets of each process to ensure that everyone has the same view before trying to do a collective.
The rest is what I was trying to convey :-).
-d
On Feb 25, 2011, at 11:13 PM, Graham, Richard L. wrote:
> There seems to be a notion that the state of the system can be known by all
> at some point in the run, and I would argue that it is not possible to
> guarantee this, just because of the time it takes information to
> propagate. So, it is incumbent on us to assume that the information we
> have is outdated and incomplete, and set up the support to reflect this.
>
> So the collective validate will return the state of the communicator just
> before the call, which is sufficient for users to validate the integrity
> of their data. Any subsequent errors will be caught by subsequent MPI
> calls.
>
> Having said that, if the intent of the "local" call is to avoid global
> synchronization so that communication that is local in nature can
> proceed, that seems like a good idea.
>
> Rich
>
> On 2/25/11 4:27 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
>
>>
>> Well, in an asynchronous system, you can't guarantee that every process
>> has the complete picture of failed processes, so you can only count on a
>> partial answer. And this is what the set L is: a process's approximation
>> of F. Recall that G is a SUBSET of F, not the other way around.
>>
>> So the question is really why do we need G? Comm_validate_global, which
>> updates G (and is collective and potentially expensive), is a way of
>> synchronizing with the other processes in the communicator so everyone
>> knows that everyone has the same version of G. This way the processes
>> can adjust their collective communication patterns taking into account
>> the failed processes.
>>
>> One thing I should have perhaps made clear was that while G is always the
>> same for every process, each process may have a different L.
>>
>> The Comm_validate_local function really just serves to update L so
>> that the user can query for the set. This prevents the race where the
>> set changes between the time the user got the size of the set and
>> allocated the array, and the time the user queried for the set.
>>
>> -d
>>
>>
>>
>> On Feb 25, 2011, at 2:47 PM, Graham, Richard L. wrote:
>>
>>> Why would an application want to distinguish between what you have
>>> called local processes and global processes? Seems that an app would
>>> just want to ask for the set of failed procs, and then decide whether
>>> to have a local or global response. I can't think of a use case that
>>> would deliberately want only a partial answer to that question.
>>>
>>> Rich
>>>
>>> On 2/25/11 2:58 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
>>>
>>>>
>>>> On the last concall, we discussed the semantics of the comm validate
>>>> functions, and how to query for failed processes. Josh asked me to
>>>> capture these ideas and send it out to the list. Below is my attempt
>>>> at
>>>> this. Please let me know if I missed anything from the discussion.
>>>>
>>>> Comments are encouraged.
>>>>
>>>> -d
>>>>
>>>>
>>>> Processes' Views of Failed Processes
>>>> ------------------------------------
>>>>
>>>> For each communicator, with the set of processes C = {0, ...,
>>>> size_of_communicator-1}, consider three sets of failed processes:
>>>>
>>>> F, the set of actually failed processes in the communicator;
>>>>
>>>> L, the set of processes of the communicator known to be failed by
>>>> the local process; and
>>>>
>>>> G, the set of processes of the communicator that the local process
>>>> knows that every process in the communicator knows to be failed.
>>>>
>>>> Note that C >= F >= L >= G, where X >= Y denotes that X is a
>>>> superset of Y. Because a process may have an imperfect view of the
>>>> state of the system, a process may never know F, so we allow the
>>>> process to query only for L or G.
>>>>
>>>> Determining the Sets of Failed Processes
>>>> ----------------------------------------
>>>>
>>>> MPI provides the following functions to determine L and G. [These
>>>> names can, and probably should, change.]
>>>>
>>>> MPI_Comm_validate_local(comm, &num_failed, &num_new) -- This is a
>>>> local function that checks for failed processes and updates
>>>> the set L. Note that this check for failed processes may be
>>>> imperfect, in that even if no additional processes fail L
>>>> might not be the same as F. When the function returns,
>>>> num_failed = |L| and num_new will be the number of newly
>>>> discovered failed processes, i.e., if L' = L before
>>>> MPI_Comm_validate_local was called, then num_new = |L|-|L'|.
>>>>
>>>> MPI_Comm_validate_global(comm, &num_failed, &num_new) -- This is
>>>> collective. Each process in the communicator checks for
>>>> failed processes, performs an all-reduce to take the union of
>>>> all the processes known to have failed by each process, then
>>>> updates the sets G and L. After this function returns L = G.
>>>> The parameters num_failed and num_new are set the same as in
>>>> MPI_Comm_validate_local.
>>>>
>>>> Note that the set L will not change outside of a call to
>>>> MPI_Comm_validate_local or MPI_Comm_validate_global, and the set G
>>>> will not change outside of a call to MPI_Comm_validate_global.
>>>>
>>>> Querying the Sets of Failed Processes
>>>> -------------------------------------
>>>>
>>>> MPI also provides functions to allow the user to query L and G.
>>>>
>>>> MPI_Comm_get_failed_local(comm, failed_type, failed_array) -- This
>>>> is a local function. The parameter failed_type is one of
>>>> MPI_FAILED_ALL, MPI_FAILED_NEW.
>>>>
>>>> When failed_type = MPI_FAILED_ALL, the set L is returned in
>>>> failed_array. The size of the array returned is the value
>>>> returned in num_failed in the last call to
>>>> MPI_Comm_validate_local or MPI_Comm_validate_global. The
>>>> caller of the function must ensure that failed_array is large
>>>> enough to hold num_failed integers.
>>>>
>>>> When failed_type = MPI_FAILED_NEW, the set of newly discovered
>>>> failed processes is returned in failed_array. I.e., if L' = L
>>>> before MPI_Comm_validate_local or MPI_Comm_validate_global was
>>>> called, then failed_array contains the list of processes in
>>>> the set L - L'. It is the caller's responsibility to ensure
>>>> that the failed_array is large enough to hold num_new
>>>> integers, where num_new is the value returned the last time
>>>> MPI_Comm_validate_local or MPI_Comm_validate_global was
>>>> called.
>>>>
>>>> MPI_Comm_get_failed_global(comm, failed_type, failed_array) -- This
>>>> is a local function. The parameter failed_type is one of
>>>> MPI_FAILED_ALL, MPI_FAILED_NEW.
>>>>
>>>> When failed_type = MPI_FAILED_ALL, the set G is returned in
>>>> failed_array. The size of the array is the value returned in
>>>> num_failed in the last call to MPI_Comm_validate_global. The
>>>> caller of the function must ensure that the failed_array is
>>>> large enough to hold num_failed integers.
>>>>
>>>> When failed_type = MPI_FAILED_NEW, the set of newly discovered
>>>> failed processes is returned in failed_array. I.e., if G' = G
>>>> before MPI_Comm_validate_global was called, then failed_array
>>>> contains the list of processes in the set G - G'. It is the
>>>> caller's responsibility to ensure that the failed_array is
>>>> large enough to hold num_new integers, where num_new is the
>>>> value returned the last time MPI_Comm_validate_global was
>>>> called.
>>>>
>>>> [Initially, I included a num_uncleared variable in
>>>> MPI_Comm_validate_local to return the number of uncleared processes,
>>>> but since the number of cleared processes does not change in that
>>>> function, it seemed redundant. Instead, I'm thinking we should have a
>>>> separate mechanism to query the number and set of cleared/uncleared
>>>> processes to go along with the functions for clearing a failed
>>>> process.]
>>>>
>>>>
>>>>
>>>>
>>>> Implementation Notes: Memory Requirements for Maintaining the Sets of
>>>> Failed Processes
>>>> ---------------------------------------------------------------------
>>>>
>>>> One set G is needed per communicator. The set G can be stored in a
>>>> distributed fashion across the processes of the communicator to reduce
>>>> the storage needed by each individual process. [What happens if some
>>>> of the processes storing the distributed array fail? We can store
>>>> them with some redundancy and tolerate K failures. We can also
>>>> recompute the set, but we'd need the old value to be able to return
>>>> the FAILED_NEW set.]
>>>>
>>>> L cannot be stored in a distributed way, because each process may have
>>>> a different set L. However, since L >= G, L can be represented as the
>>>> union of G and L', so only L' needs to be stored locally. Presumably
>>>> L' would be small. Note that the implementation is free to let L' be
>>>> a subset of the processes it has connected to or tried to connect to.
>>>> So the set L' would be about as large as the number of connected
>>>> processes, and can be represented as a state variable (e.g., UP, or
>>>> DOWN) in the data structure of connected processes. Of course
>>>> pathological cases can be found, but these would require (1) that
>>>> there are a very large number of failed processes, and (2) that the
>>>> process has tried to connect to a very large number of them.
>>>>
>>>> The storage requirements to keep track of the set of cleared processes
>>>> are similar to those of storing L. Since G is a subset of the set of
>>>> cleared processes which itself is a subset of L, the set of cleared
>>>> processes can be represented as an additional state in the connected
>>>> processes data structure (e.g., UP, DOWN or CLEARED). Note that after
>>>> a global validate, L = G = the set of cleared processes, so calling
>>>> the global validate function could potentially free up storage. A
>>>> note about this and the memory requirements of clearing processes
>>>> locally can be included in advice to users.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> mpi3-ft mailing list
>>>> mpi3-ft at lists.mpi-forum.org
>>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>>>
>>>
>>
>>
>
>