[Mpi3-ft] MPI_Comm_validate parameters

Joshua Hursey jjhursey at open-mpi.org
Mon Feb 28 09:16:09 CST 2011


Darius - Thanks for putting this together. I think it captures our conversation on the teleconf really well.

I actually do not mind the 'local' and 'global' specifiers on the function names. I think they help make the scope of the call explicit, whereas '_all' is a bit more subtle.


Below is a draft of some prose that I was thinking of adding to the proposal (probably after the discussion of a "perfect detector") - let me know what you think:
-----------------------
Let "F" be the full set of process failures in any given communicator at any point in time. It cannot be assumed that any one process will know "F" at all points in time - due to propagation delay, implementation constraints, etc. Since a perfect failure detector is assumed, the strong completeness attribute provides that eventually all active processes will know of all process failures ("F"), but not necessarily at the same time. The interfaces provided allow an application to query for a locally consistent set of process failures ("L") and a globally consistent set of process failures ("G"), defined below.

Let "L" be a subset of failed processes in a given communicator consistent to the local process at some point in time ("L <= F"). The ability to query for just "L", even though it is not globally consistent, is useful for applications that rely on point-to-point communication and do not often need a globally consistent view of the failed processes in the communicator.

Let "G" be a subset of failed processes in a given communicator consistent between the full set of active processes in the communicator at some point in time ("G <= F"). "G" represents the union of all process failures known to at each active process ("L") in the communicator at some point in time ("G = union of all L at logical time T"). Since a local process may know if additional failures, "G" is known as a subset of "L" ("G <= L"). Globally consistent views of the number of process failures in a communicator are useful for applications that rely on periodic global synchronization and collective operations.
-----------------------
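
As a concrete (and purely hypothetical) illustration of these sets: suppose a communicator has processes {0,...,7} and ranks 2 and 5 have actually failed, so F = {2,5}. If rank 2's failure was detected everywhere and a global validate ran before rank 5 failed, then G = {2} at every process. Afterwards, process 1 may have already detected rank 5 (its L = {2,5}) while process 0 has not (its L = {2}); at every process the containment G <= L <= F still holds.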


As far as an interface for querying these sets goes, I have a few additional thoughts.

Since we have an MPI_Rank_info object that conveys the {rank,state,generation} information, we might want to think about a couple of more generic query functions. As we noted earlier, users may want to know about various subsets of the processes depending on their particular need. So I was thinking about allowing the states to be bitwise OR'ed together (see the sketch below the lists).

We have the following states (prefix with MPI_RANK_STATE_):
 - OK (active)
 - FAILED (failed, unrecognized)
 - NULL (failed, recognized)

We could add a few new modifiers (prefix with MPI_RANK_STATE_MOD_):
 - NEW (since last call to {global|local} validate)
 - OLD (before last call to {global|local} validate)
 - RECOGNIZED (-- maybe to replace the NULL state above?)
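
To be concrete about the OR'ing, here is a minimal sketch of how the states and modifiers could be laid out as non-overlapping bits (names are from above; the specific values are purely illustrative, not a proposal):
----------------------------
/* Illustrative bit layout only; the actual values would be up to the
 * implementation / the Forum. */
#define MPI_RANK_STATE_OK             (1 << 0)   /* active                 */
#define MPI_RANK_STATE_FAILED         (1 << 1)   /* failed, unrecognized   */
#define MPI_RANK_STATE_NULL           (1 << 2)   /* failed, recognized     */

#define MPI_RANK_STATE_MOD_NEW        (1 << 8)   /* since the last validate  */
#define MPI_RANK_STATE_MOD_OLD        (1 << 9)   /* before the last validate */
#define MPI_RANK_STATE_MOD_RECOGNIZED (1 << 10)  /* possible NULL replacement */

/* e.g., failed-and-recognized ranks discovered since the last validate: */
int type = MPI_RANK_STATE_NULL | MPI_RANK_STATE_MOD_NEW;
----------------------------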


To determine "L" or "G", an application would use the following functions:
----------------------------
MPI_Comm_validate_local(comm, &num_failed)
  - Local operation
  - Update L
  - num_failed = |L| (both recognized and unrecognized)

MPI_Comm_validate_global(comm, &num_failed)
  - Collective operation
  - Update G
  - Update L = G
  - num_failed = |L| = |G|
----------------------------
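
As an aside on the "Update G" step: one way to realize it is an all-reduce over a per-rank failure flag, which makes the "union of all L" semantics explicit. A rough sketch of just that step (purely illustrative -- it glosses over how the collective itself survives failures, and a real implementation could keep the sets in a compressed or distributed form):
----------------------------
/* Sketch of the union-of-L step inside a global validate.
 * local_failed[i]  == 1 iff rank i is in this process's L
 * global_failed[i] == 1 iff rank i ends up in G            */
int size;
MPI_Comm_size(comm, &size);
int *local_failed  = calloc(size, sizeof(int));
int *global_failed = calloc(size, sizeof(int));
/* ... set local_failed[] from the local failure detector (L) ... */
MPI_Allreduce(local_failed, global_failed, size, MPI_INT, MPI_BOR, comm);
/* global_failed[] now holds the union of every process's L, i.e. G;
 * L is then set equal to G, and num_failed = |G|.          */
----------------------------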


Accessors have the following properties:
 - These are local operations
 - None of them modify "L" or "G"
 - Take an or'ed list of states and modifiers to determine 'type'
 - If incount = 0, then outcount = |L| or |G|, rank_infos ignored

----------------------------
MPI_Comm_get_state_local(comm, type, incount, &outcount, rank_infos[])
 - Local operation
 - Returns the set of processes in "L" that match the 'type' specified
 - outcount = min(incount, |L|)
 - MPI_ERR_SIZE if incount != 0 and incount < |L|

MPI_Comm_get_state_global(comm, type, incount, &outcount, rank_infos[])
 - Local operation
 - Returns the set of processes in "G" that match the 'type' specified
 - outcount = min(incount, |G|)
 - MPI_ERR_SIZE if incount != 0 and incount < |G|
----------------------------


So an application can do something like:
------------
MPI_Comm_validate_global(comm, &num_failed_start);
/* Do work */
MPI_Comm_validate_global(comm, &num_failed_end);

if( num_failed_start < num_failed_end ) { /* something failed */
  /* First query with incount = 0 to learn how many entries are needed */
  incount = 0;
  MPI_Comm_get_state_global(comm,
    MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
    incount, &outcount, NULL);
  /* Allocate room for 'outcount' MPI_Rank_info entries */
  rank_infos = malloc(... * outcount);
  incount = outcount;
  /* Second query fills in the newly recognized failures */
  MPI_Comm_get_state_global(comm,
    MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
    incount, &outcount, rank_infos);
}
------------

Instead of having the 'if incount = 0' rule, we could just introduce a new function like:
----------------------------
MPI_Comm_get_num_state_local(comm, type, &count);
MPI_Comm_get_num_state_global(comm, type, &count);
----------------------------
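
With those, the query pattern from the example above would look something like:
----------------------------
MPI_Comm_get_num_state_global(comm,
  MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW, &count);
rank_infos = malloc(... * count);   /* room for 'count' MPI_Rank_info entries */
MPI_Comm_get_state_global(comm,
  MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
  count, &outcount, rank_infos);
----------------------------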
Seems a bit cleaner, but adds a couple of new functions. I have no real preference here.


What do you all think about these new interface variations?

-- Josh



On Feb 26, 2011, at 12:09 PM, Darius Buntinas wrote:

> I think we're in agreement here, well, with everything but the part of your first sentence up to the comma :-).
> 
> If it seemed to imply that the state of the system can be known reliably by all, then I need to adjust the wording.  We never assume that F is known everywhere; rather, we use L as an approximation of it.  Doing a validate_global just takes the union of the L sets of each process to ensure that everyone has the same view before trying to do a collective.
> 
> The rest is what I was trying to convey :-).
> 
> -d
> 
> On Feb 25, 2011, at 11:13 PM, Graham, Richard L. wrote:
> 
>> There seems to be a notion that the state of the system can be known by all
>> at some point in the run, and I would argue that it is not possible to
>> guarantee this, just because of the time it takes information to
>> propagate.  So, it is incumbent on us to assume that the information we
>> have is outdated and incomplete, and set up the support to reflect this.
>> 
>> So the collective validate will return the state of the communicator just
>> before the call, which is sufficient for users to validate the integrity
>> of their data.  Any subsequent errors will be caught by subsequent MPI
>> calls.
>> 
>> Having said that, if the intent of the "local" call is to avoid global
>> synchronization so that communication that is local in nature can proceed,
>> that seems like a good idea.
>> 
>> Rich
>> 
>> On 2/25/11 4:27 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
>> 
>>> 
>>> Well, in an asynchronous system, you can't guarantee that every process
>>> has the complete picture of failed processes, so you can only count on a
>>> partial answer. And this is what the set L is: a process's approximation
>>> of F.  Recall that G is a SUBSET of F, not the other way around.
>>> 
>>> So the question is really why do we need G?  Comm_validate_global, which
>>> updates G (and is collective and potentially expensive), is a way of
>>> synchronizing with the other processes in the communicator so everyone
>>> knows that everyone has the same version of G.  This way the processes
>>> can adjust their collective communication patterns taking into account
>>> the failed processes.
>>> 
>>> One thing I should have perhaps made clear was that while G is always the
>>> same for every process, each process may have a different L.
>>> 
>>> The Comm_validate_local function really just serves to update L so
>>> that the user can query for the set.  This prevents the problem where the
>>> set might change between when the user got the size of the set and
>>> allocated the array, and when the user queried for the set.
>>> 
>>> -d
>>> 
>>> 
>>> 
>>> On Feb 25, 2011, at 2:47 PM, Graham, Richard L. wrote:
>>> 
>>>> Why would an application want to distinguish between what you have
>>>> called
>>>> local processes and global processes ?  Seems that an app would just
>>>> want
>>>> to ask for the set of failed procs, and then decide if to have local or
>>>> global response.  I can't think of a use case that would deliberately
>>>> want
>>>> only a partial answer to that question.
>>>> 
>>>> Rich
>>>> 
>>>> On 2/25/11 2:58 PM, "Darius Buntinas" <buntinas at mcs.anl.gov> wrote:
>>>> 
>>>>> 
>>>>> On the last concall, we discussed the semantics of the comm validate
>>>>> functions, and how to query for failed processes.  Josh asked me to
>>>>> capture these ideas and send it out to the list.  Below is my attempt
>>>>> at
>>>>> this.  Please let me know if I missed anything from the discussion.
>>>>> 
>>>>> Comments are encouraged.
>>>>> 
>>>>> -d
>>>>> 
>>>>> 
>>>>> Processes' Views of Failed Processes
>>>>> ------------------------------------
>>>>> 
>>>>> For each communicator, with the set of processes C = {0, ...,
>>>>> size_of_communicator-1}, consider three sets of failed processes:
>>>>> 
>>>>> F, the set of actually failed processes in the communicator;
>>>>> 
>>>>> L, the set of processes of the communicator known to be failed by
>>>>> the local process; and
>>>>> 
>>>>> G, the set of processes of the communicator that the local process
>>>>> knows that every process in the communicator knows to be failed.
>>>>> 
>>>>> Note that C >= F >= L >= G.  Because the process may have an imperfect
>>>>> view of the state of the system, a process may never know F, so we
>>>>> allow the process to query only for L or G.
>>>>> 
>>>>> Determining the Sets of Failed Processes
>>>>> ----------------------------------------
>>>>> 
>>>>> MPI provides the following functions to determine L and G.  [These
>>>>> names can, and probably should, change.]
>>>>> 
>>>>> MPI_Comm_validate_local(comm, &num_failed, &num_new) -- This is a
>>>>>     local function that checks for failed processes and updates
>>>>>     the set L.  Note that this check for failed processes may be
>>>>>     imperfect, in that even if no additional processes fail L
>>>>>     might not be the same as F.  When the function returns,
>>>>>     num_failed = |L| and num_new will be the number of newly
>>>>>     discovered failed processes, i.e., if L' = L before
>>>>>     MPI_Comm_validate_local was called, then num_new = |L|-|L'|.
>>>>> 
>>>>> MPI_Comm_validate_global(comm, &num_failed, &num_new) -- This is
>>>>>     collective.  Each process in the communicator checks for
>>>>>     failed processes, performs an all-reduce to take the union of
>>>>>     all the processes known to have failed by each process, then
>>>>>     updates the sets G and L.  After this function returns L = G.
>>>>>     The parameters num_failed and num_new are set the same as in
>>>>>     MPI_Comm_validate_local.
>>>>> 
>>>>> Note that the set L will not change outside of a call to
>>>>> MPI_Comm_validate_local or MPI_Comm_validate_global, and the set G
>>>>> will not change outside of a call to MPI_Comm_validate_global.
>>>>> 
>>>>> Querying the Sets of Failed Processes
>>>>> -------------------------------------
>>>>> 
>>>>> MPI also provides functions to allow the user to query L and G.
>>>>> 
>>>>> MPI_Comm_get_failed_local(comm, failed_type, failed_array) -- This
>>>>>     is a local function.  The parameter failed_type is one of
>>>>>     MPI_FAILED_ALL, MPI_FAILED_NEW.
>>>>> 
>>>>>     When failed_type = MPI_FAILED_ALL, the set L is returned in
>>>>>     failed_array.  The size of the array returned is the value
>>>>>     returned in num_failed in the last call to
>>>>>     MPI_Comm_validate_local or MPI_Comm_validate_global.  The
>>>>>     caller of the function must ensure that failed_array is large
>>>>>     enough to hold num_failed integers.
>>>>> 
>>>>>     When failed_type = MPI_FAILED_NEW, the set of newly discovered
>>>>>     failed processes is returned in failed_array.  I.e., if L' = L
>>>>>     before MPI_Comm_validate_local or MPI_Comm_validate_global was
>>>>>     called, then failed_array contains the list of processes in
>>>>>     the set L - L'.  It is the caller's responsibility to ensure
>>>>>     that the failed_array is large enough to hold num_new
>>>>>     integers, where num_new is the value returned the last time
>>>>>     MPI_Comm_validate_local or MPI_Comm_validate_global was
>>>>>     called.
>>>>> 
>>>>> MPI_Comm_get_failed_global(comm, failed_type, failed_array) -- This
>>>>>     is a local function.  The parameter failed_type is one of
>>>>>     MPI_FAILED_ALL, MPI_FAILED_NEW.
>>>>> 
>>>>>     When failed_type = MPI_FAILED_ALL, the set G is returned in
>>>>>     failed_array.  The size of the array is the value returned in
>>>>>     num_failed in the last call to MPI_Comm_validate_global.  The
>>>>>     caller of the function must ensure that the failed_array is
>>>>>     large enough to hold num_failed integers.
>>>>> 
>>>>>     When failed_type = MPI_FAILED_NEW, the set of newly discovered
>>>>>     failed processes is returned in failed_array.  I.e., if G' = G
>>>>>     before MPI_Comm_validate_global was called, then failed_array
>>>>>     contains the list of processes in the set G - G'.  It is the
>>>>>     caller's responsibility to ensure that the failed_array is
>>>>>     large enough to hold num_new integers, where num_new is the
>>>>>     value returned the last time MPI_Comm_validate_global was
>>>>>     called.
>>>>> 
>>>>> [Initially, I included a num_uncleared variable in
>>>>> MPI_Comm_validate_local to return the number of uncleared processes,
>>>>> but since the number of cleared processes does not change in that
>>>>> function, it seemed redundant.  Instead, I'm thinking we should have a
>>>>> separate mechanism to query the number and set of cleared/uncleared
>>>>> processes to go along with the functions for clearing a failed
>>>>> process.]
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Implementation Notes: Memory Requirements for Maintaining the Sets of
>>>>> Failed Processes
>>>>> ---------------------------------------------------------------------
>>>>> 
>>>>> One set G is needed per communicator.  The set G can be stored in a
>>>>> distributed fashion across the processes of the communicator to reduce
>>>>> the storage needed by each individual process.  [What happens if some
>>>>> of the processes storing the distributed array fail?  We can store
>>>>> them with some redundancy and tolerate K failures.  We can also
>>>>> recompute the set, but we'd need the old value to be able to return
>>>>> the FAILED_NEW set.]
>>>>> 
>>>>> L cannot be stored in a distributed way, because each process may have
>>>>> a different set L.  However, since L >= G, L can be represented as the
>>>>> union of G and L', so only L' needs to be stored locally.  Presumably
>>>>> L' would be small.  Note that the implementation is free to let L' be
>>>>> a subset of the processes it has connected to or tried to connect to.
>>>>> So the set L' would be about as large as the number of connected
>>>>> processes, and can be represented as a state variable (e.g., UP, or
>>>>> DOWN) in the data structure of connected processes.  Of course
>>>>> pathological cases can be found, but these would require (1) that
>>>>> there are a very large number of failed processes, and (2) that the
>>>>> process has tried to connect to a very large number of them.
>>>>> 
>>>>> The storage requirements to keep track of the set of cleared processes
>>>>> are similar to those of storing L.  Since G is a subset of the set of
>>>>> cleared processes which itself is a subset of L, the set of cleared
>>>>> processes can be represented as an additional state in the connected
>>>>> processes data structure (e.g., UP, DOWN or CLEARED).  Note that after
>>>>> a global validate, L = G = the set of cleared processes, so calling
>>>>> the global validate function could potentially free up storage.  A
>>>>> note about this and the memory requirements of clearing processes
>>>>> locally can be included in advice to users.
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




