[Mpi3-ft] MPI_Comm_validate parameters

Joshua Hursey jjhursey at open-mpi.org
Mon Feb 28 13:50:19 CST 2011


On Feb 28, 2011, at 12:52 PM, Darius Buntinas wrote:

> 
> On Feb 28, 2011, at 9:16 AM, Joshua Hursey wrote:
> 
>> Darius - Thanks for putting this together. I think it captures our conversation on the teleconf really well.
>> 
>> I actually do not mind the 'local' and 'global' specifiers on the function names. I think it helps make explicit the scope of the call, whereas '_all' is a bit more subtle.
> 
> My only concern is people's definition of global.  The global operations only cover the communicator, not comm_world.  I would slightly prefer MPI_Comm_validate_local and MPI_Comm_validate (drop _global), but not by much.  I don't think we need to decide this now, but for the time being we should keep using global to disambiguate our discussion.

Sounds good. Greg provided the alternative of '_collective'. I'm also re-thinking the '_all' postfix. But I agree it's not something we need to decide just now, and keeping the discussion in terms of 'local' and 'global' should keep us on the same page.

> 
>> Below is a draft of some prose that I was thinking of adding to the proposal (probably after the discussion of a "perfect detector" - let me know what you think:
> 
> This is good, I included a few minor suggestions.

Reworked block below
-----------------------
Let "F" be the full set of process failures in any given communicator at any point in time. It cannot be assumed that any one process will know "F" at all points in time - due to propagation delay, implementation constraints, etc. Since a perfect failure detector is assumed, the strong completeness attribute provides that eventually all active processes will know of all process failures ("F"), but not necessarily at the same time. The interfaces provided allow an application to query for a locally consistent set of process failures ("L_i") and a globally consistent set of process failures ("G") in any given communicator, defined below.

Let "L_i" be the subset of failed processes in a given communicator known to the local process ("P_i") at some point in time ("L_i <= F"). The ability to query for just "L_i", even though it is not globally consistent, is useful for applications that rely on point-to-point communication and do not often need a globally consistent view of the failed processes in the communicator.

Let "G" be a subset of failed processes in a given communicator consistent between the full set of active processes in the communicator at some point in time ("G <= F"). "G" represents the union of all process failures known to each active process ("L_i") in the communicator at some point in time ("G = union of all L_i at logical time T"). It is possible for "G" to be a subset of "L_i" ("G <= L_i") since, at some point after logical time T, an active process may locally discover additional failed processes in a given communicator. Globally consistent views of the number of process failures in a communicator are useful for applications that rely on periodic global synchronization and collective operations.
-----------------------

> 
>> -----------------------
>> Let "F" be the full set of process failures in any given communicator at any point in time. It cannot be assumed that any one process will know "F" at all points in time - due to propagation delay, implementation constraints, etc. Since a perfect failure detector is assumed, the strong completeness attribute provides that eventually all active processes
> 
> maybe add "in the communicator" here?
> 
>> will know of all process failures ("F"), but not necessarily at the same time. The interfaces provided allow an application to query for a locally consistent set of process failures ("L") and a globally consistent set of process failures ("G"), defined below.
> 
> Maybe we should use L_i notation to emphasize that L is local to a process.
> 
>> Let "L" be a subset of failed processes in a given communicator consistent to the local process at some point in time ("L <= F"). The ability to query for just "L", even though it is not globally consistent, is useful for applications that rely on point-to-point communication and do not often need a globally consistent view of the failed processes in the communicator.
>> 
>> Let "G" be a subset of failed processes in a given communicator consistent between the full set of active processes in the communicator at some point in time ("G <= F"). "G" represents the union of all process failures known to at each active process ("L") in the communicator at some point in time ("G = union of all L at logical time T").
> 
> Using L_i notation we can be more precise in "union of all L".
> 
>>                                                                              Since a local process may know if additional failures, "G" is known as a subset of "L" ("G <= L").
> 
> Maybe change this sentence to "Since at some point after logical time T a process may learn about additional failed processes locally, adding them to set "L", "G" is a subset of "L" ("G" <= "L")."  because at time T, the process doesn't know of any additional failures.
> 
>> Globally consistent views of the number of process failures in a communicator are useful for applications that rely on periodic global synchronization and collective operations.
>> -----------------------
>> 
>> 
>> As far as an interface to these functions, I have a few additional thoughts.
>> 
>> Since we have a MPI_Rank_info object that conveys the {rank,state,generation} information we might want to think about a couple more generic query functions. As we noted earlier, users may want to know of various subsets of the processes depending on their particular need. So I was thinking about allowing the states to be bitwise or'ed together.
>> 
>> We have the following states (prefix with MPI_RANK_STATE_):
>> - OK (active)
>> - FAILED (failed, unrecognized)
>> - NULL (failed, recognized)
>> 
>> We could add a few new modifiers (prefix with MPI_RANK_STATE_MOD_):
>> - NEW (since last call to {global|local} validate)
>> - OLD (before last call to {global|local} validate)
>> - RECOGNIZED (maybe to replace the NULL state above?)
> 
> I like this idea.

The idea of or'ing states, or the idea of having a 'Recognized' modifier, or both?

> 
>> To determine "L" or "G" they would use the following functions:
>> ----------------------------
>> MPI_Comm_validate_local(comm, &num_failed)
>> - Local operation
>> - Update L
>> - num_failed = |L| (both recognized and unrecognized)
>> 
>> MPI_Comm_validate_global(comm, &num_failed)
>> - Collective operation
>> - Update G
>> - Update L = G
>> - num_failed = |L| = |G|
>> ----------------------------
>> 
>> 
>> Accessors have the following properties:
>> - These are local operations
>> - None of them modify "L" or "G"
>> - Take an or'ed list of states and modifiers to determine 'type'
>> - If incount = 0, then outcount = |L| or |G|, rank_infos ignored
>> 
>> ----------------------------
>> MPI_Comm_get_state_local(comm, type, incount, &outcount, rank_infos[])
>> - Local operation
>> - Returns the set of processes in "L" that match the 'type' specified
>> - outcount = min(incount, |L|)
>> - MPI_ERR_SIZE if incount != 0 and incount < |L|
>> 
>> MPI_Comm_get_state_global(comm, type, incount, &outcount, rank_infos[])
>> - Local operation
>> - Returns the set of processes in "G" that match the 'type' specified
>> - outcount = min(incount, |G|)
>> - MPI_ERR_SIZE if incount != 0 and incount < |G|
>> ----------------------------
>> 
>> 
>> So an application can do something like:
>> ------------
>> MPI_Comm_validate_global(comm, &num_failed_start);
>> /* Do work */
>> MPI_Comm_validate_global(comm, &num_failed_end);
>> 
>> if( num_failed_start < num_failed_end ) { /* something failed */
>> incount = 0;
>> MPI_Comm_get_state_global(comm,
>>   MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
>>   incount, &outcount, NULL);
>> rank_infos = malloc(... * outcount);
>> incount = outcount;
>> MPI_Comm_get_state_global(comm,
>>   MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
>>   incount, &outcount, rank_infos);
>> }
>> ------------
>> 
>> Instead of having the 'if incount = 0' rule, we could just introduce a new function like:
>> ----------------------------
>> MPI_Comm_get_num_state_local(comm, type, &count);
>> MPI_Comm_get_num_state_global(comm, type, &count);
> 
> In that case we can even replace num_failed in the comm_validate functions with a flag: new_failures.  Then use the above to get the counts.

Or even better, eliminate the second argument from MPI_Comm_validate_{local|global} and just pass the communicator, similar to MPI_Barrier. Since the accessor functions always refer to the last update call, there is no real need (other than shorthand) for the additional parameter.

----------------------------
MPI_Comm_validate_local(comm)
 - Local operation
 - Update L

MPI_Comm_validate_global(comm)
 - Collective operation
 - Update G
 - Update L = G


MPI_Comm_get_num_state_local(comm, type, &count)
 - 'type' bitwise or of states and modifiers
 - count = |L|

MPI_Comm_get_num_state_global(comm, type, &count)
 - 'type' bitwise or of states and modifiers
 - count = |G|


MPI_Comm_get_state_local(comm, type, incount, &outcount, rank_infos[])
- Local operation
- Returns the set of processes in "L" that match the 'type' specified
- outcount = min(incount, |L|)
- MPI_ERR_SIZE if incount < |L|

MPI_Comm_get_state_global(comm, type, incount, &outcount, rank_infos[])
- Local operation
- Returns the set of processes in "G" that match the 'type' specified
- outcount = min(incount, |G|)
- MPI_ERR_SIZE if incount < |G|
----------------------------

> 
>> ----------------------------
>> Seems a bit cleaner, but adds a couple new functions. I have no real preference here.
> 
> Reducing the parameters for the validate functions does seem cleaner.  If we're only going to give one count parameter in the validate functions, we should consider which value is most likely to be needed by the application.  E.g., if the app checks for failed procs and for each new failure does some internal clean up and/or clears it, then it might be more useful to return the count of new failures, as opposed to requiring the app to keep track of the previous value or using a get_num_state function.
> 
>> What do you all think about these new interface variations?
> 
> I think the one used in your example is fine, with the possible option of changing the return count to new failures, but I can't decide which is better.

By removing the count parameter from MPI_Comm_validate_{local|global} we get out of the business of deciding which count to return, and let the user query for the count they need explicitly.

The example would now expand out a bit to be:
------------
MPI_Comm_validate_global(comm);
MPI_Comm_get_num_state_global(comm, STATE_NULL|MOD_NEW, &num_failed_start);
/* Do work */
MPI_Comm_validate_global(comm);
MPI_Comm_get_num_state_global(comm, STATE_NULL|MOD_NEW, &num_failed_end);

if( num_failed_start < num_failed_end ) { /* something failed */
  incount = num_failed_end;
  rank_infos = malloc(... * incount);
  MPI_Comm_get_state_global(comm,
      MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
      incount, &outcount, rank_infos);
}
------------

Thoughts?

-- Josh


> 
> -d
> 
> 
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey




