[Mpi3-ft] MPI_Comm_validate parameters

Mon Feb 28 14:13:02 CST 2011

On Feb 28, 2011, at 1:50 PM, Joshua Hursey wrote:

> Reworked block below
> -----------------------

Looks good

>>> We have the following states (prefix with MPI_RANK_STATE_):
>>> - OK (active)
>>> - FAILED (failed, unrecognized)
>>> - NULL (failed, recognized)
>>> 
>>> We could add a few new modifiers (prefix with MPI_RANK_STATE_MOD_):
>>> - NEW (since last call to {global|local} validate)
>>> - OLD (before last call to {global|local} validate)
>>> - RECOGNIZED (maybe to replace the NULL state above?)
>> 
>> I like this idea.
> 
> The idea of or'ing states, or the idea of having a 'Recognized' modifier, or both?

Both.  (But I prefer NULL to RECOGNIZED.)
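
A minimal sketch of how or'ed states and modifiers might be matched against a rank. The bit values, the two masks, and the helper rank_matches() are assumptions made up for illustration; the proposal so far only names the constants. The sketch also assumes that several or'ed states mean "any of these", and that leaving out the modifiers means "don't care".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding: states in the low byte, modifiers in the next. */
enum {
    MPI_RANK_STATE_OK      = 1 << 0, /* active */
    MPI_RANK_STATE_FAILED  = 1 << 1, /* failed, unrecognized */
    MPI_RANK_STATE_NULL    = 1 << 2, /* failed, recognized */
    MPI_RANK_STATE_MOD_NEW = 1 << 8, /* since last validate */
    MPI_RANK_STATE_MOD_OLD = 1 << 9, /* before last validate */

    RANK_STATE_MASK = 0x00ff,        /* assumed helper masks */
    RANK_MOD_MASK   = 0xff00
};

/* Returns true if a rank whose current bits are rank_bits matches type.
 * A type with no state bits matches any state; a type with no modifier
 * bits matches any modifier. */
static bool rank_matches(int rank_bits, int type)
{
    int want_state = type & RANK_STATE_MASK;
    int want_mod   = type & RANK_MOD_MASK;

    /* Every rank is assumed to be in exactly one state. */
    assert((rank_bits & RANK_STATE_MASK) != 0);

    bool state_ok = (want_state == 0) || ((rank_bits & want_state) != 0);
    bool mod_ok   = (want_mod == 0)   || ((rank_bits & want_mod) != 0);
    return state_ok && mod_ok;
}
```

With this encoding, MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW selects only failures recognized since the last validate, while MPI_RANK_STATE_FAILED|MPI_RANK_STATE_NULL selects every failed rank regardless of when it failed.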

>> 
>>> To determine "L" or "G" they would use the following functions:
>>> ----------------------------
>>> MPI_Comm_validate_local(comm, &num_failed)
>>> - Local operation
>>> - Update L
>>> - num_failed = |L| (both recognized and unrecognized)
>>> 
>>> MPI_Comm_validate_global(comm, &num_failed)
>>> - Collective operation
>>> - Update G
>>> - Update L = G
>>> - num_failed = |L| = |G|
>>> ----------------------------
>>> 
>>> 
>>> Accessors have the following properties:
>>> - These are local operations
>>> - None of them modify "L" or "G"
>>> - Take an or'ed list of states and modifiers to determine 'type'
>>> - If incount = 0, then outcount = |L| or |G|, rank_infos ignored
>>> 
>>> ----------------------------
>>> MPI_Comm_get_state_local(comm, type, incount, &outcount, rank_infos[])
>>> - Local operation
>>> - Returns the set of processes in "L" that match the 'type' specified
>>> - outcount = min(incount, |L|)
>>> - MPI_ERR_SIZE if incount != 0 and incount < |L|
>>> 
>>> MPI_Comm_get_state_global(comm, type, incount, &outcount, rank_infos[])
>>> - Local operation
>>> - Returns the set of processes in "G" that match the 'type' specified
>>> - outcount = min(incount, |G|)
>>> - MPI_ERR_SIZE if incount != 0 and incount < |G|
>>> ----------------------------
>>> 
>>> 
>>> So an application can do something like:
>>> ------------
>>> MPI_Comm_validate_global(comm, &num_failed_start);
>>> /* Do work */
>>> MPI_Comm_validate_global(comm, &num_failed_end);
>>> 
>>> if( num_failed_start < num_failed_end ) { /* something failed */
>>> incount = 0;
>>> MPI_Comm_get_state_global(comm,
>>>  MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
>>>  incount, &outcount, NULL);
>>> rank_infos = malloc(... * outcount);
>>> incount = outcount;
>>> MPI_Comm_get_state_global(comm,
>>>  MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
>>>  incount, &outcount, rank_infos);
>>> }
>>> ------------
>>> 
>>> Instead of having the 'if incount = 0' rule, we could just introduce a new function like:
>>> ----------------------------
>>> MPI_Comm_get_num_state_local(comm, type, &count);
>>> MPI_Comm_get_num_state_global(comm, type, &count);
>> 
>> In that case we can even replace num_failed in the comm_validate functions with a flag: new_failures.  Then use the above to get the counts.
> 
> Or even better, eliminate the second argument from the MPI_Comm_validate_{local|global}, and just pass the communicator to it - similar to MPI_Barrier. Since the accessor functions are always related to the last update call there is no real need (other than shorthand) to have the additional parameter.

I considered that, but that would require two calls to determine if anything failed.  Replacing the count with an are_there_new_failures flag would solve that.

> 
> By removing the count parameter from the MPI_Comm_validate_{local|global} we get out of the business of deciding which count to return, and let the user specify it explicitly.
> 
> The example would now expand out a bit to be:
> ------------
> MPI_Comm_validate_global(comm);
> MPI_Comm_get_num_state_global(comm, STATE_NULL|MOD_NEW, &num_failed_start);
> /* Do work */
> MPI_Comm_validate_global(comm);
> MPI_Comm_get_num_state_global(comm, STATE_NULL|MOD_NEW, &num_failed_end);
> 
> if( num_failed_start < num_failed_end ) { /* something failed */
>  incount = num_failed_end;
>  rank_infos = malloc(... * incount);
>  MPI_Comm_get_state_global(comm,
>      MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
>      incount, &outcount, rank_infos);
> }
> ------------

Replacing the count with a flag would look like the code below.  In the common, non-error case you then just take a single branch.  This is less about performance (validate_global is collective anyway) than about convenience for the programmer.

MPI_Comm_validate_global(comm, &new_failures);
/* Do work */
MPI_Comm_validate_global(comm, &new_failures);

if( new_failures ) { /* something failed */
  MPI_Comm_get_num_state_global(comm, STATE_NULL|MOD_NEW, &num_failed_end);
  incount = num_failed_end;
  rank_infos = malloc(... * incount);
  MPI_Comm_get_state_global(comm,
     MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
     incount, &outcount, rank_infos);
}
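
To make the flag variant concrete, here is a self-contained mock that can be compiled without an MPI library. Everything prefixed with mock_ is a stand-in invented for this sketch, as are mock_comm, newly_failed_ranks(), the bit values, and the choice of plain int ranks for rank_infos (the thread leaves that type open). Matching is simplified to "every bit in type must be set".

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bit values; the thread only names the constants. */
enum {
    MPI_RANK_STATE_OK      = 1 << 0,
    MPI_RANK_STATE_NULL    = 1 << 2,  /* failed, recognized */
    MPI_RANK_STATE_MOD_NEW = 1 << 8,
    MPI_RANK_STATE_MOD_OLD = 1 << 9
};

#define MOCK_SIZE 8

/* Stand-in for a communicator: ground truth plus "G" from the last validate. */
typedef struct {
    int alive[MOCK_SIZE];
    int bits[MOCK_SIZE];
} mock_comm;

static void mock_init(mock_comm *c)
{
    for (int r = 0; r < MOCK_SIZE; r++) {
        c->alive[r] = 1;
        c->bits[r] = MPI_RANK_STATE_OK;
    }
}

/* Mock of MPI_Comm_validate_global(comm, &new_failures). */
static void mock_validate_global(mock_comm *c, int *new_failures)
{
    assert(c != NULL && new_failures != NULL);
    *new_failures = 0;
    for (int r = 0; r < MOCK_SIZE; r++) {
        if (c->bits[r] & MPI_RANK_STATE_NULL) {
            /* Already recognized by an earlier call. */
            c->bits[r] = MPI_RANK_STATE_NULL | MPI_RANK_STATE_MOD_OLD;
        } else if (!c->alive[r]) {
            /* Recognized by this call. */
            c->bits[r] = MPI_RANK_STATE_NULL | MPI_RANK_STATE_MOD_NEW;
            *new_failures = 1;
        }
    }
}

/* Simplified matching: every bit set in type must be set for the rank. */
static int matches(int bits, int type) { return (bits & type) == type; }

/* Mock of MPI_Comm_get_num_state_global(comm, type, &count). */
static void mock_get_num_state_global(const mock_comm *c, int type, int *count)
{
    *count = 0;
    for (int r = 0; r < MOCK_SIZE; r++)
        if (matches(c->bits[r], type))
            (*count)++;
}

/* Mock of MPI_Comm_get_state_global(); returns -1 where the proposal
 * would return MPI_ERR_SIZE, 0 otherwise. */
static int mock_get_state_global(const mock_comm *c, int type, int incount,
                                 int *outcount, int rank_infos[])
{
    int n = 0;
    for (int r = 0; r < MOCK_SIZE; r++) {
        if (!matches(c->bits[r], type))
            continue;
        if (n == incount) {       /* caller's buffer is too small */
            *outcount = n;
            return -1;
        }
        rank_infos[n++] = r;
    }
    *outcount = n;
    return 0;
}

/* The pattern from the message: validate, branch on the flag, then size
 * and fill the list. Returns a malloc'ed array (caller frees) or NULL. */
static int *newly_failed_ranks(mock_comm *c, int *outcount)
{
    const int type = MPI_RANK_STATE_NULL | MPI_RANK_STATE_MOD_NEW;
    int new_failures = 0, incount = 0;

    *outcount = 0;
    mock_validate_global(c, &new_failures);
    if (!new_failures)
        return NULL;              /* common case: one branch, no accessors */

    mock_get_num_state_global(c, type, &incount);
    int *rank_infos = malloc(sizeof *rank_infos * (size_t)incount);
    if (rank_infos == NULL)
        return NULL;
    if (mock_get_state_global(c, type, incount, outcount, rank_infos) != 0) {
        free(rank_infos);
        *outcount = 0;
        return NULL;
    }
    return rank_infos;
}
```

In the mock, marking a rank as dead (c.alive[3] = 0) and then calling newly_failed_ranks() returns that rank once; a second call returns NULL because the failure now carries the OLD modifier.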

Hmm.  We could combine validate and get_num_state:
    MPI_Comm_validate_global(comm, count_type, &count)
This would let the user decide which count is returned.
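
A rough mock of the combined call, again compilable without MPI. The name mock_validate_global_count(), the mock_comm struct, and the bit values are assumptions for illustration only, and matching is simplified to "every bit in count_type must be set".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values; the thread only names the constants. */
enum {
    MPI_RANK_STATE_OK      = 1 << 0,
    MPI_RANK_STATE_NULL    = 1 << 2,  /* failed, recognized */
    MPI_RANK_STATE_MOD_NEW = 1 << 8,
    MPI_RANK_STATE_MOD_OLD = 1 << 9
};

#define MOCK_SIZE 8

/* Stand-in for a communicator: ground truth plus "G" from the last validate. */
typedef struct {
    int alive[MOCK_SIZE];
    int bits[MOCK_SIZE];
} mock_comm;

static void mock_init(mock_comm *c)
{
    for (int r = 0; r < MOCK_SIZE; r++) {
        c->alive[r] = 1;
        c->bits[r] = MPI_RANK_STATE_OK;
    }
}

/* Mock of MPI_Comm_validate_global(comm, count_type, &count):
 * updates "G", then counts the ranks matching count_type. */
static void mock_validate_global_count(mock_comm *c, int count_type, int *count)
{
    assert(c != NULL && count != NULL);
    *count = 0;
    for (int r = 0; r < MOCK_SIZE; r++) {
        if (c->bits[r] & MPI_RANK_STATE_NULL)
            c->bits[r] = MPI_RANK_STATE_NULL | MPI_RANK_STATE_MOD_OLD;
        else if (!c->alive[r])
            c->bits[r] = MPI_RANK_STATE_NULL | MPI_RANK_STATE_MOD_NEW;

        if ((c->bits[r] & count_type) == count_type)
            (*count)++;
    }
}
```

Passing MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW makes count behave like the new_failures flag (nonzero only when something failed since the last call), while passing MPI_RANK_STATE_NULL alone gives the running total of recognized failures.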

-d