[Mpi3-ft] MPI_Comm_validate parameters
Darius Buntinas
buntinas at mcs.anl.gov
Mon Feb 28 14:13:02 CST 2011
On Feb 28, 2011, at 1:50 PM, Joshua Hursey wrote:
> Reworked block below
> -----------------------
Looks good
>>> We have the following states (prefix with MPI_RANK_STATE_):
>>> - OK (active)
>>> - FAILED (failed, unrecognized)
>>> - NULL (failed, recognized)
>>>
>>> We could add a few new modifiers (prefix with MPI_RANK_STATE_MOD_):
>>> - NEW (since last call to {global|local} validate)
>>> - OLD (before last call to {global|local} validate)
>>> - RECOGNIZED (maybe to replace the NULL state above?)
>>
>> I like this idea.
>
> The idea of or'ing states, or the idea of having a 'Recognized' modifier, or both?
Both. (But I prefer NULL to RECOGNIZED.)
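
For illustration, here is a rough sketch of how an or'ed 'type' of states and modifiers might be matched against a rank. The thread only names the constants, so the bit values, the two masks, and the helper rank_matches() are all assumptions made for this sketch, not part of the proposal.

```c
#include <stdbool.h>

/* Hypothetical bit values: states in the low byte, modifiers in the high byte. */
enum {
    MPI_RANK_STATE_OK      = 1 << 0,  /* active */
    MPI_RANK_STATE_FAILED  = 1 << 1,  /* failed, unrecognized */
    MPI_RANK_STATE_NULL    = 1 << 2,  /* failed, recognized */
    MPI_RANK_STATE_MOD_NEW = 1 << 8,  /* since last validate */
    MPI_RANK_STATE_MOD_OLD = 1 << 9   /* before last validate */
};

#define MOCK_STATE_MASK 0x00ff
#define MOCK_MOD_MASK   0xff00

/* Assumed matching rule: the rank's state must be one of the states named in
 * 'type'; if 'type' also names modifiers, the rank's modifier must be one of
 * them. A 'type' without modifiers matches any modifier. */
bool rank_matches(int rank_bits, int type)
{
    int want_states = type & MOCK_STATE_MASK;
    int want_mods   = type & MOCK_MOD_MASK;

    if ((rank_bits & want_states) == 0)
        return false;
    if (want_mods != 0 && (rank_bits & want_mods) == 0)
        return false;
    return true;
}
```

Under this rule, MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW selects only newly recognized failures, while MPI_RANK_STATE_FAILED|MPI_RANK_STATE_NULL selects every failed rank regardless of modifier.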
>>
>>> To determine "L" or "G" they would use the following functions:
>>> ----------------------------
>>> MPI_Comm_validate_local(comm, &num_failed)
>>> - Local operation
>>> - Update L
>>> - num_failed = |L| (both recognized and unrecognized)
>>>
>>> MPI_Comm_validate_global(comm, &num_failed)
>>> - Collective operation
>>> - Update G
>>> - Update L = G
>>> - num_failed = |L| = |G|
>>> ----------------------------
>>>
>>>
>>> Accessors have the following properties:
>>> - These are local operations
>>> - None of them modify "L" or "G"
>>> - Take an or'ed list of states and modifiers to determine 'type'
>>> - If incount = 0, then outcount = |L| or |G|, rank_infos ignored
>>>
>>> ----------------------------
>>> MPI_Comm_get_state_local(comm, type, incount, &outcount, rank_infos[])
>>> - Local operation
>>> - Returns the set of processes in "L" that match the 'type' specified
>>> - outcount = min(incount, |L|)
>>> - MPI_ERR_SIZE if incount != 0 and incount < |L|
>>>
>>> MPI_Comm_get_state_global(comm, type, incount, &outcount, rank_infos[])
>>> - Local operation
>>> - Returns the set of processes in "G" that match the 'type' specified
>>> - outcount = min(incount, |G|)
>>> - MPI_ERR_SIZE if incount != 0 and incount < |G|
>>> ----------------------------
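
The incount/outcount rules above can be sketched with a mock accessor. This is not an MPI implementation: mock_rank_info, mock_get_state(), and the MOCK_* return codes are hypothetical stand-ins, and the sketch assumes the caller has already filtered "L" or "G" down to the entries that match 'type', so that |L| or |G| here means the number of matching entries.

```c
#include <stddef.h>

enum { MOCK_SUCCESS = 0, MOCK_ERR_SIZE = 1 };  /* MOCK_ERR_SIZE stands in for MPI_ERR_SIZE */

/* Hypothetical element type for rank_infos[]. */
typedef struct {
    int rank;
    int state;  /* or'ed state and modifier bits */
} mock_rank_info;

/* 'matching' holds the n_matching entries of "L" or "G" that match 'type'. */
int mock_get_state(const mock_rank_info *matching, int n_matching,
                   int incount, int *outcount, mock_rank_info rank_infos[])
{
    /* incount == 0: report the size only; rank_infos is ignored. */
    if (incount == 0) {
        *outcount = n_matching;
        return MOCK_SUCCESS;
    }

    /* outcount = min(incount, n_matching), clamped at zero. */
    int n = (incount < n_matching) ? incount : n_matching;
    if (n < 0)
        n = 0;
    for (int i = 0; i < n; i++)
        rank_infos[i] = matching[i];
    *outcount = n;

    /* A buffer that is too small is reported after the partial copy. */
    return (incount < n_matching) ? MOCK_ERR_SIZE : MOCK_SUCCESS;
}
```

Whether the entries are still copied when the error is returned is not stated in the thread; the sketch copies what fits, which is one reading of outcount = min(incount, |L|).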
>>>
>>>
>>> So an application can do something like:
>>> ------------
>>> MPI_Comm_validate_global(comm, &num_failed_start);
>>> /* Do work */
>>> MPI_Comm_validate_global(comm, &num_failed_end);
>>>
>>> if( num_failed_start < num_failed_end ) { /* something failed */
>>> incount = 0;
>>> MPI_Comm_get_state_global(comm,
>>> MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
>>> incount, &outcount, NULL);
>>> rank_infos = malloc(... * outcount);
>>> incount = outcount;
>>> MPI_Comm_get_state_global(comm,
>>> MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
>>> incount, &outcount, rank_infos);
>>> }
>>> ------------
>>>
>>> Instead of having the 'if incount = 0' rule, we could just introduce a new function like:
>>> ----------------------------
>>> MPI_Comm_get_num_state_local(comm, type, &count);
>>> MPI_Comm_get_num_state_global(comm, type, &count);
>>
>> In that case we can even replace num_failed in the comm_validate functions with a flag: new_failures. Then use the above to get the counts.
>
> Or even better, eliminate the second argument from the MPI_Comm_validate_{local|global}, and just pass the communicator to it - similar to MPI_Barrier. Since the accessor functions are always related to the last update call there is no real need (other than shorthand) to have the additional parameter.
I considered that, but that would require two calls to determine if anything failed. Replacing the count with an are_there_new_failures flag would solve that.
>
> By removing the count parameter from the MPI_Comm_validate_{local|global} we get out of the business of deciding which count to return, and let the user specify it explicitly.
>
> The example would now expand out a bit to be:
> ------------
> MPI_Comm_validate_global(comm);
> MPI_Comm_get_num_state_global(comm, STATE_NULL|MOD_NEW, &num_failed_start);
> /* Do work */
> MPI_Comm_validate_global(comm);
> MPI_Comm_get_num_state_global(comm, STATE_NULL|MOD_NEW, &num_failed_end);
>
> if( num_failed_start < num_failed_end ) { /* something failed */
> incount = num_failed_end;
> rank_infos = malloc(... * incount);
> MPI_Comm_get_state_global(comm,
> MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
> incount, &outcount, rank_infos);
> }
> ------------
Replacing the count with a flag would look like the code below. In the common, non-error case you just take a single branch. It's less about performance (validate_global is collective anyway) than about convenience for the programmer.
MPI_Comm_validate_global(comm, &new_failures);
/* Do work */
MPI_Comm_validate_global(comm, &new_failures);
if( new_failures ) { /* something failed */
    MPI_Comm_get_num_state_global(comm, STATE_NULL|MOD_NEW, &num_failed_end);
    incount = num_failed_end;
    rank_infos = malloc(... * incount);
    MPI_Comm_get_state_global(comm,
                              MPI_RANK_STATE_NULL|MPI_RANK_STATE_MOD_NEW,
                              incount, &outcount, rank_infos);
}
Hmm. We could combine validate and get_num_state:
MPI_Comm_validate_global(comm, count_type, &count)
This would let the user decide what count to return.
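
A rough mock of what that combined call might do, as a single-process simulation rather than real MPI. Everything here is an assumption for illustration: the mock_comm struct, the 'alive' array standing in for actual failure detection, the MOCK_* bit values, and the rule that a failure is NEW on the first validate that sees it and OLD afterwards.

```c
/* Hypothetical bit values: states in the low byte, modifiers in the high byte. */
enum {
    MOCK_OK      = 1 << 0,
    MOCK_NULL    = 1 << 2,   /* failed, recognized */
    MOCK_MOD_NEW = 1 << 8,
    MOCK_MOD_OLD = 1 << 9
};
#define MOCK_STATE_MASK 0x00ff
#define MOCK_MOD_MASK   0xff00

typedef struct {
    int  nranks;
    int *alive;  /* mock ground truth: nonzero if the rank is still alive */
    int *g;      /* "G": state and modifier bits as of the last validate */
} mock_comm;

/* Refresh "G", then count the ranks whose bits match count_type. */
int mock_validate_global(mock_comm *comm, int count_type, int *count)
{
    int want_mods = count_type & MOCK_MOD_MASK;
    int n = 0;

    for (int r = 0; r < comm->nranks; r++) {
        int was_failed = (comm->g[r] & MOCK_NULL) != 0;

        if (comm->alive[r])
            comm->g[r] = MOCK_OK;
        else
            comm->g[r] = MOCK_NULL | (was_failed ? MOCK_MOD_OLD : MOCK_MOD_NEW);

        /* Match on state; apply the modifier filter only if one was given. */
        if ((comm->g[r] & count_type & MOCK_STATE_MASK) != 0 &&
            (want_mods == 0 || (comm->g[r] & want_mods) != 0))
            n++;
    }
    *count = n;
    return 0;
}
```

With count_type set to MOCK_NULL|MOCK_MOD_NEW, a nonzero count plays the same role as the new_failures flag, and passing MOCK_NULL alone gives the total number of recognized failures.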
-d