[Mpi3-ft] MPI_Comm_validate parameters

Joshua Hursey jjhursey at open-mpi.org
Fri Feb 11 15:39:32 CST 2011

The topic of the parameters to MPI_Comm_validate has come up a few times now. The point of this thread is to consolidate the discussion, and have us work towards a solution on this specific item.

Problem A:
So the core problem is that if there are 100 failures in the communicator, but the user only supplies a buffer of size 10, then incount=10 and outcount can only return a max value of 10. It would be useful for the user to know that there are more results that could have been returned.

So we could extend the interface to have a 'fullcount':
 incount: Size of the 'rank_infos' array
 outcount: Number of elements filled in the array
 fullcount: Total number of elements that could have been filled in

If 'incount' is 0, then the rank_infos array can be NULL (it is not used) and outcount is not modified, but fullcount will be set to the total number of failures in the associated communicator.

That would allow the application to call the function as in the example below:

MPI_Rank_info *rank_infos = NULL;
int incount, outcount, fullcount, ret;

do {
    /* Query-only call: incount = 0, so only fullcount is set */
    ret = MPI_Comm_validate(comm, 0, NULL, &fullcount, NULL);
    if( NULL != rank_infos ) { free(rank_infos); }
    rank_infos = (MPI_Rank_info*)malloc(sizeof(MPI_Rank_info) * fullcount);
    incount = fullcount;
    ret = MPI_Comm_validate(comm, incount, &outcount, &fullcount, rank_infos);
    /* New failures may have arrived between the two calls */
    if( outcount < fullcount ) { ret = MPI_ERR_SIZE; }
} while( ret == MPI_ERR_SIZE );

Problem B:
Currently MPI_Comm_validate() will return the -full- list of -both- recognized and unrecognized failures in the communicator. It would be useful if the user had a way to control the range of failures returned by this function. Of particular note is the fact that calling MPI_Comm_validate_all() automatically clears outstanding unrecognized failures in the communicator, so the application has to keep a separate list of previously known failed ranks in order to determine the 'newly' recognized failed processes.

It was identified that the following sets would be useful to the application:
 - All STATE_FAILED and STATE_NULL (current setting, proposed 'default')
 - All -new- STATE_NULL (since the last call to MPI_Comm_validate())
 - All -new- STATE_FAILED (since the last call to MPI_Comm_validate())

So it would be useful if the user could express the subset they wish to have returned to them. We could extend the interface with an 'info' argument that would allow the user to specify any one of the above sets to the function. The MPI implementation will be required to provide the functionality, or return MPI_ERR_KEYVAL if it cannot do so.

Problem C:
It was mentioned that the scalability of the 'rank_infos' array argument may become a concern for large scale, high fault environments. The suggestion was to move the argument to be a 'group' object that can be internally managed.

Problem D:
It was mentioned that there is a window of time between MPI_Comm_validate_all() and a following MPI_Comm_validate() in which a process may fail. MPI_Comm_validate() will then return the 'full' (see Problem A) set of failures, which may be larger than the set that the 'outcount' of MPI_Comm_validate_all() indicated. The question was whether users will want only the set of failures decided upon by the MPI_Comm_validate_all() call, ignoring the new failures.

It is unclear to me whether this situation is useful in general, and I think that a solution to Problem A will be sufficient for many/most/all applications.

However, if this is a useful scenario, we could follow the example of MProbe and create an object that can be carried between the two functions. Note that this puts an additional burden on the MPI implementation to manage these subsets for every call to MPI_Comm_validate_all() - but if it is a 'new' object then we can insist that the user free it when finished with it.

Alternatively, we could provide a version of MPI_Comm_validate_all() that is extended to include the parameters of MPI_Comm_validate() so that in one operation the user is provided the full list of agreed upon failed processes. There is some trouble in light of Problem B, but we may be able to find a way around it.

I think that covers what I was aware of. What do folks think about these problems and proposed solutions?


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
