[Mpi3-ft] MPI_Comm_validate parameters

Darius Buntinas buntinas at mcs.anl.gov
Fri Feb 25 13:58:59 CST 2011


On the last concall, we discussed the semantics of the comm validate functions, and how to query for failed processes.  Josh asked me to capture these ideas and send it out to the list.  Below is my attempt at this.  Please let me know if I missed anything from the discussion.

Comments are encouraged.

-d


Processes' Views of Failed Processes
------------------------------------

For each communicator, with the set of processes C = {0, ...,
size_of_communicator-1}, consider three sets of failed processes: 

    F, the set of actually failed processes in the communicator; 

    L, the set of processes of the communicator known to be failed by
    the local process; and 

    G, the set of processes of the communicator that the local process
    knows that every process in the communicator knows to be failed.

Note that C >= F >= L >= G, where ">=" denotes set containment
(superset).  Because a process may have an imperfect view of the
state of the system, it may never know F, so we allow it to query
only for L or G.

Determining the Sets of Failed Processes
----------------------------------------

MPI provides the following functions to determine L and G.  [These
names can, and probably should, change.]

    MPI_Comm_validate_local(comm, &num_failed, &num_new) -- This is a
        local function that checks for failed processes and updates
        the set L.  Note that this check for failed processes may be
        imperfect, in that even if no additional processes fail L
        might not be the same as F.  When the function returns,
        num_failed = |L| and num_new is the number of newly
        discovered failed processes; i.e., if L' = L before
        MPI_Comm_validate_local was called, then num_new = |L|-|L'|.

    MPI_Comm_validate_global(comm, &num_failed, &num_new) -- This is
        collective.  Each process in the communicator checks for
        failed processes, performs an all-reduce to take the union of
        the sets of processes each process knows to have failed, then
        updates the sets G and L.  After this function returns, L = G.
        The parameters num_failed and num_new are set the same as in
        MPI_Comm_validate_local.

Note that the set L will not change outside of a call to
MPI_Comm_validate_local or MPI_Comm_validate_global, and the set G
will not change outside of a call to MPI_Comm_validate_global.

Querying the Sets of Failed Processes
-------------------------------------

MPI also provides functions to allow the user to query L and G.

    MPI_Comm_get_failed_local(comm, failed_type, failed_array) -- This
        is a local function.  The parameter failed_type is one of
        MPI_FAILED_ALL, MPI_FAILED_NEW.

        When failed_type = MPI_FAILED_ALL, the set L is returned in
        failed_array.  The size of the array returned is the value
        returned in num_failed in the last call to
        MPI_Comm_validate_local or MPI_Comm_validate_global.  The
        caller of the function must ensure that failed_array is large
        enough to hold num_failed integers.

        When failed_type = MPI_FAILED_NEW, the set of newly discovered
        failed processes is returned in failed_array.  I.e., if L' = L
        before MPI_Comm_validate_local or MPI_Comm_validate_global was
        called, then failed_array contains the list of processes in
        the set L - L'.  It is the caller's responsibility to ensure
        that the failed_array is large enough to hold num_new
        integers, where num_new is the value returned the last time
        MPI_Comm_validate_local or MPI_Comm_validate_global was
        called.

    MPI_Comm_get_failed_global(comm, failed_type, failed_array) -- This
        is a local function.  The parameter failed_type is one of
        MPI_FAILED_ALL, MPI_FAILED_NEW.

        When failed_type = MPI_FAILED_ALL, the set G is returned in
        failed_array.  The size of the array is the value returned in
        num_failed in the last call to MPI_Comm_validate_global.  The
        caller of the function must ensure that the failed_array is
        large enough to hold num_failed integers.

        When failed_type = MPI_FAILED_NEW, the set of newly discovered
        failed processes is returned in failed_array.  I.e., if G' = G
        before MPI_Comm_validate_global was called, then failed_array
        contains the list of processes in the set G - G'.  It is the
        caller's responsibility to ensure that the failed_array is
        large enough to hold num_new integers, where num_new is the
        value returned the last time MPI_Comm_validate_global was
        called.

[Initially, I included a num_uncleared variable in
MPI_Comm_validate_local to return the number of uncleared processes,
but since the number of cleared processes does not change in that
function, it seemed redundant.  Instead, I'm thinking we should have a
separate mechanism to query the number and set of cleared/uncleared
processes to go along with the functions for clearing a failed
process.]




Implementation Notes: Memory Requirements for Maintaining the Sets of
Failed Processes
---------------------------------------------------------------------

One set G is needed per communicator.  The set G can be stored in a
distributed fashion across the processes of the communicator to reduce
the storage needed by each individual process.  [What happens if some
of the processes storing the distributed array fail?  We can store
them with some redundancy and tolerate K failures.  We can also
recompute the set, but we'd need the old value to be able to return
the FAILED_NEW set.]

L cannot be stored in a distributed way, because each process may have
a different set L.  However, since L >= G, L can be represented as the
union of G and L', so only L' needs to be stored locally.  Presumably
L' would be small.  Note that the implementation is free to let L' be
a subset of the processes it has connected to or tried to connect to.
So the set L' would be about as large as the number of connected
processes, and can be represented as a state variable (e.g., UP or
DOWN) in the data structure of connected processes.  Of course,
pathological cases can be found, but these would require (1) that
there are a very large number of failed processes, and (2) that the
process has tried to connect to a very large number of them.

The storage requirements to keep track of the set of cleared processes
are similar to those for storing L.  Since G is a subset of the set of
cleared processes, which itself is a subset of L, the set of cleared
processes can be represented as an additional state in the connected
processes data structure (e.g., UP, DOWN or CLEARED).  Note that after
a global validate, L = G = the set of cleared processes, so calling
the global validate function could potentially free up storage.  A
note about this and the memory requirements of clearing processes
locally can be included in advice to users.








More information about the mpiwg-ft mailing list