[Mpi3-ft] MPI_Comm_validate parameters
Darius Buntinas
buntinas at mcs.anl.gov
Fri Feb 25 13:58:59 CST 2011
On the last concall, we discussed the semantics of the comm validate functions, and how to query for failed processes. Josh asked me to capture these ideas and send them out to the list. Below is my attempt at this. Please let me know if I missed anything from the discussion.
Comments are encouraged.
-d
Processes' Views of Failed Processes
------------------------------------
For each communicator, with the set of processes C = {0, ...,
size_of_communicator-1}, consider three sets of failed processes:
F, the set of actually failed processes in the communicator;
L, the set of processes of the communicator known to be failed by
the local process; and
G, the set of processes of the communicator that the local process
knows that every process in the communicator knows to be failed.
Note that C >= F >= L >= G, where >= means "is a superset of".
Because the process may have an imperfect view of the state of the
system, a process may never know F, so we allow the process to query
only for L or G.
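The containment chain above can be checked mechanically. The following is a minimal sketch, not part of the proposal: it assumes a communicator of at most 64 processes and represents each set as a bitmask in which bit r stands for rank r. The names rank_set, singleton, is_subset and views_consistent are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation: bit r is set when rank r is in the set.
 * Limited to communicators of at most 64 processes for brevity. */
typedef uint64_t rank_set;

/* Returns the set containing only the given rank. */
rank_set singleton(int rank)
{
    assert(rank >= 0 && rank < 64);
    return (rank_set)1 << rank;
}

/* True when every member of sub is also a member of super. */
bool is_subset(rank_set sub, rank_set super)
{
    return (sub & ~super) == 0;
}

/* Checks the invariant C >= F >= L >= G from the text. */
bool views_consistent(rank_set C, rank_set F, rank_set L, rank_set G)
{
    return is_subset(F, C) && is_subset(L, F) && is_subset(G, L);
}
```

For example, with eight processes where ranks 1, 2 and 3 have actually failed, the local process knows about ranks 1 and 2, and only rank 1 is known to everyone, the invariant holds.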
Determining the Sets of Failed Processes
----------------------------------------
MPI provides the following functions to determine L and G. [These
names can, and probably should, change.]
MPI_Comm_validate_local(comm, &num_failed, &num_new) -- This is a
local function that checks for failed processes and updates
the set L. Note that this check for failed processes may be
imperfect, in that even if no additional processes fail, L
might not be the same as F. When the function returns,
num_failed = |L| and num_new will be the number of newly
discovered failed processes, i.e., if L' = L before
MPI_Comm_validate_local was called, then num_new = |L|-|L'|.
MPI_Comm_validate_global(comm, &num_failed, &num_new) -- This is
collective. Each process in the communicator checks for
failed processes, performs an all-reduce to take the union of
all the processes known to have failed by each process, then
updates the sets G and L. After this function returns L = G.
The parameters num_failed and num_new are set the same as in
MPI_Comm_validate_local.
Note that the set L will not change outside of a call to
MPI_Comm_validate_local or MPI_Comm_validate_global, and the set G
will not change outside of a call to MPI_Comm_validate_global.
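To make the update rules concrete, here is a small single-address-space model of the two validate functions. It is only a sketch under several assumptions: sets are 64-bit masks, the detected argument stands in for whatever the failure detector reports, the array passed to the global function holds only surviving processes, and the all-reduce is modelled as a plain loop. All names (proc_set, proc_view, model_validate_local, model_validate_global) are hypothetical and are not the proposed MPI interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t proc_set;   /* bit r set means rank r is in the set */

/* One surviving process's view of a communicator. */
struct proc_view {
    proc_set L;   /* failures known locally */
    proc_set G;   /* failures known to be known by every process */
};

/* Number of ranks in a set. */
int set_size(proc_set s)
{
    int n = 0;
    while (s) {
        s &= s - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Model of the local validate: folds newly detected failures into L. */
void model_validate_local(struct proc_view *v, proc_set detected,
                          int *num_failed, int *num_new)
{
    assert(v && num_failed && num_new);
    proc_set before = v->L;               /* L' in the text */
    v->L |= detected;
    *num_failed = set_size(v->L);
    *num_new = set_size(v->L & ~before);  /* equals |L| - |L'| */
}

/* Model of the global validate: unions every L, then sets L = G.
 * num_new[i] receives the count of failures new to process i.
 * Returns num_failed, which is the same for every process. */
int model_validate_global(struct proc_view *views, size_t n, int *num_new)
{
    proc_set all = 0;
    for (size_t i = 0; i < n; i++)
        all |= views[i].L;                /* stands in for the all-reduce */
    for (size_t i = 0; i < n; i++) {
        num_new[i] = set_size(all & ~views[i].L);
        views[i].L = all;
        views[i].G = all;
    }
    return set_size(all);
}
```

In this model G is only ever written by model_validate_global, and L only by one of the two functions, which matches the note above about when the sets may change.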
Querying the Sets of Failed Processes
-------------------------------------
MPI also provides functions to allow the user to query L and G.
MPI_Comm_get_failed_local(comm, failed_type, failed_array) -- This
is a local function. The parameter failed_type is one of
MPI_FAILED_ALL, MPI_FAILED_NEW.
When failed_type = MPI_FAILED_ALL, the set L is returned in
failed_array. The size of the array returned is the value
returned in num_failed in the last call to
MPI_Comm_validate_local or MPI_Comm_validate_global. The
caller of the function must ensure that failed_array is large
enough to hold num_failed integers.
When failed_type = MPI_FAILED_NEW, the set of newly discovered
failed processes is returned in failed_array. I.e., if L' = L
before MPI_Comm_validate_local or MPI_Comm_validate_global was
called, then failed_array contains the list of processes in
the set L - L'. It is the caller's responsibility to ensure
that the failed_array is large enough to hold num_new
integers, where num_new is the value returned the last time
MPI_Comm_validate_local or MPI_Comm_validate_global was
called.
MPI_Comm_get_failed_global(comm, failed_type, failed_array) -- This
is a local function. The parameter failed_type is one of
MPI_FAILED_ALL, MPI_FAILED_NEW.
When failed_type = MPI_FAILED_ALL, the set G is returned in
failed_array. The size of the array is the value returned in
num_failed in the last call to MPI_Comm_validate_global. The
caller of the function must ensure that the failed_array is
large enough to hold num_failed integers.
When failed_type = MPI_FAILED_NEW, the set of newly discovered
failed processes is returned in failed_array. I.e., if G' = G
before MPI_Comm_validate_global was called, then failed_array
contains the list of processes in the set G - G'. It is the
caller's responsibility to ensure that the failed_array is
large enough to hold num_new integers, where num_new is the
value returned the last time MPI_Comm_validate_global was
called.
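The difference between the two failed_type values can be sketched as follows. This is an illustration under assumptions, not the proposed interface: sets are 64-bit masks, the view keeps both the current set and its value before the last validate call, and the function returns the number of ranks written, which the proposed functions do not do. The names rank_mask, failed_kind, failed_view and get_failed are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t rank_mask;   /* bit r set means rank r is in the set */

enum failed_kind { FAILED_ALL, FAILED_NEW };

struct failed_view {
    rank_mask current;    /* the set after the most recent validate call */
    rank_mask previous;   /* the set before that call (L' or G' in the text) */
};

/* Writes the requested ranks into failed_array in ascending order.
 * The caller must size failed_array for num_failed (FAILED_ALL) or
 * num_new (FAILED_NEW) entries. Returning the count is a convenience
 * of this sketch only. */
int get_failed(const struct failed_view *v, enum failed_kind kind,
               int *failed_array)
{
    assert(v && failed_array);
    rank_mask set = (kind == FAILED_ALL)
                        ? v->current
                        : (v->current & ~v->previous);
    int count = 0;
    for (int rank = 0; rank < 64; rank++) {
        if (set & ((rank_mask)1 << rank))
            failed_array[count++] = rank;
    }
    return count;
}
```

A caller would first run a validate function, allocate an array of num_failed or num_new integers from its results, and then query. The same shape applies to both the local and the global query, since they differ only in which set the view tracks.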
[Initially, I included a num_uncleared variable in
MPI_Comm_validate_local to return the number of uncleared processes,
but since the number of cleared processes does not change in that
function, it seemed redundant. Instead, I'm thinking we should have a
separate mechanism to query the number and set of cleared/uncleared
processes to go along with the functions for clearing a failed
process.]
Implementation Notes: Memory Requirements for Maintaining the Sets of
Failed Processes
---------------------------------------------------------------------
One set G is needed per communicator. The set G can be stored in a
distributed fashion across the processes of the communicator to reduce
the storage needed by each individual process. [What happens if some
of the processes storing the distributed array fail? We can store
them with some redundancy and tolerate K failures. We can also
recompute the set, but we'd need the old value to be able to return
the FAILED_NEW set.]
L cannot be stored in a distributed way, because each process may have
a different set L. However, since L >= G, L can be represented as the
union of G and a local remainder L' (for example, L - G; this is a
different use of L' from the "previous value of L" above), so only L'
needs to be stored locally. Presumably L' would be small. Note that
the implementation is free to let L' be
a subset of the processes it has connected to or tried to connect to.
So the set L' would be about as large as the number of connected
processes, and can be represented as a state variable (e.g., UP, or
DOWN) in the data structure of connected processes. Of course
pathological cases can be found, but these would require (1) that
there are a very large number of failed processes, and (2) that the
process has tried to connect to a very large number of them.
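One possible reading of this layout, sketched under assumptions: G is visible here as a plain array indexed by rank (in practice it could be distributed, as discussed above), and the local remainder is just a state field in the table of connected processes. The names peer_state, peer, comm_view and known_failed_locally are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum peer_state { PEER_UP, PEER_DOWN };

/* One entry per process this process has connected to or tried to. */
struct peer {
    int rank;
    enum peer_state state;
};

struct comm_view {
    int comm_size;
    const bool *in_G;     /* assumed lookup for G, indexed by rank */
    struct peer *peers;   /* connection table carrying the local remainder */
    size_t num_peers;
};

/* Membership test for L, computed as the union of G and the DOWN peers. */
bool known_failed_locally(const struct comm_view *v, int rank)
{
    assert(rank >= 0 && rank < v->comm_size);
    if (v->in_G[rank])
        return true;
    for (size_t i = 0; i < v->num_peers; i++) {
        if (v->peers[i].rank == rank)
            return v->peers[i].state == PEER_DOWN;
    }
    return false;   /* never contacted, so no local knowledge of failure */
}
```

The per-process cost beyond G is then proportional to the number of connections, not to the size of the communicator.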
The storage requirements to keep track of the set of cleared processes
are similar to those of storing L. Since G is a subset of the set of
cleared processes which itself is a subset of L, the set of cleared
processes can be represented as an additional state in the connected
processes data structure (e.g., UP, DOWN or CLEARED). Note that after
a global validate, L = G = the set of cleared processes, so calling
the global validate function could potentially free up storage. A
note about this and the memory requirements of clearing processes
locally can be included in advice to users.
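The storage saving after a global validate could look like the sketch below. It assumes the three-state connection table described above and assumes that, once L = G = the set of cleared processes, any entry that is not UP is already covered by G and can be dropped. The names conn_state, conn and compact_after_global_validate are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum conn_state { CONN_UP, CONN_DOWN, CONN_CLEARED };

struct conn {
    int rank;
    enum conn_state state;
};

/* Keeps only the UP entries, preserving their order, and returns the
 * new number of entries. Intended to run right after a global validate,
 * when every DOWN or CLEARED rank is assumed to be recorded in G. */
size_t compact_after_global_validate(struct conn *conns, size_t n)
{
    assert(conns != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (conns[i].state == CONN_UP)
            conns[kept++] = conns[i];
    }
    return kept;
}
```

Whether an implementation would actually discard these entries, or keep them to avoid reconnect attempts, is a design choice this sketch does not settle.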