[Mpi3-ft] New version of the RTS proposal
buntinas at mcs.anl.gov
Tue Nov 8 14:03:04 CST 2011
I saw this sentence in the proposal:
\MPI/ guarantees that eventually all processes in the \MPI/
universe will become aware of all process failures.
I think we only want to say that each process will eventually become
aware of all failures of processes it is _connected_ to. Two MPI jobs can technically
belong to the same universe, but they don't need to be aware of each
other's failures unless they're connected. Besides, unless they were
connected, there would be no way to identify the failed processes of
one job to the processes of the other job.
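
To make the scoping concrete, here is a toy model of the narrower guarantee. It is plain Python, not MPI code, and everything in it (the Universe class, add_job, connect, must_learn_of_failure, the process labels) is hypothetical and made up for illustration. It assumes that "connected" is transitive, i.e. two processes are connected if there is a direct or indirect communication path between them, and that all processes of one job start out connected through their shared MPI_COMM_WORLD.

```python
class Universe:
    """Toy model: tracks which processes are (transitively) connected."""

    def __init__(self):
        # Union-find parent map; each connected group has one root.
        self._parent = {}

    def _find(self, p):
        # Walk to the root of p's group, shortening the path as we go.
        while self._parent[p] != p:
            self._parent[p] = self._parent[self._parent[p]]
            p = self._parent[p]
        return p

    def connect(self, a, b):
        """Model a connection between a and b (e.g. via connect/accept)."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self._parent[rb] = ra

    def add_job(self, procs):
        """Register a job; its processes are connected to each other."""
        procs = list(procs)
        for p in procs:
            self._parent.setdefault(p, p)
        for p in procs[1:]:
            self.connect(procs[0], p)

    def must_learn_of_failure(self, failed):
        """Processes that would have to become aware that `failed` died."""
        root = self._find(failed)
        return {p for p in list(self._parent)
                if p != failed and self._find(p) == root}


u = Universe()
u.add_job(["A0", "A1"])  # job A
u.add_job(["B0", "B1"])  # job B, same universe, not connected to A

# Only job A has to learn about A0's failure.
print(sorted(u.must_learn_of_failure("A0")))  # ['A1']

# Once the jobs connect, job B has to learn about it too.
u.connect("A1", "B0")
print(sorted(u.must_learn_of_failure("A0")))  # ['A1', 'B0', 'B1']
```

Before the connect call, job B has no handle by which it could even name A0, which is the point of the last sentence above.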
As an aside, I did some grepping for "universe" in the standard and
found that the term is not actually well defined. It can refer to
communication within a communicator (p29), to an implementation-defined
scope of a port name (p320), or to an implementation-defined set of
processes that can communicate with each other. Then there's
MPI_UNIVERSE_SIZE, which is the total number of processes that "can
usefully be started" (p308). That implies the definition of universe
that I think most people use, namely the set of all potential
processes that some process can be made aware of or be connected to.