[Mpi3-ft] New version of the RTS proposal
jjhursey at open-mpi.org
Tue Nov 8 14:47:53 CST 2011
On Tue, Nov 8, 2011 at 3:03 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> I saw this sentence in the proposal:
> \MPI/ guarantees that eventually all processes in the \MPI/
> universe will become aware of all process failures.
> I think we only want to say all processes will become aware of all
> failures of _connected_ processes. Two MPI jobs can technically
> belong to the same universe, but they don't need to be aware of each
> other's failures unless they're connected. Besides, unless they were
> connected, there would be no way to identify the failed processes of
> one job to the processes of the other job.
I agree that adding _connected_ here helps clarify this statement.
> As an aside, I did some grepping for "universe" in the standard, and
> found that universe is actually not well defined. It can refer to
> communication within a communicator (p29), an implementation defined
> scope of a port name (p320) and an implementation defined set of
> processes that can communicate with each other. Then there's
> MPI_UNIVERSE_SIZE which is the number of total processes that "can
> usefully be started," (p308) which implies the definition of universe
> that I think most people use, namely, the set of all potential
> processes that some process can be made aware of / be connected to.
Hmm... This is an interesting point. I think we are OK using universe
in the context of all processes that a single process may be connected
to.
In terms of the failure detector statement, that would mean that MPI
notifies a process of process failures:
 - in its own MPI_COMM_WORLD
 - in a remote group once the process joins with it (via connect/accept)
But MPI need not notify a process of the failure of a process that it
is not connected to (e.g., two disjoint spawned groups).
So at the point where an inter-communicator is created between two
disjoint MPI_COMM_WORLDs, they must share failure information with
one another regarding the local/remote groups. But if the two disjoint
MPI_COMM_WORLDs never connect, then they need not know about one
another.
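
To make the rule above concrete, here is a rough sketch as a toy Python
model (not MPI code). The Universe class and its method names are
hypothetical, invented only for illustration, and the model assumes that
"connected" is transitive: a process must eventually learn of a failure
only if the failed process is in its connected set.

```python
class Universe:
    """Toy model of which processes are connected (hypothetical, not an MPI API)."""

    def __init__(self):
        self.parent = {}     # process name -> union-find parent
        self.failed = set()  # names of processes that have failed

    def spawn_world(self, procs):
        """Register a new MPI_COMM_WORLD; its members are mutually connected."""
        procs = list(procs)
        for p in procs:
            self.parent[p] = procs[0]

    def _find(self, p):
        """Return the representative of the connected set containing p."""
        while self.parent[p] != p:
            self.parent[p] = self.parent[self.parent[p]]  # path halving
            p = self.parent[p]
        return p

    def connect(self, a, b):
        """Model connect/accept joining the sets that contain a and b."""
        self.parent[self._find(a)] = self._find(b)

    def fail(self, p):
        """Record that process p has failed."""
        self.failed.add(p)

    def must_learn_of(self, observer):
        """Failures that observer must eventually be made aware of."""
        root = self._find(observer)
        return {f for f in self.failed if self._find(f) == root}


if __name__ == "__main__":
    u = Universe()
    u.spawn_world(["a0", "a1"])   # first MPI_COMM_WORLD
    u.spawn_world(["b0", "b1"])   # second, disjoint MPI_COMM_WORLD
    u.fail("b1")
    print(u.must_learn_of("a0"))  # disjoint worlds: nothing to report
    u.connect("a0", "b0")         # inter-communicator created
    print(u.must_learn_of("a0"))  # now the failure of b1 must be shared
```

In this model, a failure that happened before the two worlds connected
is still reported afterwards, which matches the idea that failure
information is shared at the point where the inter-communicator is
created.
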
What do you think?
Postdoctoral Research Associate
Oak Ridge National Laboratory