[Mpi3-ft] New version of the RTS proposal

Josh Hursey jjhursey at open-mpi.org
Tue Nov 8 14:47:53 CST 2011

On Tue, Nov 8, 2011 at 3:03 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> I saw this sentence in the proposal:
>    \MPI/ guarantees that eventually all processes in the \MPI/
>    universe will become aware of all process failures.
> I think we only want to say all processes will become aware of all
> failures of _connected_ processes.  Two MPI jobs can technically
> belong to the same universe, but they don't need to be aware of each
> other's failures unless they're connected.  Besides, unless they were
> connected, there would be no way to identify the failed processes of
> one job to the processes of the other job.

I agree that adding _connected_ here helps clarify this statement.

> As an aside, I did some grepping for "universe" in the standard, and
> found that universe is actually not well defined.  It can refer to
> communication within a communicator (p29), an implementation defined
> scope of a port name (p320) and an implementation defined set of
> processes that can communicate with each other.  Then there's
> MPI_UNIVERSE_SIZE which is the number of total processes that "can
> usefully be started," (p308) which implies the definition of universe
> that I think most people use, namely, the set of all potential
> processes that some process can be made aware of / be connected to.

Hmm... This is an interesting point. I think we are OK using universe
in the context of all processes that a single process may be connected
to.

In terms of the failure detector statement, that would mean that MPI
notifies a process of process failures:
 - in its own MPI_COMM_WORLD
 - in a remote group once the process joins with it (via connect/accept)
But MPI need not notify the process of the failure of a process it is
not connected to (e.g., 2 disjoint spawned groups).

So at the point where an inter-communicator is created between two
disjoint MPI_COMM_WORLDs, they must share failure information with
one another regarding the local/remote groups. But if the two disjoint
MPI_COMM_WORLDs never collide, they need not know about one another.
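
To make the intended scope concrete, here is a toy Python model of the
rule (not MPI code, and not proposal text). The names Universe,
spawn_job, connect, fail and must_eventually_know are hypothetical,
invented for this sketch. It also assumes that only directly connected
jobs share failure information; whether connectivity should be treated
transitively is left open here.

```python
from collections import defaultdict


class Universe:
    """Toy model: each job stands for one MPI_COMM_WORLD."""

    def __init__(self):
        self.job_of = {}               # process name -> job name
        self.links = defaultdict(set)  # job name -> jobs it is connected to
        self.failed = set()            # names of failed processes

    def spawn_job(self, job, procs):
        # Every process is connected to its own MPI_COMM_WORLD.
        for p in procs:
            self.job_of[p] = job
        self.links[job].add(job)

    def connect(self, job_a, job_b):
        # Models creating an inter-communicator (connect/accept).
        # Assumption: only direct connections are recorded.
        self.links[job_a].add(job_b)
        self.links[job_b].add(job_a)

    def fail(self, proc):
        self.failed.add(proc)

    def must_eventually_know(self, observer):
        # Failures the observer must eventually be notified of:
        # those in its own job or in a job it is connected to.
        visible = self.links[self.job_of[observer]]
        return {p for p in self.failed
                if p != observer and self.job_of[p] in visible}


u = Universe()
u.spawn_job("A", ["a0", "a1"])
u.spawn_job("B", ["b0", "b1"])
u.spawn_job("C", ["c0", "c1"])   # never connects to A or B
u.connect("A", "B")
u.fail("a1")
u.fail("c0")

print(sorted(u.must_eventually_know("b0")))  # ['a1']
print(sorted(u.must_eventually_know("c1")))  # ['c0']
```

In this sketch b0 learns about a1 (same inter-communicator) but never
about c0, and c1 learns only about c0 in its own MPI_COMM_WORLD.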

What do you think?

-- Josh

> -d
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
