[Mpi3-ft] MPI_ANY_SOURCE ... again...

Josh Hursey jjhursey at open-mpi.org
Thu Jan 26 09:20:37 CST 2012

If we were to change the MPI_ANY_SOURCE semantics to only return when
completed/matched (the suggested proposal modification), would this make
life more difficult for an intermediate layer virtualizing the collectives?

I am open to the idea that calling a collective validate() also re-enables
the posting of MPI_ANY_SOURCE. The reason we did not do that in the current
proposal is that we were uncertain whether someone would want to do a
validate, but not re-enable the posting of MPI_ANY_SOURCE receives. Since
re-enabling ANY_SOURCE is a local/quick operation, there was no performance
argument to be made. Though I think you pose an interesting programmability
point.

So can you elaborate a bit on your example (I just want to make sure I
fully understand)? Would such an intermediate layer be using MPI_ANY_SOURCE
in its collective operations and depend on the
return-when-new-proc-failure semantic? If so, the intermediate library would
have to re-enable ANY_SOURCE, then mask the semantics for user-initiated
point-to-point operations over the same communicator. If we added the
reenable_any_source semantics to the validate, then the intermediate library
would not have to virtualize the point-to-point communication. Is that on
the right track?
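
To make the scenario concrete, here is a rough sketch of the pattern I am
imagining. The calls comm_validate() and comm_reenable_any_source() below
are placeholders for the proposal's operations; their names, signatures,
and recovery steps are assumptions for illustration only:

#include <mpi.h>

/* Placeholders for the proposal's operations; real names/signatures
 * may differ. */
extern int comm_validate(MPI_Comm comm);
extern int comm_reenable_any_source(MPI_Comm comm);

/* A tuned "gather-like" step: collect one message from each of
 * nsenders peers using MPI_ANY_SOURCE, relying on the
 * return-when-new-proc-failure semantic to break out on failure. */
int tuned_gather_recv(void *buf, int count, MPI_Datatype dtype,
                      int nsenders, int tag, MPI_Comm comm)
{
    MPI_Status status;
    int received = 0;
    while (received < nsenders) {
        /* A real gather would offset buf by status.MPI_SOURCE; that
         * bookkeeping is elided here. */
        int rc = MPI_Recv(buf, count, dtype, MPI_ANY_SOURCE,
                          tag, comm, &status);
        if (rc != MPI_SUCCESS) {
            /* A new failure in comm made the ANY_SOURCE receive
             * return. Recover inside the library, then re-enable
             * ANY_SOURCE so the user's own p2p traffic on this
             * communicator keeps working without extra masking. */
            comm_validate(comm);
            comm_reenable_any_source(comm);
            nsenders--;  /* assume the failed rank will never send */
            continue;
        }
        received++;
    }
    return MPI_SUCCESS;
}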


On Wed, Jan 25, 2012 at 7:29 PM, Martin Schulz <schulzm at llnl.gov> wrote:

> Hi Josh, all,
> I agree, I think the current approach is fine. Blocking is likely to be
> more problematic in many cases, IMHO. However, I am still a bit worried
> about splitting the semantics for P2P and Collective routines. I don't see
> a reason why a communicator after a collective call to validate wouldn't
> support ANY_SOURCE. If it is split, though, any intermediate layer trying
> to replace collectives with P2P solutions (and there are plenty of tuning
> frameworks out there that try exactly that) has a hard time maintaining the
> same semantics in an error case.
> Martin
> On Jan 25, 2012, at 3:06 PM, Howard Pritchard wrote:
> > Hi Josh,
> >
> > Cray is okay with the semantics described in the current
> > FTWG proposal attached to the ticket.
> >
> > We plan to just leverage the out-of-band system fault
> > detector software that currently kills a job if
> > a node the job was running on goes down.
> >
> > Howard
> >
> > Josh Hursey wrote:
> >> We really need to make a decision on semantics for MPI_ANY_SOURCE.
> >>
> >> During the plenary session the MPI Forum had a problem with the current
> >> proposed semantics. The current proposal states (roughly) that receives
> >> on MPI_ANY_SOURCE return when a new failure emerges in the communicator. The
> >> MPI Forum read this as a strong requirement for -progress- (something
> >> the MPI standard tries to stay away from).
> >>
> >> The alternative proposal is that a receive on MPI_ANY_SOURCE will block
> >> until completed with a message. This means that it will -not- return
> >> when a new failure has been encountered (even if the calling process is
> >> the only process left alive in the communicator). This does get around
> >> the concern about progress, but puts a large burden on the end user.
> >>
> >>
> >> There are a couple of good use cases for MPI_ANY_SOURCE (grumble, grumble)
> >> - Manager/Worker applications, and easy load balancing when
> >> multiple incoming messages are expected. This blocking behavior makes
> >> the use of MPI_ANY_SOURCE dangerous for fault tolerant applications, and
> >> opens up another opportunity for deadlock (see the sketch below).
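
[Inline note: a minimal sketch of that deadlock under the blocking
alternative; RESULT_TAG and the loop structure are illustrative
assumptions, not part of the proposal.]

/* Sketch of the deadlock risk under the blocking alternative: if every
 * worker has failed, the manager's ANY_SOURCE receive can never be
 * matched, and (unlike the current proposal) never returns in error. */
#include <mpi.h>

#define RESULT_TAG 1  /* assumed tag for worker results */

void manager_loop(MPI_Comm comm, int ntasks)
{
    double result;
    MPI_Status status;
    for (int i = 0; i < ntasks; i++) {
        /* Blocks forever once all workers are dead: no message can
         * arrive, and the blocking semantic suppresses error return. */
        MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                 RESULT_TAG, comm, &status);
    }
}
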
> >>
> >> Applications that want to use MPI_ANY_SOURCE and be fault tolerant will
> >> need to build their own failure detector on top of MPI using directed
> >> point-to-point messages. A basic implementation might post an
> >> MPI_Irecv() to each worker process with an unused tag, then poll on
> >> MPI_Testany(). If any of these requests completes in error
> >> (MPI_ERR_PROC_FAIL_STOP), then the target has failed and the application
> >> can take action. This user-level failure detector can (should) be
> >> implemented in a third-party library, since failure detectors can be
> >> difficult to implement in a scalable manner.
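
[Inline note: a minimal sketch of such a user-level detector. FD_TAG,
the polling structure, and the manager-only layout are illustrative
assumptions; MPI_ERR_PROC_FAIL_STOP is from the draft proposal, not
standard MPI, so it appears only in comments.]

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define FD_TAG 32766  /* assumed: a tag no real message ever uses */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Errors must be returned to the caller, not treated as fatal. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0 && size > 1) {   /* manager watches every worker */
        int nworkers = size - 1;
        MPI_Request *reqs = malloc(nworkers * sizeof(MPI_Request));
        for (int i = 0; i < nworkers; i++)
            /* These receives can only complete in error: no worker
             * ever sends on FD_TAG. */
            MPI_Irecv(NULL, 0, MPI_BYTE, i + 1, FD_TAG,
                      MPI_COMM_WORLD, &reqs[i]);

        int done = 0;
        while (!done) {
            int idx, flag;
            MPI_Status status;
            int rc = MPI_Testany(nworkers, reqs, &idx, &flag, &status);
            if (rc != MPI_SUCCESS && idx != MPI_UNDEFINED) {
                /* Under the draft proposal the error class here would
                 * be MPI_ERR_PROC_FAIL_STOP: worker idx+1 has failed. */
                fprintf(stderr, "detected failure of rank %d\n", idx + 1);
            }
            /* ... interleave real work and real (ANY_SOURCE) receives
             * here; polling the detector avoids blocking forever on a
             * dead sender ... */
            done = 1;  /* placeholder so this sketch terminates */
        }
        free(reqs);
    }
    MPI_Finalize();
    return 0;
}
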
> >>
> >> In reality, the MPI library or the runtime system that supports MPI will
> >> already be doing something similar. Even for MPI_ERRORS_ARE_FATAL on
> >> MPI_COMM_WORLD, the underlying system must detect the process failure,
> >> and terminate all other processes in MPI_COMM_WORLD. So this represents
> >> a -detection- of the failure, and a -notification- of the failure
> >> throughout the system (though the notification is an order to
> >> terminate). For MPI_ERRORS_RETURN, the MPI library will use this
> >> detection/notification functionality to reason about the state of the
> >> message traffic in the system. So it seems silly to force the user to
> >> duplicate this (nontrivial) detection/notification functionality on top
> >> of MPI, just to avoid the progress discussion.
> >>
> >>
> >> So that is a rough summary of the debate. If we are going to move
> >> forward, we need to make a decision on MPI_ANY_SOURCE. I would like to
> >> make such a decision before/during the next teleconf (Feb. 1).
> >>
> >> I'm torn on this one, so I look forward to your comments.
> >>
> >> -- Josh
> >>
> >> --
> >> Joshua Hursey
> >> Postdoctoral Research Associate
> >> Oak Ridge National Laboratory
> >> http://users.nccs.gov/~jjhursey
> >>
> >
> >
> > --
> > Howard Pritchard
> > Software Engineering
> > Cray, Inc.
> ________________________________________________________________________
> Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
> CASC @ Lawrence Livermore National Laboratory, Livermore, USA
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory