[Mpi3-ft] MPI_ANY_SOURCE ... again...

Josh Hursey jjhursey at open-mpi.org
Thu Jan 26 09:20:37 CST 2012


If we were to change the MPI_ANY_SOURCE semantics to only return when
completed/matched (i.e., the suggested modification to the proposal), would
this make life more difficult for an intermediate layer virtualizing the
collective operations?

I am open to the idea that calling a collective validate() also re-enables
the posting of MPI_ANY_SOURCE receives. The reason we did not do that in
the current proposal is that we were uncertain whether someone would want
to validate, but not re-enable the posting of MPI_ANY_SOURCE receives.
Since re-enabling ANY_SOURCE is a local/quick operation, there was no
performance argument to be made. Though I think you pose an interesting
programmability argument.
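
To make that concrete, here is a minimal sketch of what the combined
semantic might look like from the application's side. MPI_Comm_validate
(and its signature) and MPI_ERR_PROC_FAIL_STOP are placeholder names from
the draft proposal, and buf/count/tag/comm are assumed to be in scope:

    double buf[1024];
    MPI_Status status;
    int rc, eclass, num_failed;

    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    rc = MPI_Recv(buf, 1024, MPI_DOUBLE, MPI_ANY_SOURCE, tag, comm,
                  &status);
    if (rc != MPI_SUCCESS) {
        MPI_Error_class(rc, &eclass);
        if (eclass == MPI_ERR_PROC_FAIL_STOP) {
            /* Collectively acknowledge the failure set ... */
            MPI_Comm_validate(comm, &num_failed);
            /* ... and, under the suggested change, wildcard receives
             * are immediately usable again with no separate
             * re-enable call. */
            rc = MPI_Recv(buf, 1024, MPI_DOUBLE, MPI_ANY_SOURCE, tag,
                          comm, &status);
        }
    }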

So can you elaborate a bit on your example (I just want to make sure I
fully understand)? Would such an intermediate layer be using MPI_ANY_SOURCE
in its collective operations and depend on the
return-when-new-proc-failure semantic? So the intermediate library would
have to re-enable ANY_SOURCE, then mask the semantics for user-initiated
p2p operations over the same communicator. If we added the
re-enable-ANY_SOURCE semantic to the validate, then the intermediate
library would not have to virtualize the p2p communication. Is that on the
right track?

Thanks,
Josh

On Wed, Jan 25, 2012 at 7:29 PM, Martin Schulz <schulzm at llnl.gov> wrote:

> Hi Josh, all,
>
> I agree, I think the current approach is fine. Blocking is likely to be
> more problematic in many cases, IMHO. However, I am still a bit worried
> about splitting the semantics for P2P and collective routines. I don't see
> a reason why a communicator, after a collective call to validate, wouldn't
> support ANY_SOURCE. If it is split, though, any intermediate layer trying
> to replace collectives with P2P solutions (and there are plenty of tuning
> frameworks out there that try exactly that) will have a hard time
> maintaining the same semantics in an error case.
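>
> For illustration (a sketch, not any particular framework's code): the
> root side of an MPI_Gather flattened into point-to-point might look
> like the code below, with recvbuf/nprocs/tag/comm assumed to be in
> scope. Under the split semantics, its wildcard receives would still be
> disabled after a validate, even though the MPI_Gather it replaces
> would work again, so the substitution is no longer transparent:
>
>     /* Root collects one block of 1024 doubles from each of the
>      * other nprocs-1 ranks, in whatever order they arrive. */
>     double tmp[1024];
>     MPI_Status status;
>     for (int i = 0; i < nprocs - 1; i++) {
>         MPI_Recv(tmp, 1024, MPI_DOUBLE, MPI_ANY_SOURCE, tag,
>                  comm, &status);
>         memcpy(&recvbuf[status.MPI_SOURCE * 1024], tmp,
>                1024 * sizeof(double));
>     }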
>
> Martin
>
>
> On Jan 25, 2012, at 3:06 PM, Howard Pritchard wrote:
>
> > Hi Josh,
> >
> > Cray is okay with the semantics described in the current
> > FTWG proposal attached to the ticket.
> >
> > We plan to just leverage the out-of-band system fault
> > detector software that currently kills a job if
> > a node that the job was running on goes down.
> >
> > Howard
> >
> > Josh Hursey wrote:
> >> We really need to make a decision on semantics for MPI_ANY_SOURCE.
> >>
> >> During the plenary session the MPI Forum had a problem with the current
> >> proposed semantics. The current proposal states (roughly) that a receive
> >> on MPI_ANY_SOURCE returns when a failure emerges in the communicator.
> >> The MPI Forum read this as a strong requirement for -progress-
> >> (something the MPI standard tries to stay away from).
> >>
> >> The alternative proposal is that a receive on MPI_ANY_SOURCE will block
> >> until it is completed with a message. This means that it will -not-
> >> return when a new failure has been encountered (even if the calling
> >> process is the only process left alive in the communicator). This gets
> >> around the concern about progress, but puts a large burden on the end
> >> user.
> >>
> >>
> >> There are a couple of good use cases for MPI_ANY_SOURCE (grumble,
> >> grumble): manager/worker applications, and easy load balancing when
> >> multiple incoming messages are expected. This blocking behavior makes
> >> the use of MPI_ANY_SOURCE dangerous for fault-tolerant applications,
> >> and opens up another opportunity for deadlock.
> >>
> >> Applications that want to use MPI_ANY_SOURCE and be fault tolerant
> >> will need to build their own failure detector on top of MPI using
> >> directed point-to-point messages. A basic implementation might post
> >> an MPI_Irecv() to each worker process with an unused tag, then poll on
> >> MPI_Testany(). If any of these requests completes in error
> >> (MPI_ERR_PROC_FAIL_STOP), then the target has failed and the
> >> application can take action. This user-level failure detector can
> >> (should) be implemented in a third-party library, since failure
> >> detectors can be difficult to implement in a scalable manner.
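> >>
> >> For concreteness, a minimal sketch of such a detector, assuming a
> >> manager at rank 0 watching workers at ranks 1..NWORKERS, where
> >> NWORKERS, DETECT_TAG, and handle_failure() are placeholders and
> >> MPI_ERR_PROC_FAIL_STOP is the draft-proposal error class:
> >>
> >>     /* One receive per worker on a tag nothing ever sends on;
> >>      * under the proposed semantics these requests can only
> >>      * complete in error, when the matching worker fails. */
> >>     MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
> >>     MPI_Request reqs[NWORKERS];
> >>     MPI_Status status;
> >>     for (int w = 0; w < NWORKERS; w++)
> >>         MPI_Irecv(NULL, 0, MPI_BYTE, w + 1, DETECT_TAG, comm,
> >>                   &reqs[w]);
> >>
> >>     /* Called periodically from the manager's progress loop.
> >>      * MPI_Testany returns the error of the completed operation
> >>      * directly, so classify the return code. */
> >>     int idx, flag, eclass;
> >>     int rc = MPI_Testany(NWORKERS, reqs, &idx, &flag, &status);
> >>     if (rc != MPI_SUCCESS && idx != MPI_UNDEFINED) {
> >>         MPI_Error_class(rc, &eclass);
> >>         if (eclass == MPI_ERR_PROC_FAIL_STOP)
> >>             handle_failure(idx + 1);  /* worker rank idx+1 failed */
> >>     }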
> >>
> >> In reality, the MPI library or the runtime system that supports MPI
> >> will already be doing something similar. Even for MPI_ERRORS_ARE_FATAL
> >> on MPI_COMM_WORLD, the underlying system must detect the process
> >> failure and terminate all other processes in MPI_COMM_WORLD. So this
> >> represents a -detection- of the failure and a -notification- of the
> >> failure throughout the system (though the notification is an order to
> >> terminate). For MPI_ERRORS_RETURN, the MPI library will use this
> >> detection/notification functionality to reason about the state of the
> >> message traffic in the system. So it seems silly to force the user to
> >> duplicate this (nontrivial) detection/notification functionality on
> >> top of MPI, just to avoid the progress discussion.
> >>
> >>
> >> So that is a rough summary of the debate. If we are going to move
> >> forward, we need to make a decision on MPI_ANY_SOURCE. I would like to
> >> make such a decision before/during the next teleconf (Feb. 1).
> >>
> >> I'm torn on this one, so I look forward to your comments.
> >>
> >> -- Josh
> >>
> >> --
> >> Joshua Hursey
> >> Postdoctoral Research Associate
> >> Oak Ridge National Laboratory
> >> http://users.nccs.gov/~jjhursey
> >>
> >
> >
> > --
> > Howard Pritchard
> > Software Engineering
> > Cray, Inc.
>
> ________________________________________________________________________
> Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
> CASC @ Lawrence Livermore National Laboratory, Livermore, USA
>
>
>
>
>
>


-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey