[Mpi3-ft] MPI_ANY_SOURCE ... again...

Thu Jan 26 09:48:26 CST 2012

I agree that the "MPI_ANY_SOURCE returns on error" semantics are better for users. However, if this is going to be a sticking point for the rest of the Forum, it is not actually that difficult to fake this functionality on top of the "MPI_ANY_SOURCE blocks" semantics. Josh, you correctly pointed out that the MPI implementation should be able to leverage its own out-of-band failure detectors to implement the "returns" functionality. But if that is the case, why can't the vendor provide an optional layer to the user that does exactly the same thing without involving the MPI Forum?

What I'm proposing is that vendors or any other software developers provide a failure notification layer that sends the process an MPI message on a pre-defined communicator. They would also provide a PMPI layer that wraps MPI_Recv(MPI_ANY_SOURCE) so that it alternates between testing for a message that matches the receive and testing for a failure notification message. If the former arrives first, the wrapper returns normally. If the latter arrives first, the original MPI_Recv(MPI_ANY_SOURCE) is cancelled and the call returns with an error. Conveniently, since the failure notifier and the PMPI layer are orthogonal, we can connect the application to any failure detector, making it possible to provide this even on systems whose vendors do not supply one.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com

> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Martin Schulz
> Sent: Wednesday, January 25, 2012 4:30 PM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI_ANY_SOURCE ... again...
> 
> Hi Josh, all,
> 
> I agree, I think the current approach is fine. Blocking is likely to be more
> problematic in many cases, IMHO. However, I am still a bit worried about
> splitting the semantics for P2P and Collective routines. I don't see a reason
> why a communicator after a collective call to validate wouldn't support
> ANY_SOURCE. If it is split, though, any intermediate layer trying to replace
> collectives with P2P solutions (and there are plenty of tuning frameworks out
> there that try exactly that) will have a hard time maintaining the same
> semantics in an error case.
> 
> Martin
> 
> 
> On Jan 25, 2012, at 3:06 PM, Howard Pritchard wrote:
> 
> > Hi Josh,
> >
> > Cray is okay with the semantics described in the current FTWG proposal
> > attached to the ticket.
> >
> > We plan to just leverage the out-of-band system fault detector
> > software that currently kills jobs if a node goes down that the job
> > was running on.
> >
> > Howard
> >
> > Josh Hursey wrote:
> >> We really need to make a decision on semantics for MPI_ANY_SOURCE.
> >>
> >> During the plenary session the MPI Forum had a problem with the
> >> current proposed semantics. The current proposal states (roughly)
> >> that a receive on MPI_ANY_SOURCE returns when a failure emerges in the
> >> communicator. The MPI Forum read this as a strong requirement for
> >> -progress- (something the MPI standard tries to stay away from).
> >>
> >> The alternative proposal is that a receive on MPI_ANY_SOURCE will
> >> block until completed with a message. This means that it will -not-
> >> return when a new failure has been encountered (even if the calling
> >> process is the only process left alive in the communicator). This
> >> does get around the concern about progress, but puts a large burden on
> >> the end user.
> >>
> >>
> >> There are a couple of good use cases for MPI_ANY_SOURCE (grumble,
> >> grumble)
> >> - Manager/Worker applications, and easy load balancing when multiple
> >> incoming messages are expected. This blocking behavior makes the use
> >> of MPI_ANY_SOURCE dangerous for fault tolerant applications, and
> >> opens up another opportunity for deadlock.
> >>
> >> Applications that want to use MPI_ANY_SOURCE and be fault tolerant
> >> will need to build their own failure detector on top of
> >> MPI using directed point-to-point messages. A basic implementation
> >> might post an MPI_Irecv() from each worker process with an unused tag,
> >> then poll with MPI_Testany(). If any of these requests completes in error
> >> (MPI_ERR_PROC_FAIL_STOP) then the target has failed and the
> >> application can take action. This user-level failure detector can
> >> (should) be implemented in a third-party library since failure
> >> detectors can be difficult to implement in a scalable manner.
> >>
> >> In reality, the MPI library or the runtime system that supports MPI
> >> will already be doing something similar. Even for
> >> MPI_ERRORS_ARE_FATAL on MPI_COMM_WORLD, the underlying system must
> >> detect the process failure, and terminate all other processes in
> >> MPI_COMM_WORLD. So this represents a -detection- of the failure, and
> >> a -notification- of the failure throughout the system (though the
> >> notification is an order to terminate). For MPI_ERRORS_RETURN, the
> >> MPI library will use this detection/notification functionality to reason
> >> about the state of the message traffic in the system. So it seems
> >> silly to force the user to duplicate this (nontrivial)
> >> detection/notification functionality on top of MPI, just to avoid the
> >> progress discussion.
> >>
> >>
> >> So that is a rough summary of the debate. If we are going to move
> >> forward, we need to make a decision on MPI_ANY_SOURCE. I would like
> >> to make such a decision before/during the next teleconf (Feb. 1).
> >>
> >> I'm torn on this one, so I look forward to your comments.
> >>
> >> -- Josh
> >>
> >> --
> >> Joshua Hursey
> >> Postdoctoral Research Associate
> >> Oak Ridge National Laboratory
> >> http://users.nccs.gov/~jjhursey
> >>
> >
> >
> > --
> > Howard Pritchard
> > Software Engineering
> > Cray, Inc.
> > _______________________________________________
> > mpi3-ft mailing list
> > mpi3-ft at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 
> ________________________________________________________________________
> Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm CASC @
> Lawrence Livermore National Laboratory, Livermore, USA