[Mpi3-ft] MPI_ANY_SOURCE ... again...

Sur, Sayantan sayantan.sur at intel.com
Thu Jan 26 15:27:40 CST 2012


Your point is well taken, but I do not agree with your assertion that the situation is "unrecoverable". For example, an app or library stuck in a blocking recv could use another thread to post a self-send that satisfies the blocked receive. I know this is not elegant, but it is still recoverable, right? :-)

I also disagree with your statement that this has no impact on implementation cost. It specifically requires the MPI library to maintain a sustained interaction with the process-management subsystem, polling/watching the entire system to see where a failure occurred. It needs to do so at some predetermined frequency, which may or may not match the granularity the application requires. In the case where a Recv(ANY) is posted on a sub-communicator, you would also need to convey the participants of that sub-communicator to the out-of-band system, which means you need to have the layout available. Another alternative is to send process-failure information to -everyone- and then have each MPI library pick the processes to deliver the error to. IMHO, this is significant implementation complexity - not saying it can't be done TODAY, but what about at scale a few years from now?

This is, after all, a compromise - if we (app writers + the MPI community) can live with something less than ideal, should we go for it?

Thanks.

===
Sayantan Sur, Ph.D.
Intel Corp.

> -----Original Message-----
> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> bounces at lists.mpi-forum.org] On Behalf Of Aurélien Bouteiller
> Sent: Thursday, January 26, 2012 10:40 AM
> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> Subject: Re: [Mpi3-ft] MPI_ANY_SOURCE ... again...
> 
> In my opinion, a fault-tolerant proposal in which a valid MPI program
> deadlocks, without a remedy, in case of failure does not address the issue
> and is worthless. I do not support the proposition that ANY_SOURCE should
> just not work. -Every- function of the MPI standard should have well-defined
> behavior in case of failure, behavior that does not leave the application in
> an unrecoverable state.
> 
> Moreover, the cost of handling any-source is inconsequential for
> implementation performance. The objections are a matter of intellectual
> tidiness, not of implementation cost. That is not a good reason to
> drop functionality to the point where the proposal no longer tolerates
> failures of valid MPI applications.
> 
> 
> Aurelien
> 
> 
> Le 26 janv. 2012 à 11:24, Bronevetsky, Greg a écrit :
> 
> > Actually, I think that we can be backwards compatible. Define semantics to
> say that we don't guarantee that MPI_ANY_SOURCE will get unblocked due
> to failure but don't preclude this possibility.
> >
> > I think I'm coming down on the blocks side as well, but for a less pessimistic
> > reason than Josh. Standards bodies are conservative for a reason: mistakes in
> > standards are expensive. As such, if a feature can be evaluated outside the
> > standard before being included in it, then that is the preferable path. The
> > "MPI_ANY_SOURCE returns" semantics is exactly such a feature. Sure, users
> > will be harmed in the short term, but if we standardize semantics that turn
> > out not to be quite right, they will be harmed in the long term.
> >
> > As such, let's go for the most barebones spec we can come up with on top
> of which we can implement all the other functionality we think is important.
> This gives us the flexibility to try out several APIs and decide on which is best
> before we come back before the forum to standardize the full MPI-FT API. At
> that point in time we'll have done a significantly stronger evaluation, which
> will make it much more difficult for the forum to say no, even though the list
> of features will be significantly more extensive in that proposal.
> >
> > Greg Bronevetsky
> > Lawrence Livermore National Lab
> > (925) 424-5756
> > bronevetsky at llnl.gov
> > http://greg.bronevetsky.com
> >
> > From: mpi3-ft-bounces at lists.mpi-forum.org
> > [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Josh Hursey
> > Sent: Thursday, January 26, 2012 8:11 AM
> > To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> > Subject: Re: [Mpi3-ft] MPI_ANY_SOURCE ... again...
> >
> > I mentioned something similar to folks here yesterday. If we decide on the
> "MPI_ANY_SOURCE blocks" semantics then I would expose the failure
> notification through an Open MPI specific API (or message in a special
> context, as Greg suggests), and the third-party library could short circuit their
> implementation of the detector/notifier with this call. I would then strongly
> advocate that other MPI implementations do the same.
> >
> > I think the distasteful aspect of the "MPI_ANY_SOURCE blocks" path is that,
> > if we are going to advocate that MPI implementations provide this type of
> > interface anyway, why can we not just expose these semantics in the
> > standard and be done with it? With the proposed workaround, applications
> > will be programming to the MPI standard plus the semantics of this
> > third-party library for MPI_ANY_SOURCE. So it seems like we are working
> > around a stalemate in the forum, and the users end up suffering.
> > However, if we stick with the "MPI_ANY_SOURCE returns on error"
> > semantics, we will have to thread the needle of the progress discussion
> > (possible, but it may take quite a long time).
> >
> > So this is where I'm torn. If there were a backwards-compatible path
> > from the "MPI_ANY_SOURCE blocks" semantics to the "MPI_ANY_SOURCE returns
> > on error" semantics, it would be easier to go with the former and propose
> > the latter in a separate ticket, and then have the progress discussion over
> > the separate ticket rather than the general RTS proposal. Maybe that path
> > is defining a new MPI_ANY_SOURCE_RETURN_ON_PROC_FAIL wildcard.
> >
> > I am, unfortunately, starting to lean towards the "MPI_ANY_SOURCE
> > blocks" camp. I say 'unfortunately' because it hurts users, and hurting
> > users is not something we should be doing, in my opinion...
> >
> > Good comments, keep them coming.
> >
> > -- Josh
> >
> > On Thu, Jan 26, 2012 at 10:48 AM, Bronevetsky, Greg
> <bronevetsky1 at llnl.gov> wrote:
> > I agree that the "MPI_ANY_SOURCE returns on error" semantics is better
> for users. However, if this is going to be a sticking point for the rest of the
> forum, it is not actually that difficult to fake this functionality on top of the
> "MPI_ANY_SOURCE blocks" semantics. Josh, you correctly pointed out that
> the MPI implementation should be able to leverage its own out-of-band
> failure detectors to implement the "returns" functionality but if that is the
> case, why can't the vendor provide an optional layer to the user that will do
> exactly the same thing but without messing with the MPI forum?
> >
> > What I'm proposing is that vendors or any other software developers
> > provide a failure-notification layer that sends an MPI message on a
> > predefined communicator to the process. They would also provide a PMPI
> > layer that wraps MPI_Recv(MPI_ANY_SOURCE) so that it alternates between
> > testing for the arrival of messages that match this operation and for a
> > failure-notification message. If the former arrives first, the wrapper
> > returns normally. If the latter arrives first, the original
> > MPI_Recv(MPI_ANY_SOURCE) is cancelled and the call returns with an
> > error. Conveniently, since the failure notifier and the PMPI layer are
> > orthogonal, we can connect the application to any failure detector, making
> > it possible to provide these for systems where the vendors are lazy.
> >
> > Greg Bronevetsky
> > Lawrence Livermore National Lab
> > (925) 424-5756
> > bronevetsky at llnl.gov
> > http://greg.bronevetsky.com
> >
> >
> > > -----Original Message-----
> > > From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-
> > > bounces at lists.mpi-forum.org] On Behalf Of Martin Schulz
> > > Sent: Wednesday, January 25, 2012 4:30 PM
> > > To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
> > > Group
> > > Subject: Re: [Mpi3-ft] MPI_ANY_SOURCE ... again...
> > >
> > > Hi Josh, all,
> > >
> > > I agree; I think the current approach is fine. Blocking is likely to
> > > be more problematic in many cases, IMHO. However, I am still a bit
> > > worried about splitting the semantics for P2P and collective
> > > routines. I don't see a reason why a communicator, after a collective
> > > call to validate, wouldn't support ANY_SOURCE. If the semantics are
> > > split, though, any intermediate layer trying to replace collectives with
> > > P2P solutions (and there are plenty of tuning frameworks out there
> > > that try exactly that) has a hard time maintaining the same semantics
> > > in an error case.
> > >
> > > Martin
> > >
> > >
> > > On Jan 25, 2012, at 3:06 PM, Howard Pritchard wrote:
> > >
> > > > Hi Josh,
> > > >
> > > > Cray is okay with the semantics described in the current FTWG
> > > > proposal attached to the ticket.
> > > >
> > > > We plan to just leverage the out-of-band system fault detector
> > > > software that currently kills jobs if a node goes down that the
> > > > job was running on.
> > > >
> > > > Howard
> > > >
> > > > Josh Hursey wrote:
> > > >> We really need to make a decision on semantics for MPI_ANY_SOURCE.
> > > >>
> > > >> During the plenary session the MPI Forum had a problem with the
> > > >> currently proposed semantics. The current proposal states (roughly)
> > > >> that MPI_ANY_SOURCE returns when a failure emerges in the
> > > >> communicator. The MPI Forum read this as a strong requirement for
> > > >> -progress- (something the MPI standard tries to stay away from).
> > > >>
> > > >> The alternative proposal is that a receive on MPI_ANY_SOURCE will
> > > >> block until it is completed by a message. This means that it will
> > > >> -not- return when a new failure has been encountered (even if the
> > > >> calling process is the only process left alive in the
> > > >> communicator). This does get around the concern about progress,
> > > >> but it puts a large burden on the end user.
> > > >>
> > > >>
> > > >> There are a couple of good use cases for MPI_ANY_SOURCE (grumble,
> > > >> grumble) - manager/worker applications, and easy load balancing when
> > > >> multiple incoming messages are expected. This blocking behavior
> > > >> makes the use of MPI_ANY_SOURCE dangerous for fault-tolerant
> > > >> applications, and opens up another opportunity for deadlock.
> > > >>
> > > >> For applications that want to use MPI_ANY_SOURCE and be fault
> > > >> tolerant, they will need to build their own failure detector on
> > > >> top of MPI using directed point-to-point messages. A basic
> > > >> implementation might post MPI_Irecv()'s to each worker process
> > > >> with an unused tag, then poll on Testany(). If any of these
> > > >> requests complete in error
> > > >> (MPI_ERR_PROC_FAIL_STOP) then the target has failed and the
> > > >> application can take action. This user-level failure detector can
> > > >> (should) be implemented in a third-party library since failure
> > > >> detectors can be difficult to implement in a scalable manner.
> > > >>
> > > >> In reality, the MPI library or the runtime system that supports
> > > >> MPI will already be doing something similar. Even for
> > > >> MPI_ERRORS_ARE_FATAL on MPI_COMM_WORLD, the underlying system
> > > >> must detect the process failure and terminate all other processes in
> > > >> MPI_COMM_WORLD. So this represents a -detection- of the failure,
> > > >> and a -notification- of the failure throughout the system (though
> > > >> the notification is an order to terminate). For
> > > >> MPI_ERRORS_RETURN, the MPI library will use this
> > > >> detection/notification functionality to reason about the state of
> > > >> the message traffic in the system. So it seems silly to force the
> > > >> user to duplicate this (nontrivial) detection/notification
> > > >> functionality on top of MPI just to avoid the progress discussion.
> > > >>
> > > >>
> > > >> So that is a rough summary of the debate. If we are going to move
> > > >> forward, we need to make a decision on MPI_ANY_SOURCE. I would
> > > >> like to make such a decision before/during the next teleconf (Feb. 1).
> > > >>
> > > >> I'm torn on this one, so I look forward to your comments.
> > > >>
> > > >> -- Josh
> > > >>
> > > >> --
> > > >> Joshua Hursey
> > > >> Postdoctoral Research Associate
> > > >> Oak Ridge National Laboratory
> > > >> http://users.nccs.gov/~jjhursey
> > > >>
> > > >
> > > >
> > > > --
> > > > Howard Pritchard
> > > > Software Engineering
> > > > Cray, Inc.
> > > > _______________________________________________
> > > > mpi3-ft mailing list
> > > > mpi3-ft at lists.mpi-forum.org
> > > > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> > >
> > >
> > > ______________________________________________________________________
> > > Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm CASC
> > > @ Lawrence Livermore National Laboratory, Livermore, USA
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
> > --
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey
> 
> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 350
> * Knoxville, TN 37996
> * 865 974 6321

More information about the mpiwg-ft mailing list