[Mpi3-ft] Ticket 323 - status?

Ralph Castain rhc at open-mpi.org
Thu May 31 20:56:25 CDT 2012


Hi Aurelien

Long time since we last chatted - hope all is well.

I understand your points and appreciate the clarification. Indeed, I
supported the argument for exactly this more limited version - i.e.,
limited to notification of an error - and still do.

Your comments about the lack of time are also compelling. I think there is
confusion (on my part, as well as others') as to what the "FT working group"
is proposing. We've heard a rather broad range of "proposals", from full
run-through all the way down to notification only. It is a little hard
sometimes to keep it all straight.

I can't speak for others, but I will personally take the time to look at
your code. I'm sure you have already done so, but could you humor me and
point us to it again?

Thanks
Ralph


On Thu, May 31, 2012 at 9:25 AM, Aurélien Bouteiller
<bouteill at eecs.utk.edu> wrote:

>
> On May 31, 2012, at 12:08, Bronis R. de Supinski wrote:
>
> >
> > Aurelien:
> >
> > The issue is not primarily for implementations that choose
> > not to support the interface. The concern is more about how
> > much complexity is added when the interface is supported.
> > If the interface is added to the standard, not supporting
> > it will be seen as a quality-of-implementation issue and
> > most implementations will be forced to support it.
> Bronis,
>
> I have not made myself clear here. The implementation I am talking about
> supports the entire interface. It just does so without reporting errors,
> and leaves MPI in an undefined state after process failures, as is
> specifically allowed by the draft of #323. This is not a "quality of
> implementation" issue, as there are very valid and justifiable reasons for
> an implementation to choose not to support fault tolerance, such as when
> the target hardware is reliable. However, in such an implementation, an FT
> application is still portable and supported, but it will not survive
> failures (due to lack of support from the MPI layer). The cost to
> implementors is negligible (mostly implementing empty stubs; the most
> "complex" function is agree, which is a straight remap to allreduce). The
> performance cost is zero.
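>
> As an illustration of how small such a stub can be, here is a minimal
> sketch (assuming the MPIX_ prefix and an agree signature patterned on the
> draft; the exact names in #323 may differ) of an agree that remaps
> directly to allreduce in an implementation that never reports failures:
>
>   #include <mpi.h>
>
>   /* Hypothetical stub: when no process is ever reported as failed,
>    * agreement on an integer flag reduces to a plain allreduce. */
>   int MPIX_Comm_agree(MPI_Comm comm, int *flag)
>   {
>       return MPI_Allreduce(MPI_IN_PLACE, flag, 1, MPI_INT,
>                            MPI_BAND, comm);
>   }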
>
> > Ultimately, the biggest concern was that insufficient time was
> > allowed to assess the new proposal, as well as alternatives.
> > That assessment may lead to the conclusion that proposal does
> > not suffer from many of the problems of the previous one but
> > we did not have the information or time to be certain.
> >
> I certainly have to agree with that. This newer proposal has been designed
> around the core idea of not intruding on the failure-free performance path
> and of being easy to implement (these are very related issues, in
> practice). However, the tight schedule left little time, if any, for a
> thorough evaluation outside the working group. Publications are under
> review, and the implementation is now out. We hope that outsiders will
> take the time to evaluate it further, as we believe the facts will clear
> most of the fears that were still, understandably, lingering due to lack
> of time.
>
>
> Aurelien
>
>
> >
> >
> > On Thu, 31 May 2012, Aurélien Bouteiller wrote:
> >
> >> The proposal does not add significant complexity to the code base,
> especially when no fault tolerance is supported. I'm not waving hands
> here: we have an implementation of this proposal that does not support
> fault tolerance, and the diff is less than 1k lines total, mostly "stub"
> code for the interfaces that remap to existing MPI constructs. On the
> matter of most interest to you, no changes are required in the runtime in
> this case (not a single line of diff). Please have a look at the
> implementation yourself; it is available.
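> >>
> >> To give a flavor of those stubs, a hypothetical example (the
> MPIX_Comm_shrink name and signature here are illustrative assumptions
> patterned on the draft interfaces): in an implementation that never
> reports failures there are no dead processes to exclude, so "shrink" can
> remap to a plain communicator duplication:
> >>
> >>   #include <mpi.h>
> >>
> >>   /* Hypothetical stub: with no failed processes to exclude, shrink
> >>    * degenerates to MPI_Comm_dup. */
> >>   int MPIX_Comm_shrink(MPI_Comm comm, MPI_Comm *newcomm)
> >>   {
> >>       return MPI_Comm_dup(comm, newcomm);
> >>   }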
> >>
> >> Besides, this proposal does not define a recovery model; it merely
> defines the error raised when failures happen and the resulting state of
> MPI, nothing more. Defining proper recovery models is left to user-level
> libraries, which benefit from this proposal by being portable across
> different MPI implementations (even those that do not support effective
> fault tolerance because the target machine is stable, or the implementors
> were lazy).
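> >>
> >> For example, the kind of fragment such a user-level recovery library
> builds on could look like this (assuming comm, src, and tag are in scope;
> the MPIX_ERR_PROC_FAILED error class and the recover_from_failure() helper
> are illustrative assumptions, not settled text from the proposal):
> >>
> >>   int buf, rc, eclass;
> >>
> >>   /* Have errors returned to the caller instead of aborting the job. */
> >>   MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
> >>
> >>   rc = MPI_Recv(&buf, 1, MPI_INT, src, tag, comm, MPI_STATUS_IGNORE);
> >>   if (rc != MPI_SUCCESS) {
> >>       MPI_Error_class(rc, &eclass);
> >>       if (eclass == MPIX_ERR_PROC_FAILED)   /* a peer died */
> >>           recover_from_failure(comm);       /* user-level recovery */
> >>   }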
> >>
> >> The large number of abstain votes shows that we didn't have time to
> convince people of the soundness of the approach; there was still
> significant uncertainty in the minds of many participants at the time of
> the vote. This is important information, as it seems we often hear
> comments that are founded not on hard facts, but rather on feelings or
> fears (including that the text had been changed very close to the
> deadline, a very valid concern). We might want to make sure to publicize
> more widely the results we already have; arguably, the tight schedule for
> 3.0 inclusion didn't leave much opportunity for this and has been
> detrimental.
> >>
> >> Aurelien
> >>
> >>
> >> On May 31, 2012, at 09:07, Ralph Castain wrote:
> >>
> >>> Guess we can agree to disagree on your conclusions. It is true that
> there are non-HPC users who want to run past errors (my own org included),
> but that problem is easier to solve than the overall MPI issue, as it
> doesn't involve things like collectives. As for HPC, I'll note that the
> OMPI mailing list has nearly zero questions about even the current level
> of FT, and those that do come in are from students doing research (not
> production users). And yes, I know bigger clusters are in the works, but
> as I noted, there are other solutions being worked on for them too.
> >>>
> >>> My point was solely that there are multiple ways of addressing the
> problem, and that making major alterations to MPI is only one of them -
> and possibly less attractive than some of the alternatives. So it may
> well be that people would prefer to leave MPI alone for now, pending more
> research that reduces the risk and performance penalties.
> >>>
> >>> In that vein, I would suggest you consider releasing an "FT-MPI"
> again, or perhaps defining a quasi-standard set of MPI extensions that an
> implementation that wants to provide an FT version could support. The
> latter would still meet your desire to provide FT to a broader audience,
> but would avoid forcing ALL implementations to support something that their
> customers may not want.
> >>>
> >>> Let me re-emphasize: I'm not saying this was all for naught. I'm
> trying to suggest reasons why the body might resist adopting this proposal,
> and possible paths forward that would still support the work. Forcing a
> community to adopt FT as "standard" may not be the best way in the near
> term.
> >>>
> >>> Ralph
> >>>
> >>>
> >>> On Wed, May 30, 2012 at 10:47 PM, Richard Graham <
> richardg at mellanox.com> wrote:
> >>> Actually, this is a working group that went out and spent quite a bit
> of effort to collect user input.  The issue is that it took a couple of
> years until someone had time to start an implementation; it then took time
> for users to try it out and provide feedback, and the broader forum only
> started providing more input once text was actually written.
> >>>
> >>> It is true that this is important to HPC; however, it is also true
> that there are quite a few users outside of the HPC community who would
> like to use MPI, but can’t because most or all MPIs currently terminate on
> process failure.  We had users from the HPC community attend for a while,
> users with enterprise needs, and we even had input from an online gaming
> company.  Even as recently as a couple of weeks ago, this same point was
> brought up by some at Mellanox, when it was not even the topic of
> discussion.  This is far from just a research activity of interest to a
> small number of users.
> >>>
> >>> Rich
> >>>
> >>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:
> mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Ralph Castain
> >>> Sent: Thursday, May 31, 2012 11:45 AM
> >>>
> >>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> >>> Subject: Re: [Mpi3-ft] Ticket 323 - status?
> >>>
> >>> Obviously, I can't speak for the folks who attended and voted "no",
> either directly or by abstaining. However, I have talked to at least a few
> people, and can offer a point or two about the concerns.
> >>>
> >>> First, the last study I saw published on the subject of FT for MPI
> showed a very low level of interest in FT within the MPI community. It
> based this on a usage analysis showing that over 90% of clusters are too
> small to see large failure rates. On the clusters that were large enough
> (primarily at the national labs, which pretty clearly voted no), over 80%
> of the MPI jobs lasted less than 1 hour.
> >>>
> >>> So the size of the community that potentially benefits from FT is very
> small. In contrast, despite assurances it would be turned off unless
> specifically requested, it was clear from the proposals that FT would
> impact a significant fraction of the code, thus raising the potential for a
> substantial round of debugging and instability.
> >>>
> >>> For that majority who would see little-to-no benefit, this isn't an
> attractive trade-off.
> >>>
> >>> Second, those who possibly could benefit tend to take a more holistic
> view of FT. If you step back and look at the cluster as a system, then
> there are multiple ways of addressing the problems of failure during long
> runs. Yes, one way is to harden MPI to such events, but that is probably
> the hardest solution.
> >>>
> >>> One easier way, and the one being widely touted at the moment, is to
> make application checkpointing a relatively low-cost event so that it can
> be done frequently. This is being commercialized as we speak through the
> addition of SSDs to the usual parallel file system, allowing a checkpoint
> to run at very high speed. In fact, "burst" buffers let the checkpoint
> dump very quickly and then slowly drain to disk, rendering the checkpoint
> operation very low cost. Given that the commercial interests coincide with
> the HPC interests, this solution is likely to be available from cluster
> suppliers very soon at an attractive price.
> >>>
> >>> Combined with measures to make restart very fast as well, this looks
> like an alternative that has no performance impact on the application at
> the MPI level, doesn't potentially destabilize the software, and may meet
> the majority of needs.
> >>>
> >>> I'm not touting this approach over any other, mind you - just trying
> to point out that the research interests of the FT/MPI group need to be
> considered in a separate light from the production interests of the
> community. What you may be experiencing (from my limited survey) is a
> reflection of that divergence.
> >>>
> >>> Ralph
> >>>
> >>>
> >>> On Wed, May 30, 2012 at 6:55 PM, George Bosilca <bosilca at eecs.utk.edu>
> wrote:
> >>>
> >>> On May 31, 2012, at 08:44, Martin Schulz wrote:
> >>>
> >>> Several people who abstained had very similar concerns, but chose the
> abstain vote since they knew it meant no,
> >>>
> >>> Your interpretation barely amounts to a "simple majority" in the
> forum, as highlighted by parallel discussions in the other email threads.
> But let's leave this discussion in its own thread.
> >>>
> >>> But you're right, both "no" and "abstain" votes should be addressed.
> Bill made his point very clear, and to be honest he was the only one who
> raised a __valid__ point about the FT proposal. Personally, I am looking
> forward to fruitful discussions during our weekly phone calls, where the
> complaints raised during the voting will be brought forward in a way that
> gives the working group a real opportunity to address them as they
> deserve. In other words, we are all counting on you guys to enlighten us
> on the major issues with this proposal and the potential solutions you
> envision or promote.
> >>>
> >>>  george.
> >>>
> >>> On May 31, 2012, at 08:44, Martin Schulz wrote:
> >>>
> >>> Hi George,
> >>>
> >>> One other no was Intel, as far as I remember, but I don't remember the
> 5th. However, I would suggest not focusing on the no votes alone. Several
> people who abstained had very similar concerns, but chose the abstain vote
> since they knew it meant no, though they agreed with the general necessity
> of FT for MPI. I remember, for example, Bill saying that for him abstain
> meant no, but that later changes could change his mind. Based on this
> interpretation, the ticket definitely had more than 5 no votes.
> >>>
> >>> Martin
> >>>
> >>> On May 31, 2012, at 8:34 AM, Darius Buntinas wrote:
> >>>
> >>> Argonne was not convinced that we (FTWG) had the right solution, and
> the large changes in the text mentioned previously did not instill
> confidence.  So it was decided that Argonne would vote against the ticket.
> >>>
> >>> -d
> >>>
> >>> On May 30, 2012, at 6:24 PM, George Bosilca wrote:
> >>>
> >>> In total there were 5 no votes. I wonder who the other two were; they
> might be willing to enlighten us on their reasons for voting against.
> >>>
> >>>  george.
> >>>
> >>> On May 31, 2012, at 05:48, Anthony Skjellum wrote:
> >>>
> >>> Three no votes were LLNL, Argonne, and Sandia.  Since MPI is heavily
> driven by DOE, convincing these folks would be important.
> >>>
> >>> Tony Skjellum, tonyskj at yahoo.com or skjellum at gmail.com
> >>> Cell 205-807-4968
> >>>
> >>> On May 31, 2012, at 5:10 AM, Richard Graham <richardg at mellanox.com>
> wrote:
> >>>
> >>> The main objection raised is that the text was still undergoing large
> changes, and that if not for the pressure of the 3.0 deadline, it would
> not have come up for a vote.  I talked one-on-one with many who either
> voted against or abstained, and this was the major (not the only) point
> raised.
> >>>
> >>> Rich
> >>>
> >>> -----Original Message-----
> >>>
> >>> From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:
> mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Aurélien Bouteiller
> >>>
> >>> Sent: Wednesday, May 30, 2012 10:05 PM
> >>> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> >>> Subject: Re: [Mpi3-ft] Ticket 323 - status?
> >>>
> >>> It seems we had very little, if any, technical opposition to the
> content of the proposal itself, but mostly comments on the process. I
> think we need to better understand what the objections are. Do we have a
> list of who voted for and against, and their rationale?
> >>>
> >>> Aurelien
> >>>
> >>> On May 30, 2012, at 08:52, Josh Hursey wrote:
> >>>
> >>> That is unfortunate. A close vote (7 yes to 9 no/abstain). :/
> >>>
> >>> Thanks,
> >>> Josh
> >>>
> >>> On Wed, May 30, 2012 at 8:38 AM, Thomas Herault
> >>>
> >>> <herault.thomas at gmail.com> wrote:
> >>>
> >>> On May 30, 2012, at 01:44, George Bosilca wrote:
> >>>
> >>> The ticket has been voted down. Come back in 6 months, maybe 3.1. The
> votes were 7 yes, 4 abstentions, and 5 no.
> >>>
> >>> Thomas
> >>>
> >>> On May 30, 2012, at 07:02, Josh Hursey wrote:
> >>>
> >>> How did the vote go for the fault tolerance ticket 323?
> >>>
> >>> -- Josh
> >>>
> >>> --
> >>>
> >>> Joshua Hursey
> >>>
> >>> Postdoctoral Research Associate
> >>>
> >>> Oak Ridge National Laboratory
> >>>
> >>> http://users.nccs.gov/~jjhursey
> >>>
> >>> --
> >>>
> >>> * Dr. Aurélien Bouteiller
> >>>
> >>> * Researcher at Innovative Computing Laboratory
> >>>
> >>> * University of Tennessee
> >>>
> >>> * 1122 Volunteer Boulevard, suite 350
> >>>
> >>> * Knoxville, TN 37996
> >>>
> >>> * 865 974 9375
> >>>
> ________________________________________________________________________
> >>> Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
> >>> CASC @ Lawrence Livermore National Laboratory, Livermore, USA
> >>>
> >>
> >> --
> >> * Dr. Aurélien Bouteiller
> >> * Researcher at Innovative Computing Laboratory
> >> * University of Tennessee
> >> * 1122 Volunteer Boulevard, suite 309b
> >> * Knoxville, TN 37996
> >> * 865 974 9375
> >>
>
> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 309b
> * Knoxville, TN 37996
> * 865 974 9375
>
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>