[Mpi3-ft] Ticket 323 - status?

Ralph Castain rhc at open-mpi.org
Thu May 31 07:57:33 CDT 2012


Sure - I'll return to my office in about 10 days and will pass it along
then. It was a LANL study that is frequently cited.

On Thu, May 31, 2012 at 6:32 AM, Josh Hursey <jjhursey at open-mpi.org> wrote:

> Ralph,
>
> You cite a published study. Can you provide a link to the resource?
>
> -- Josh
>
> On Wed, May 30, 2012 at 10:18 PM, Ralph Castain <rhc at open-mpi.org> wrote:
> > Obviously, I can't speak for the folks who attended and voted "no", either
> > directly or by abstaining. However, I have talked to at least a few people,
> > and can offer a point or two about the concerns.
> >
> > First, the last study I saw published on the subject of FT for MPI showed
> > a very low level of interest in FT within the MPI community. It based this
> > on a usage analysis showing that something over 90% of clusters are too
> > small to see large failure rates. On the clusters that were large enough
> > (primarily at the national labs, who pretty clearly voted no), over 80% of
> > the MPI jobs lasted less than 1 hour.
> >
> > So the size of the community that potentially benefits from FT is very
> > small. In contrast, despite assurances it would be turned off unless
> > specifically requested, it was clear from the proposals that FT would
> > impact a significant fraction of the code, thus raising the potential for
> > a substantial round of debugging and instability.
> >
> > For that majority who would see little-to-no benefit, this isn't an
> > attractive trade-off.
> >
> > Second, those who possibly could benefit tend to take a more holistic
> > view of FT. If you step back and look at the cluster as a system, there
> > are multiple ways of addressing the problem of failures during long runs.
> > Yes, one way is to harden MPI against such events, but that is probably
> > the hardest solution.
> >
> > One easier way, and the one being most widely touted at the moment, is to
> > make checkpointing an application a relatively low-cost event so that it
> > can be done frequently. This is being commercialized as we speak through
> > the addition of SSDs to the usual parallel file system, letting a
> > checkpoint run at very high speed. In fact, "burst" buffers allow the
> > checkpoint to dump very quickly and then slowly drain to disk, rendering
> > the checkpoint operation very low cost. Given that the commercial
> > interests coincide with the HPC interests, this solution is likely to be
> > available from cluster suppliers very soon at an attractive price.
> >
> > Combined with measures to make restart very fast as well, this looks like
> > an alternative that has no performance impact on the application at the
> > MPI level, doesn't risk destabilizing the software, and may meet the
> > majority of needs.
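[Editor's note] The burst-buffer scheme described above amounts to a two-tier checkpoint: the application blocks only for a fast dump to local SSD, while the slow copy to the parallel file system drains in the background. A minimal sketch of that pattern, purely as an editorial illustration (none of this code is from the thread; the function names, JSON format, and directory layout are all hypothetical):

```python
# Sketch of two-tier ("burst buffer") checkpointing: fast blocking dump,
# then an asynchronous drain to slower persistent storage.
import json
import shutil
import threading
from pathlib import Path

def checkpoint(state: dict, burst_dir: Path, pfs_dir: Path) -> threading.Thread:
    """Dump `state` to the fast burst-buffer tier, then drain it to the
    (slower) parallel-file-system tier in a background thread."""
    burst_dir.mkdir(parents=True, exist_ok=True)
    pfs_dir.mkdir(parents=True, exist_ok=True)
    ckpt = burst_dir / "ckpt.json"
    ckpt.write_text(json.dumps(state))      # fast write: only blocking step
    drain = threading.Thread(               # slow copy runs off the critical path
        target=shutil.copy2, args=(ckpt, pfs_dir / "ckpt.json"))
    drain.start()
    return drain                            # caller may join() before a restart

def restart(pfs_dir: Path) -> dict:
    """Recover the last drained state from the persistent tier."""
    return json.loads((pfs_dir / "ckpt.json").read_text())
```

The application pays only for the fast local write; joining the drain thread before a restart ensures the slow copy has landed on the persistent tier.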
> >
> > I'm not touting this approach over any other, mind you - just trying to
> > point out that the research interests of the FT/MPI group need to be
> > considered in a separate light from the production interests of the
> > community. What you may be experiencing (from my limited survey) is a
> > reflection of that divergence.
> >
> > Ralph
> >
> >
> >
> > On Wed, May 30, 2012 at 6:55 PM, George Bosilca <bosilca at eecs.utk.edu>
> > wrote:
> >>
> >> On May 31, 2012, at 08:44, Martin Schulz wrote:
> >>
> >> Several people who abstained had very similar concerns, but chose the
> >> abstain vote since they knew it meant no,
> >>
> >>
> >> Your interpretation barely commands a "simple majority" in the forum, as
> >> highlighted by parallel discussions in the other email threads. But let's
> >> leave this discussion in its own thread.
> >>
> >> But you're right, both the "no" and "abstain" votes should be addressed.
> >> Bill made his point very clear, and to be honest he was the only one who
> >> raised a __valid__ point about the FT proposal. Personally, I am looking
> >> forward to fruitful discussions during our weekly phone calls, where the
> >> complaints raised during the voting can be brought forward in a way that
> >> gives the working group a real opportunity to address them as they
> >> deserve. In other words, we are all counting on you guys to enlighten us
> >> on the major issues with this proposal and the potential solutions you
> >> envision or promote.
> >>
> >>   george.
> >>
> >> On May 31, 2012, at 08:44, Martin Schulz wrote:
> >>
> >> Hi George,
> >>
> >> One other no was Intel, as far as I remember, but I don't remember the
> >> 5th. However, I would suggest not focusing on the no votes alone. Several
> >> people who abstained had very similar concerns, but chose the abstain
> >> vote since they knew it meant no, although they agreed with the general
> >> necessity of FT for MPI. I remember, for example, Bill saying that for
> >> him abstain meant no, but that later changes could change his mind. Based
> >> on this interpretation, the ticket definitely had more than 5 no votes.
> >>
> >> Martin
> >>
> >>
> >> On May 31, 2012, at 8:34 AM, Darius Buntinas wrote:
> >>
> >>
> >> Argonne was not convinced that we (FTWG) had the right solution, and the
> >> large changes in the text mentioned previously did not instill
> >> confidence. So it was decided that Argonne would vote against the ticket.
> >>
> >> -d
> >>
> >> On May 30, 2012, at 6:24 PM, George Bosilca wrote:
> >>
> >> In total there were 5 no votes. I wonder who the other two were; they
> >> might be willing to enlighten us on their reasons for voting against.
> >>
> >>
> >> george.
> >>
> >>
> >> On May 31, 2012, at 05:48, Anthony Skjellum wrote:
> >>
> >>
> >> Three no votes were LLNL, Argonne, and Sandia.  Since MPI is heavily
> >> driven by DOE, convincing these folks would be important.
> >>
> >>
> >> Tony Skjellum, tonyskj at yahoo.com or skjellum at gmail.com
> >>
> >> Cell 205-807-4968
> >>
> >>
> >> On May 31, 2012, at 5:10 AM, Richard Graham <richardg at mellanox.com>
> wrote:
> >>
> >>
> >> The main objection raised is that the text was still undergoing large
> >> changes, and if not for the pressure of the 3.0 deadline, it would not
> >> have come up for a vote. I talked one-on-one with many who either voted
> >> against or abstained, and this was the major (though not the only) point
> >> raised.
> >>
> >>
> >> Rich
> >>
> >>
> >> -----Original Message-----
> >>
> >> From: mpi3-ft-bounces at lists.mpi-forum.org
> >> [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Aurélien
> >> Bouteiller
> >>
> >> Sent: Wednesday, May 30, 2012 10:05 PM
> >>
> >> To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> >>
> >> Subject: Re: [Mpi3-ft] Ticket 323 - status?
> >>
> >>
> >> It seems we had very little, if any, technical opposition to the content
> >> of the proposal itself, but mostly comments on the process. I think we
> >> need to understand the objections better. Do we have a list of who voted
> >> for and against, and their rationale?
> >>
> >>
> >> Aurelien
> >>
> >>
> >>
> >> On May 30, 2012, at 08:52, Josh Hursey wrote:
> >>
> >>
> >> That is unfortunate. A close vote (7 yes to 9 no/abstain). :/
> >>
> >>
> >> Thanks,
> >>
> >> Josh
> >>
> >>
> >> On Wed, May 30, 2012 at 8:38 AM, Thomas Herault
> >>
> >> <herault.thomas at gmail.com> wrote:
> >>
> >> On May 30, 2012, at 01:44, George Bosilca wrote:
> >>
> >>
> >> The ticket has been voted down. Come back in 6 months, maybe 3.1. The
> >> votes were 7 yes, 4 abstentions, and 5 no.
> >>
> >>
> >> Thomas
> >>
> >>
> >> On May 30, 2012, at 07:02, Josh Hursey wrote:
> >>
> >>
> >> How did the vote go for the fault tolerance ticket 323?
> >>
> >>
> >> -- Josh
> >>
> >>
> >> --
> >>
> >> Joshua Hursey
> >>
> >> Postdoctoral Research Associate
> >>
> >> Oak Ridge National Laboratory
> >>
> >> http://users.nccs.gov/~jjhursey
> >>
> >> _______________________________________________
> >>
> >> mpi3-ft mailing list
> >>
> >> mpi3-ft at lists.mpi-forum.org
> >>
> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> --
> >>
> >> * Dr. Aurélien Bouteiller
> >>
> >> * Researcher at Innovative Computing Laboratory
> >>
> >> * University of Tennessee
> >>
> >> * 1122 Volunteer Boulevard, suite 350
> >>
> >> * Knoxville, TN 37996
> >>
> >> * 865 974 9375
> >>
> >>
> >>
> >>
> >> ________________________________________________________________________
> >> Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulzm
> >> CASC @ Lawrence Livermore National Laboratory, Livermore, USA
> >>
> >>
> >>
> >
> >
> >
>
>
>
>
>