[Mpi3-ft] Communicator Virtualization as a step forward

Thu Feb 19 10:16:32 CST 2009

I understand why you want to specify what a level of support means 
but I think that this cannot and should not be done as part of the MPI spec.

It cannot be done because we simply have no concepts in MPI such as a 
switch or a router or a compute node. As such, how do we talk about 
the failure of one of these? Also, I cannot imagine that we'll be 
able to foresee the failure modes of systems that will be fielded in 
10 years. How can we possibly write about them? We can draft a 
separate document of best practices that provides specific examples 
on specific systems. However, such details don't belong in the spec.

It should not be done because it is unnecessary. Users don't care 
about which events trigger fault notification. They care that the 
rate of unrecoverable errors is lower than a given bound and the rate 
of recoverable errors is lower than some other, much looser bound. A 
given system (MPI + hardware) can satisfy this requirement in such a 
wide variety of ways that it is simply inadvisable to restrict it. 
For example, the system designers can simply spend more money on more 
reliable hardware. This is the choice made by BlueGene. Others may 
save money on hardware but provide a more reliable MPI that survives 
the failures that happen to be most common on this system.
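
As a rough illustration of this criterion, here is a minimal sketch in 
Python. The class name, the function name and every number in it are 
made up for the example; none of them come from the MPI spec or from 
any proposal. It only shows that two very different systems can 
satisfy the same pair of bounds.

```python
from dataclasses import dataclass


@dataclass
class FailureLog:
    """Hypothetical failure counts observed for one system (MPI + hardware)."""
    hours_observed: float
    unrecoverable: int  # failures that aborted the job
    recoverable: int    # failures delivered to the application as notifications


def meets_reliability_target(log, max_unrecoverable_per_hour, max_recoverable_per_hour):
    """Return True if both observed failure rates are within the user's bounds."""
    unrecoverable_rate = log.unrecoverable / log.hours_observed
    recoverable_rate = log.recoverable / log.hours_observed
    return (unrecoverable_rate <= max_unrecoverable_per_hour
            and recoverable_rate <= max_recoverable_per_hour)


# Assumed bounds: at most one abort every two weeks, and a much looser
# bound of one recoverable fault every twelve hours.
UNRECOVERABLE_BOUND = 1 / (14 * 24)
RECOVERABLE_BOUND = 1 / 12

# Expensive, reliable hardware with trivial fault notification support.
reliable_hw = FailureLog(hours_observed=1000, unrecoverable=1, recoverable=0)

# Cheaper hardware whose MPI converts most failures into notifications.
cheap_hw_ft_mpi = FailureLog(hours_observed=1000, unrecoverable=1, recoverable=40)

# The same cheaper hardware with an MPI that aborts on every failure.
cheap_hw_plain_mpi = FailureLog(hours_observed=1000, unrecoverable=41, recoverable=0)

for name, log in [("reliable hardware", reliable_hw),
                  ("cheap hardware + FT MPI", cheap_hw_ft_mpi),
                  ("cheap hardware + plain MPI", cheap_hw_plain_mpi)]:
    ok = meets_reliability_target(log, UNRECOVERABLE_BOUND, RECOVERABLE_BOUND)
    print(f"{name}: {'meets' if ok else 'misses'} the target")
```

The check never looks at which low-level events caused the failures, 
only at the two rates, which is the point of the argument above.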

Given that it is a bad idea to put these constraints in the MPI spec, 
I suggest two options that we can take. One is to write a separate 
document describing some best practices. Another is to be very 
public about the techniques used to make the reference implementation 
support fault notification so that other implementations can follow 
suit. In the end I expect this feature to follow system designers and 
user demands. If a given user writes into their contract a certain 
bound on unrecoverable failures, the system vendor will have to work 
out how much money/performance to spend on hardware and how much on 
improved fault notification support. The entire point of the fault 
notification API is to make it possible for users and system 
designers to talk about recoverable vs unrecoverable failures, rather 
than just unrecoverable ones. In the end, I think that the whole 
thing will be driven by the reliability requirements of large-scale 
customers such as the US national labs.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov
http://greg.bronevetsky.com

At 07:18 AM 2/19/2009, Supalov, Alexander wrote:
>Thanks, this makes sense.
>
>I expect that performance overheads will be noticeable, so that 
>there will most likely be FT- and nonFT-versions of the MPI 
>libraries, just like now there are MT- and nonMT-versions, basically 
>for the same reason. Still, even there people can ask for different 
>levels of MT support, thus matching their actual needs to the level 
>of service provided and thus the expected overheads.
>
>This is where I want to cite your reply: "since most MPI 
>implementations want to support large-scale systems as well as 
>smaller ones, these implementations will provide a way for 
>applications to request different levels of fault notification support".
>
>If this is foreseeable, why not help out with this right now? Do 
>we think the problem is too hard to solve or do we want to let MPIs 
>settle into their ways and thus practically identify the most 
>reasonable levels to be standardized later?
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org 
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
>Sent: Tuesday, February 17, 2009 6:18 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
>Yes, this is definitely a topic for a voice conversation but let me
>continue this over email because I think that I can describe my
>position more precisely based on your comments. The test of whether
>the fault notification API is working is not whether the
>underlying MPI implementation is guaranteed to do something when some
>low-level event happens. It is working if the rate of unrecoverable
>failures is below an application-specified bound, with all the
>remaining failures converted into notifications. The same holds for
>MPI_Send/MPI_Recv, which are judged not by whether they provide a
>certain level of performance but by whether they deliver data. These
>are both high-level specifications and we don't care about how they
>are implemented at the network level.
>
>Having said that, there is one large difference between the two
>cases, which is the social dynamics. The MPI_Send/MPI_Recv API has
>non-trivial support in all MPI implementations and you can get better
>MPI performance by either using the same MPI implementation on a
>better machine or by switching to a higher performing MPI
>implementation. This fluidity works to help encourage MPI
>implementations to support MPI_Send/MPI_Recv because application
>developers will be confident that they'll get good use out of them.
>
>In contrast, with the fault notification API the odds are that some
>MPI implementations will provide trivial support for the fault
>notification API while others will provide good or configurable
>support. As such, if a developer wants their unrecoverable error rate
>to be below a certain bound, they will have less freedom to match MPI
>implementations to systems because if a given machine has too high an
>error rate, an unreliable MPI implementation will not provide
>sufficient reliability no matter what you do. You will need to switch
>to a different MPI implementation. Your claim is that this will cause
>application developers to not bother coding fault tolerance into
>their applications because they can't be sure that it will do any
>good on their favorite MPI implementation. This makes sense but I
>believe that it is incorrect.
>
>The reason is that the driver for fault tolerance is not applications
>or MPI implementations but the hardware systems. Today's large
>machines are sufficiently unreliable that today's applications have
>to provide their own fault tolerance solutions without any support
>from MPI. For large-scale applications, fault tolerance is not a
>luxury but a necessity, so if they need to switch MPIs in order to
>get it, they will (assuming that not too much performance is lost).
>If an existing application only runs on small, reliable systems, it
>will never need to be recoded to use the fault notification API.
>However, if it grows to a large enough scale for faults to be a
>problem, fault notification support will need to be coded in because
>there is no other choice besides just failing all the time. As such,
>the reduced "liquidity" of the fault notification API is more than
>offset by the strong driving force that system failures have on
>large-scale applications. Using it will not be a choice but a
>necessity and as such, there will be a sub-community of
>applications/MPI implementations that use it/provide it, and the rest
>who ignore it. More likely, since most MPI implementations want to
>support large-scale systems as well as smaller ones, these
>implementations will provide a way for applications to request
>different levels of fault notification support, which will then
>overcome the "liquidity" problem that you've identified.
>
>So that's my argument in favor of fault notification. Fault
>notification is a well-defined and useful API that is less "liquid"
>than much of the existing MPI specification. However, because there
>is a significant sub-community of applications and systems on which
>faults are a real problem, this weakness will be more than offset by
>the sheer necessity of running on systems that fail frequently. In
>this context the fault notification API will allow MPI implementations
>to bound the rate of unrecoverable failures even on unreliable
>hardware platforms.
>
>Greg Bronevetsky
>Post-Doctoral Researcher
>1028 Building 451
>Lawrence Livermore National Lab
>(925) 424-5756
>bronevetsky1 at llnl.gov
>http://greg.bronevetsky.com
>
>At 04:55 AM 2/17/2009, Supalov, Alexander wrote:
> >Hi,
> >
> >Thanks. I think we're coming from different corners, and the main
> >problem is that the two criteria of yours, namely, "self-consistent
> >specification and usefulness of the chosen abstraction level" are
> >not sufficient for me. The key here appears to be in the
> >"usefulness" that you seem to understand as "this can be used in
> >principle on my future Petascale machine running my favorite MPI"
> >and mine "to be used in a real application, an API should be useful
> >in a wide, a priori known variety of practically relevant situations
> >that do not depend on the MPI implementation at hand or the
> >platform involved".
> >
> >Maybe we should have a call or something to discuss this live,
> >without email getting in the way. The rest is just an outline of what
> >I'd have to say if we met.
> >
> >I say that we should define a standard set of the faults that will
> >be detected, and then say what implementations should provide what
> >level of service to be compliant with a particular level of the FT
> >support that the standard is to specify in clear terms. If we don't
> >do this, we will only prepare a good ground for MPI-4, where we will
> >have to fix the flaws of MPI-3. I'm afraid that at the moment
> >we're driving in this very direction, and this is why.
> >
> >The ultimate test of a spec is a number of implementations that are
> >widely used as specified. I'm concerned that the current
> >notification spec is too weak to be appealing to the commercial
> >users and implementors. I'm concerned about this because the
> >semantics of an API that kicks in under unknown circumstances are
> >surely ill-defined, and the ROI of using it in any given application
> >cannot be assessed upfront. This means that the investment is
> >unlikely to be made - either on the commercial MPI implementation or
> >on the commercial MPI application side of the equation.
> >
> >In other words, why would I care to use an API if I were not aware
> >of when and how this would help my application run weeks rather than
> >days between the faults? And I don't mean here a student with a
> >diploma or PhD thesis to write and forget about. I mean real life
> >commercial developers who need to justify every bit of what they are
> >doing by showing positive ROI to their management in the times of
> >worldwide economic crisis. Justify by promising to earn more money
> >than they are going to spend for the development, or else.
> >
> >Now, I can't comment on the networking topology example of yours
> >because I cannot fully follow the logic. Let me try to give another,
> >hopefully more practical example.
> >
> >Consider interaction of the MPI_Init/MPI_Finalize with the
> >underlying OS and job managers. It's about as undefined as that of
> >the mentioned calls with the checkpointer. Nevertheless, many usable
> >implementations can cope with that quite nicely. By the way, this is
> >how the checkpoint/restarting could work as well, providing the
> >checkpointer at hand with the MPI ready to be checkpointed/restarted
> >at a well defined point of the MPI program. After all, this is how
> >it is done now: a configuration flag tells the MPI what checkpointer
> >to expect. This could be refined, turned into dynamic recognition of
> >the active checkpointer, etc. This is all trivial.
> >
> >Imagine now that we have an API instead of the MPI_Init/MPI_Finalize
> >that says: "Well, it sort of starts and terminates an MPI job, but
> >one cannot guarantee how many processes will be started nor whether
> >they are usable after this because one cannot explain really in what
> >conditions the above is true. Anyway, this is all too low level for
> >the MPI standard to deal with. So be it because we say this
> >interface is self-contained and useful."
> >
> >The provided interface is self-consistent. It sort of starts a job.
> >Is it practically useful? No, because one cannot know when it works
> >at all. What is the expected commercial user reaction to this
> >interface? Steer clear of it or, even worse, assume on the limited
> >past student experience that the MPI implementation at hand is the
> >only correct one, and then blame all others that they do a sloppy
> >job when the product should be ported elsewhere. Neither would be
> >good for the MPI standard.
> >
> >Cf. process spawning and 1-sided comm, the biggest failures of the
> >MPI standard to date in practical terms. Were they self-consistent
> >and useful in the sense that you advocate? Sure. Were they useful in
> >the sense that I advocate? No. And they duly faded into obscurity,
> >because they did not pass the ultimate test even though they did
> >meet the criteria that you claim are necessary and sufficient for
> >the MPI standard. I pray you consider this before we exchange
> >another couple of emails. I'm just trying to help.
> >
> >Best regards.
> >
> >Alexander
> >
> >-----Original Message-----
> >From: mpi3-ft-bounces at lists.mpi-forum.org
> >[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
> >Sent: Monday, February 16, 2009 5:58 PM
> >To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> >Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> >
> >MPI never places conditions on how the MPI implementation does its
> >job. It never says whether MPI uses static or dynamic routing or how
> >much performance degrades as a result of the application using a
> >particular communication pattern, which depends closely on the
> >physical network topology. These are simply issues that are too low
> >level for the MPI spec to define although they very much matter to
> >application developers. What it does instead is define a set of
> >semantics for MPI_Send, MPI_Recv, etc. that apply regardless of all
> >those details and accepts the compromise: it's an internally
> >self-consistent spec that doesn't fully define all the relevant facts
> >about the system. The key thing is that it is a useful compromise.
> >Thus, we have two tests for a candidate API: self-consistent
> >specification and usefulness of the chosen abstraction level.
> >
> >The fault notification API makes exactly the same compromise as the
> >overall MPI spec. It doesn't say anything about how faults are
> >detected since that is a very low-level network matter. However, it
> >presents a self-consistent high-level specification that allows
> >applications to react to any such errors. Furthermore, it is clearly
> >useful. It is a red herring to worry about which low-level events
> >will cause which high-level notifications. The only relevant thing is
> >the probability of unrecoverable errors. Application developers do not
> >want their applications to abort randomly with a frequency higher than
> >once every few days or weeks. If it is higher, those unrecoverable
> >failures must be converted into recoverable failures by the MPI
> >library and given to the application via the fault notification API.
> >This is the entire function of the fault notification API: to allow
> >MPI to convert unrecoverable system failures (currently they're all
> >unrecoverable) into recoverable failures. This makes it possible for
> >customers to buy systems that fail relatively frequently while making
> >them usable by making their applications fault tolerant. Thus, the
> >fault notification API is both self-consistent and useful, passing
> >both tests of the MPI spec.
> >
> >In contrast, the checkpointing API is a useful but not self-consistent
> >API. Its semantics require details (i.e. interactions with the
> >checkpointer) that are too low-level to be specified in the MPI spec.
> >As a result, it needs additional mechanisms that allow individual MPI
> >implementations to provide the information that cannot be detailed in
> >the MPI spec.
> >
> >Thus, these two APIs are not at all similar unless you wish to argue
> >that 1. the MPI spec is ill-defined because it doesn't specify the
> >network topology or that 2. the semantics of being notified of a
> >fault are ill-defined. If you wish to argue the latter, I would love
> >to see examples because they would need to be fixed before this API
> >is ready to go before the forum.
> >
> >Greg Bronevetsky
> >Post-Doctoral Researcher
> >1028 Building 451
> >Lawrence Livermore National Lab
> >(925) 424-5756
> >bronevetsky1 at llnl.gov
> >
> >At 06:04 AM 2/16/2009, Supalov, Alexander wrote:
> > >Thanks. I think that since the notification API does not provide any
> > >guarantee as to which kinds of faults are treated how, the whole thing
> > >becomes a negotiation between the MPI implementation and the
> > >underlying networking layers. Moreover, it becomes a negotiation of
> > >sorts between the application and the MPI implementation, because
> > >the application cannot know upfront which faults will be treated in which way.
> > >
> > >This, in my mind, is very comparable to, if not worse than, the
> > >negotiation between the MPI_prepare_for_checkpoint &
> > >MPI_Restart_after_checkpoint implementation on one hand, and the
> > >checkpointer involved on the other hand.
> > >
> > >Frankly, I don't see any difference here, or, if any, one in favor
> > >of the checkpointing interface.
> > >
> > >Anyway, thanks for the clarification.
> > >
> > >-----Original Message-----
> > >From: mpi3-ft-bounces at lists.mpi-forum.org
> > >[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
> > >Sent: Friday, February 13, 2009 7:18 PM
> > >To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> > >Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> > >
> > >At 10:10 AM 2/13/2009, Supalov, Alexander wrote:
> > > >Thanks. Could you please clarify to me, if possible, using some
> > > >practically relevant example, how fault notification for a set of
> > > >undefined fault types that may vary from MPI implementation to
> > > >implementation differs from the equally abstract
> > > >MPI_Checkpoint/MPI_Restart that semantically clearly prepare the MPI
> > > >implementation at hand for the checkpoint action done by the
> > > >checkpointing system involved, and then semantically clearly recover
> > > >the MPI part of the program after the system restore?
> > >
> > >Simple. As you've pointed out, the checkpointing API is well defined
> > >from the application's point of view. However, its semantics are weak
> > >from the checkpointer's point of view. Seen from this angle, it is
> > >not clear what the checkpointer can expect from the MPI library and
> > >the whole thing devolves into a negotiation between individual
> > >checkpointers and individual MPI libraries on a variety of specific
> > >system configurations. In contrast, the fault notification API only
> > >has an application view, which is in fact well-defined. The weakness
> > >of the fault notification API is what you've already described, that
> > >it provides no guarantees about the quality of the implementation in
> > >a way that is more significant than for other portions of MPI, such
> > >as network details for MPI_Send/MPI_Recv.
> > >
> > >Greg Bronevetsky
> > >Post-Doctoral Researcher
> > >1028 Building 451
> > >Lawrence Livermore National Lab
> > >(925) 424-5756
> > >bronevetsky1 at llnl.gov
> > >
> > >_______________________________________________
> > >mpi3-ft mailing list
> > >mpi3-ft at lists.mpi-forum.org
> > >http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> > >---------------------------------------------------------------------
> > >Intel GmbH
> > >Dornacher Strasse 1
> > >85622 Feldkirchen/Muenchen Germany
> > >Sitz der Gesellschaft: Feldkirchen bei Muenchen
> > >Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
> > >Registergericht: Muenchen HRB 47456 Ust.-IdNr.
> > >VAT Registration No.: DE129385895
> > >Citibank Frankfurt (BLZ 502 109 00) 600119052
> > >
> > >This e-mail and any attachments may contain confidential material for
> > >the sole use of the intended recipient(s). Any review or distribution
> > >by others is strictly prohibited. If you are not the intended
> > >recipient, please contact the sender and delete all copies.