[Mpi3-ft] Communicator Virtualization as a step forward
Supalov, Alexander
alexander.supalov at intel.com
Thu Feb 19 09:18:29 CST 2009
Thanks, this makes sense.
I expect that the performance overheads will be noticeable, so there will most likely be FT and non-FT versions of the MPI libraries, just as there are now MT and non-MT versions, basically for the same reason. Still, even there people can ask for different levels of MT support, matching their actual needs to the level of service provided and hence to the expected overheads.
This is where I want to cite your reply: "since most MPI implementations want to support large-scale systems as well as smaller ones, these implementations will provide a way for applications to request different levels of fault notification support".
If this is foreseeable, why not help with it right now? Do we think the problem is too hard to solve, or do we want to let MPI implementations settle into their ways and thereby identify, in practice, the most reasonable levels to be standardized later?
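For illustration, here is a minimal sketch of the request/provide
handshake we already have for threading levels, with the fault-tolerance
counterpart shown only as a hypothetical comment (MPIX_Init_ft and
MPIX_FT_NOTIFY are invented names used to make the analogy concrete,
not anything the working group has proposed):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Existing MPI-2 mechanism: the application requests a threading
       level and the library reports the level it actually provides. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "asked for MPI_THREAD_MULTIPLE, got level %d\n",
                provided);

    /* A fault-tolerance analogue could negotiate a notification level
       in the same way.  Hypothetical sketch only:

         int ft_provided;
         MPIX_Init_ft(&argc, &argv, MPIX_FT_NOTIFY, &ft_provided);
    */

    MPI_Finalize();
    return 0;
}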
-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
Sent: Tuesday, February 17, 2009 6:18 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
Yes, this is definitely a topic for a voice conversation, but let me
continue over email because I think I can describe my position more
precisely based on your comments. The definition of whether the fault
notification API is working is not whether the underlying MPI
implementation is guaranteed to do something when some low-level event
happens. It is working if the rate of unrecoverable failures is below
an application-specified bound, with all the remaining failures
converted into notifications. The same holds for MPI_Send/MPI_Recv,
which work not because they provide a certain level of performance but
because they deliver data. Both are high-level specifications, and we
don't care how they are implemented at the network level.
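(As a local reference point, MPI already lets an application trade the
default abort-on-error behavior for returned error codes via
communicator error handlers. The sketch below uses only standard calls
- MPI_Comm_set_errhandler, MPI_ERRORS_RETURN, MPI_Error_string - and is
not the proposed notification API; whether a given low-level fault
actually surfaces as a returned error code here is exactly the
implementation-dependent part we are debating.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 42, rc = MPI_SUCCESS;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask for error codes instead of the default job abort. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0)
        rc = MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        rc = MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);

    if (rc != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: communication failed: %s\n", rank, msg);
        /* Application-level recovery would start here instead of an
           abort; bounding how often we end up here is the point of
           the notification API. */
    }

    MPI_Finalize();
    return 0;
}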
Having said that, there is one large difference between the two
cases: the social dynamics. The MPI_Send/MPI_Recv API has non-trivial
support in all MPI implementations, and you can get better MPI
performance either by using the same MPI implementation on a better
machine or by switching to a higher-performing MPI implementation.
This fluidity helps encourage MPI implementations to support
MPI_Send/MPI_Recv, because application developers can be confident
that they'll get good use out of them.
In contrast, with the fault notification API the odds are that some
MPI implementations will provide only trivial support while others
will provide good or configurable support. As such, if developers want
their unrecoverable error rate to be below a certain bound, they will
have less freedom to match MPI implementations to systems: if a given
machine has too high an error rate, an unreliable MPI implementation
will not provide sufficient reliability no matter what they do, and
they will need to switch to a different MPI implementation. Your claim
is that this will cause application developers not to bother coding
fault tolerance into their applications, because they can't be sure it
will do any good on their favorite MPI implementation. This makes
sense, but I believe it is incorrect.
The reason is that the driver for fault tolerance is not applications
or MPI implementations but the hardware systems. Today's large
machines are sufficiently unreliable that today's applications have
to provide their own fault tolerance solutions without any support
from MPI. For large-scale applications, fault tolerance is not a
luxury but a necessity, so if they need to switch MPIs in order to
get it, they will (assuming that not too much performance is lost).
If an existing application only runs on small, reliable systems, it
will never need to be recoded to use the fault notification API.
However, if it grows to a large enough scale for faults to be a
problem, fault notification support will need to be coded in because
there is no other choice besides just failing all the time. As such,
the reduced "liquidity" of the fault notification API is more than
offset by the strong driving force that system failures have on
large-scale applications. Using it will not be a choice but a
necessity, and as such there will be a sub-community of applications
and MPI implementations that use and provide it, while the rest ignore
it. More likely, since most MPI implementations want to
support large-scale systems as well as smaller ones, these
implementations will provide a way for applications to request
different levels of fault notification support, which will then
overcome the "liquidity" problem that you've identified.
So that's my argument in favor of fault notification. Fault
notification is a well-defined and useful API that is less "liquid"
than much of the existing MPI specification. However, because there
is a significant sub-community of applications and systems on which
faults are a real problem, this weakness will be more than offset by
the sheer necessity of running on systems that fail frequently. In
this context the fault notification API will allow MPI
implementations to bound the rate of unrecoverable failures even on
unreliable hardware platforms.
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov
http://greg.bronevetsky.com
At 04:55 AM 2/17/2009, Supalov, Alexander wrote:
>Hi,
>
>Thanks. I think we're coming from different corners, and the main
>problem is that your two criteria, namely "self-consistent
>specification and usefulness of the chosen abstraction level", are
>not sufficient for me. The key appears to lie in "usefulness", which
>you seem to understand as "this can be used in principle on my future
>Petascale machine running my favorite MPI", while my understanding is
>"to be used in a real application, an API should be useful in a wide,
>a priori known variety of practically relevant situations that do not
>depend on the MPI implementation at hand or the platform involved".
>
>Maybe we should have a call or something to discuss this live,
>without email getting in the way. The rest is just an outline of what
>I'd have to say if we met.
>
>I say that we should define a standard set of faults that will be
>detected, and then say what level of service implementations must
>provide to be compliant with a particular level of FT support, which
>the standard should specify in clear terms. If we don't do this, we
>will only prepare the ground for MPI-4, where we will have to fix the
>flaws of MPI-3. I'm afraid that at the moment we're heading in this
>very direction, and here is why.
>
>The ultimate test of a spec is the number of implementations that
>are widely used as specified. I'm concerned that the current
>notification spec is too weak to appeal to commercial users and
>implementors. I'm concerned because the semantics of an API that
>kicks in under unknown circumstances are surely ill-defined, and the
>ROI of using it in any given application cannot be assessed upfront.
>This means that the investment is unlikely to be made - either on the
>commercial MPI implementation side or on the commercial MPI
>application side of the equation.
>
>In other words, why would I care to use an API if I were not aware
>of when and how it would help my application run weeks rather than
>days between faults? And I don't mean here a student with a diploma
>or PhD thesis to write and then forget about. I mean real-life
>commercial developers who need to justify every bit of what they are
>doing by showing positive ROI to their management in times of
>worldwide economic crisis - justify it by promising to earn more
>money than they are going to spend on the development, or else.
>
>Now, I can't comment on your networking topology example because I
>cannot fully follow the logic. Let me try to give another, hopefully
>more practical example.
>
>Consider the interaction of MPI_Init/MPI_Finalize with the
>underlying OS and job managers. It's about as undefined as the
>interaction of the mentioned calls with the checkpointer.
>Nevertheless, many usable implementations cope with it quite nicely.
>By the way, this is how checkpoint/restart could work as well:
>provide the checkpointer at hand with an MPI that is ready to be
>checkpointed/restarted at a well-defined point of the MPI program.
>After all, this is how it is done now: a configuration flag tells the
>MPI what checkpointer to expect. This could be refined, turned into
>dynamic recognition of the active checkpointer, etc. This is all
>trivial.
>
>Imagine now that instead of MPI_Init/MPI_Finalize we have an API
>that says: "Well, it sort of starts and terminates an MPI job, but
>one cannot guarantee how many processes will be started, nor whether
>they are usable afterwards, because one cannot really explain under
>what conditions the above is true. Anyway, this is all too low-level
>for the MPI standard to deal with. So be it, because we say this
>interface is self-consistent and useful."
>
>The provided interface is self-consistent. It sort of starts a job.
>Is it practically useful? No, because one cannot know when it works
>at all. What is the expected commercial user reaction to this
>interface? Steer clear of it or, even worse, assume on the basis of
>limited past student experience that the MPI implementation at hand
>is the only correct one, and then blame all the others for doing a
>sloppy job when the product has to be ported elsewhere. Neither would
>be good for the MPI standard.
>
>Cf. process spawning and 1-sided comm, the biggest failures of the
>MPI standard to date in practical terms. Were they self-consistent
>and useful in the sense that you advocate? Sure. Were they useful in
>the sense that I advocate? No. And they duly faded into obscurity,
>because they did not pass the ultimate test even though they did
>meet the criteria that you claim are necessary and sufficient for
>the MPI standard. I pray you consider this before we exchange
>another couple of emails. I'm just trying to help.
>
>Best regards.
>
>Alexander
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
>Sent: Monday, February 16, 2009 5:58 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
>Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
>MPI never places conditions on how the MPI implementation does its
>job. It never says whether MPI uses static or dynamic routing, or how
>much performance degrades as a result of the application using a
>particular communication pattern, which depends closely on the
>physical network topology. These are simply issues that are too
>low-level for the MPI spec to define, although they very much matter
>to application developers. What the spec does instead is define a set
>of semantics for MPI_Send, MPI_Recv, etc. that apply regardless of
>all those details, and it accepts the compromise: it's an internally
>self-consistent spec that doesn't fully define all the relevant facts
>about the system. The key thing is that it is a useful compromise.
>Thus, we have two tests for a candidate API: self-consistent
>specification and usefulness of the chosen abstraction level.
>
>The fault notification API makes exactly the same compromise as the
>overall MPI spec. It doesn't say anything about how faults are
>detected since that is a very low-level network matter. However, it
>presents a self-consistent high-level specification that allows
>applications to react to any such errors. Furthermore, it is clearly
>useful. It is a red herring to worry about which low-level events
>will cause which high-level notifications. The only relevant thing is
>the probability of unrecoverable errors. Application developers do
>not want their applications to abort randomly more often than once
>every few days or weeks. If the rate is higher, those unrecoverable
>failures must be converted into recoverable failures by the MPI
>library and given to the application via the fault notification API.
>This is the entire function of the fault notification API: to allow
>MPI to convert unrecoverable system failures (currently they're all
>unrecoverable) into recoverable failures. This makes it possible for
>customers to buy systems that fail relatively frequently while making
>them usable by making their applications fault tolerant. Thus, the
>fault notification API is both self-consistent and useful, passing
>both tests of the MPI spec.
>
>In contrast, the checkpointing API is useful but not self-consistent.
>Its semantics require details (i.e., interactions with the
>checkpointer) that are too low-level to be specified in the MPI spec.
>As a result, it needs additional mechanisms that allow individual MPI
>implementations to provide the information that cannot be detailed in
>the MPI spec.
>
>Thus, these two APIs are not at all similar unless you wish to argue
>that 1. the MPI spec is ill-defined because it doesn't specify the
>network topology or that 2. the semantics of being notified of a
>fault are ill-defined. If you wish to argue the latter, I would love
>to see examples because they would need to be fixed before this API
>is ready to go before the forum.
>
>Greg Bronevetsky
>Post-Doctoral Researcher
>1028 Building 451
>Lawrence Livermore National Lab
>(925) 424-5756
>bronevetsky1 at llnl.gov
>
>At 06:04 AM 2/16/2009, Supalov, Alexander wrote:
> >Thanks. I think that since the notification API does not provide any
> >guarantee as to which kinds of faults are treated in which way, the
> >whole thing becomes a negotiation between the MPI implementation and
> >the underlying networking layers. Moreover, it becomes a negotiation
> >of sorts between the application and the MPI implementation, because
> >the application cannot know upfront which faults will be treated in
> >which way.
> >
> >This is, in my mind, very comparable to, if not worse than, the
> >negotiation between the MPI_prepare_for_checkpoint &
> >MPI_Restart_after_checkpoint implementation on one hand and the
> >checkpointer involved on the other.
> >
> >Frankly, I don't see any difference here, or, if any, one in favor
> >of the checkpointing interface.
> >
> >Anyway, thanks for the clarification.
> >
> >-----Original Message-----
> >From: mpi3-ft-bounces at lists.mpi-forum.org
> >[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
> >Sent: Friday, February 13, 2009 7:18 PM
> >To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
> >Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
> >Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
> >
> >At 10:10 AM 2/13/2009, Supalov, Alexander wrote:
> > >Thanks. Could you please clarify for me, if possible, using some
> > >practically relevant example, how fault notification for a set of
> > >undefined fault types that may vary from one MPI implementation to
> > >another differs from the equally abstract MPI_Checkpoint/MPI_Restart,
> > >which semantically clearly prepares the MPI implementation at hand
> > >for the checkpoint action performed by the checkpointing system
> > >involved, and then semantically clearly recovers the MPI part of the
> > >program after the system restore?
> >
> >Simple. As you've pointed out, the checkpointing API is well defined
> >from the application's point of view. However, its semantics are weak
> >from the checkpointer's point of view. Seen from this angle, it is
> >not clear what the checkpointer can expect from the MPI library and
> >the whole thing devolves into a negotiation between individual
> >checkpointers and individual MPI libraries on a variety of specific
> >system configurations. In contrast, the fault notification API only
> >has an application view, which is in fact well-defined. The weakness
> >of the fault notification API is what you've already described, that
> >it provides no guarantees about the quality of the implementation in
> >a way that is more significant than for other portions of MPI, such
> >as network details for MPI_Send/MPI_Recv.
> >
> >Greg Bronevetsky
> >Post-Doctoral Researcher
> >1028 Building 451
> >Lawrence Livermore National Lab
> >(925) 424-5756
> >bronevetsky1 at llnl.gov
> >
_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft