[Mpi3-ft] Communicator Virtualization as a step forward

Supalov, Alexander alexander.supalov at intel.com
Tue Feb 17 06:55:33 CST 2009


Hi,

Thanks. I think we're coming at this from different corners, and the main problem is that your two criteria, namely "self-consistent specification and usefulness of the chosen abstraction level", are not sufficient for me. The key appears to lie in "usefulness", which you seem to understand as "this can be used in principle on my future Petascale machine running my favorite MPI", while my reading is "to be useful in a real application, an API should work in a wide, a priori known variety of practically relevant situations that do not depend on the MPI implementation at hand or on the platform involved".

Maybe we should have a call or something to discuss this live, without email getting in the way. The rest is just an outline of what I'd have to say if we met.

I say that we should define a standard set of faults that will be detected, and then state what level of service an implementation must provide to be compliant with a particular level of FT support, with that level specified by the standard in clear terms. If we don't do this, we will only prepare the ground for an MPI-4 in which we will have to fix the flaws of MPI-3. I'm afraid that at the moment we're driving in this very direction, and here is why.
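
For illustration only, here is a minimal sketch of what such a fault taxonomy and compliance levels could look like in C. Every name below is hypothetical; nothing of the sort exists in any MPI document, and I am not proposing wording, just the shape of the contract:

    /* Purely illustrative: a possible standard fault taxonomy and the
       compliance levels tied to it. None of these names exist in any
       MPI document; they only show the shape of the contract. */

    /* Fault classes that the standard would require to be detected. */
    typedef enum {
        MPIX_FAULT_PROCESS_DEAD,   /* a rank terminated abnormally */
        MPIX_FAULT_LINK_DOWN,      /* connectivity to a rank lost  */
        MPIX_FAULT_DATA_CORRUPT    /* message integrity failure    */
    } MPIX_Fault_class;

    /* FT support levels: each level states which fault classes a
       compliant implementation must detect and report. */
    typedef enum {
        MPIX_FT_LEVEL_0,   /* no detection guaranteed                 */
        MPIX_FT_LEVEL_1,   /* process faults detected and reported    */
        MPIX_FT_LEVEL_2    /* all classes above detected and reported */
    } MPIX_FT_level;

With something like this in the text, an application could query the advertised level upfront and know, a priori, which faults it will be told about on any compliant implementation.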

The ultimate test of a spec is the number of implementations that are widely used as specified. I'm concerned that the current notification spec is too weak to be appealing to commercial users and implementors. I'm concerned because the semantics of an API that kicks in under unknown circumstances are surely ill-defined, and the ROI of using it in any given application cannot be assessed upfront. This means that the investment is unlikely to be made, either on the commercial MPI implementation side or on the commercial MPI application side of the equation.

In other words, why would I care to use an API if I were not aware of when and how it would help my application run weeks rather than days between faults? And I don't mean here a student with a diploma or PhD thesis to write and then forget about. I mean real-life commercial developers who need to justify every bit of what they are doing by showing positive ROI to their management in times of worldwide economic crisis. Justify it by promising to earn more money than they are going to spend on the development, or else.

Now, I can't comment on your networking topology example because I cannot fully follow the logic. Let me try to give another, hopefully more practical, example.

Consider the interaction of MPI_Init/MPI_Finalize with the underlying OS and job managers. It's about as undefined as the interaction of those calls with the checkpointer. Nevertheless, many usable implementations cope with it quite nicely. By the way, this is how checkpoint/restart could work as well: the checkpointer at hand is given an MPI library that is ready to be checkpointed and restarted at a well-defined point of the MPI program. After all, this is how it is done now: a configuration flag tells the MPI library which checkpointer to expect. This could be refined, turned into dynamic recognition of the active checkpointer, and so on. It is all trivial.
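
To make the "well-defined point" concrete, here is a minimal sketch, for illustration only, built on the MPI_prepare_for_checkpoint/MPI_Restart_after_checkpoint calls I mentioned earlier in this thread (quoted below). The signatures are my assumption, and the stubs exist only so the sketch compiles stand-alone:

    /* A minimal sketch, for illustration only: a well-defined
       checkpoint point in an MPI program. The two checkpoint calls
       are the ones discussed in this thread; their signatures are my
       assumption, and the stubs stand in for a real implementation
       that would quiesce and re-establish the MPI internals. */
    #include <mpi.h>

    int MPI_prepare_for_checkpoint(void)   { return MPI_SUCCESS; }
    int MPI_Restart_after_checkpoint(void) { return MPI_SUCCESS; }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        for (int step = 0; step < 1000; step++) {
            /* ... compute and communicate ... */

            if (step % 100 == 99) {
                /* Quiesce MPI so that the configured (or dynamically
                   recognized) checkpointer sees a well-defined state. */
                MPI_prepare_for_checkpoint();
                /* The external checkpointer takes its snapshot here,
                   outside of MPI's control. */
                MPI_Restart_after_checkpoint();
            }
        }

        MPI_Finalize();
        return 0;
    }

The application knows exactly where MPI is checkpointable, and the checkpointer knows exactly what state to expect there.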

Imagine now that we have an API instead of MPI_Init/MPI_Finalize that says: "Well, it sort of starts and terminates an MPI job, but one cannot guarantee how many processes will be started, nor whether they will be usable afterwards, because one cannot really explain under what conditions the above holds. Anyway, this is all too low-level for the MPI standard to deal with. So be it, because we say this interface is self-contained and useful."

The provided interface is self-consistent. It sort of starts a job. Is it practically useful? No, because one cannot know when it works at all. What is the expected commercial user reaction to such an interface? Steer clear of it or, even worse, assume on the basis of limited past student experience that the MPI implementation at hand is the only correct one, and then blame all the others for doing a sloppy job when the product has to be ported elsewhere. Neither outcome would be good for the MPI standard.

Cf. process spawning and one-sided communication, the biggest failures of the MPI standard to date in practical terms. Were they self-consistent and useful in the sense that you advocate? Sure. Were they useful in the sense that I advocate? No. And they duly faded into obscurity, because they did not pass the ultimate test, even though they met the criteria that you claim are necessary and sufficient for the MPI standard. I pray you consider this before we exchange another couple of emails. I'm just trying to help.

Best regards.

Alexander

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
Sent: Monday, February 16, 2009 5:58 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward

MPI never places conditions on how the MPI implementation does its job. It never says whether MPI uses static or dynamic routing, or how much performance degrades as a result of the application using a particular communication pattern, which depends closely on the physical network topology. These issues are simply too low-level for the MPI spec to define, although they very much matter to application developers. What the spec does instead is define a set of semantics for MPI_Send, MPI_Recv, etc. that apply regardless of all those details, and it accepts the compromise: it's an internally self-consistent spec that doesn't fully define all the relevant facts about the system. The key thing is that it is a useful compromise. Thus, we have two tests for a candidate API: self-consistent specification and usefulness of the chosen abstraction level.

The fault notification API makes exactly the same compromise as the overall MPI spec. It doesn't say anything about how faults are detected, since that is a very low-level network matter. However, it presents a self-consistent high-level specification that allows applications to react to any such errors. Furthermore, it is clearly useful. It is a red herring to worry about which low-level events will cause which high-level notifications. The only relevant thing is the probability of unrecoverable errors. Users do not want their applications to randomly abort more often than once every few days or weeks. If the frequency is higher, those unrecoverable failures must be converted into recoverable failures by the MPI library and given to the application via the fault notification API. This is the entire function of the fault notification API: to allow MPI to convert unrecoverable system failures (currently they're all unrecoverable) into recoverable failures. This makes it possible for customers to buy systems that fail relatively frequently and still keep them usable, by making their applications fault tolerant. Thus, the fault notification API is both self-consistent and useful, passing both tests of the MPI spec.
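
To illustrate the intended usage pattern, and only the pattern, here is a sketch of how an application might register for such notifications. The MPIX_* names and signatures are placeholders I am making up for this email, not the working group's actual binding, and the registration call is stubbed so the sketch compiles stand-alone:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical callback type: invoked when the library converts
       a system failure into a recoverable fault on a communicator. */
    typedef void (*MPIX_Fault_handler)(MPI_Comm comm, int failed_rank);

    /* Hypothetical registration call, stubbed here; a real version
       would live inside the MPI library. */
    static int MPIX_Register_fault_handler(MPI_Comm comm,
                                           MPIX_Fault_handler fn)
    {
        (void)comm; (void)fn;
        return MPI_SUCCESS;
    }

    static void on_fault(MPI_Comm comm, int failed_rank)
    {
        (void)comm;
        /* Instead of the whole job aborting, the application reacts:
           it might shrink its working set, respawn the rank, or roll
           back to its last checkpoint. */
        fprintf(stderr, "rank %d failed; starting recovery\n",
                failed_rank);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPIX_Register_fault_handler(MPI_COMM_WORLD, on_fault);
        /* ... the application runs; failures that would have been
           fatal now arrive as recoverable notifications via
           on_fault() ... */
        MPI_Finalize();
        return 0;
    }

Note that nothing in this sketch depends on which low-level event triggered the notification; that is exactly the abstraction level I am arguing for.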

In contrast, the checkpointing API is useful but not self-consistent. Its semantics require details (i.e., interactions with the checkpointer) that are too low-level to be specified in the MPI spec. As a result, it needs additional mechanisms that allow individual MPI implementations to provide the information that cannot be detailed in the MPI spec.

Thus, these two APIs are not at all similar, unless you wish to argue that (1) the MPI spec is ill-defined because it doesn't specify the network topology, or (2) the semantics of being notified of a fault are ill-defined. If you wish to argue the latter, I would love to see examples, because they would need to be fixed before this API is ready to go before the forum.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov

At 06:04 AM 2/16/2009, Supalov, Alexander wrote:
>Thanks. I think that since the notification API does not provide any
>guarantee as to which kinds of faults are treated in which way, the
>whole thing becomes a negotiation between the MPI implementation and
>the underlying networking layers. Moreover, it becomes a negotiation
>of sorts between the application and the MPI implementation, because
>the application cannot know upfront which faults will be treated in
>which way.
>
>This is, in my mind, very comparable to, if not worse than, the
>negotiation between the MPI_prepare_for_checkpoint and
>MPI_Restart_after_checkpoint implementation on the one hand, and the
>checkpointer involved on the other.
>
>Frankly, I don't see any difference here, or, if any, one in favor 
>of the checkpointing interface.
>
>Anyway, thanks for the clarification.
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org 
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
>Sent: Friday, February 13, 2009 7:18 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working 
>Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
>At 10:10 AM 2/13/2009, Supalov, Alexander wrote:
> >Thanks. Could you please clarify to me, if possible, using some
> >practically relevant example, how fault notification for a set of
> >undefined fault types that may vary from one MPI implementation to
> >another differs from the equally abstract
> >MPI_Checkpoint/MPI_Restart, whose semantics clearly prepare the MPI
> >implementation at hand for the checkpoint action performed by the
> >checkpointing system involved, and then clearly recover the MPI
> >part of the program after the system restore?
>
>Simple. As you've pointed out, the checkpointing API is well defined
>from the application's point of view. However, its semantics are weak
>from the checkpointer's point of view. Seen from this angle, it is
>not clear what the checkpointer can expect from the MPI library, and
>the whole thing devolves into a negotiation between individual
>checkpointers and individual MPI libraries on a variety of specific
>system configurations. In contrast, the fault notification API only
>has an application view, which is in fact well-defined. The weakness
>of the fault notification API is the one you've already described:
>it provides no guarantees about the quality of the implementation,
>in a way that is more significant than for other portions of MPI,
>such as the network details behind MPI_Send/MPI_Recv.
>
>Greg Bronevetsky
>Post-Doctoral Researcher
>1028 Building 451
>Lawrence Livermore National Lab
>(925) 424-5756
>bronevetsky1 at llnl.gov