[Mpi3-ft] Communicator Virtualization as a step forward

Greg Bronevetsky bronevetsky1 at llnl.gov
Mon Feb 16 10:58:16 CST 2009


MPI never places conditions on how the MPI implementation does its 
job. It never says whether MPI uses static or dynamic routing or how 
much performance degrades as a result of the application using a 
particular communication pattern, which depends closely on the 
physical network topology. These are simply issues that are too low 
level for the MPI spec to define although they very much matter to 
application developers. What it does instead is define a set of 
semantics for MPI_Send, MPI_Recv, etc. that apply regardless of all 
those details and accepts the compromise: it's an internally 
self-consistent spec that doesn't fully define all the relevant facts 
about the system. The key thing is that it is a useful compromise. 
Thus, we have two tests for a candidate API: self-consistent 
specification and usefulness of the chosen abstraction level.

The fault notification API makes exactly the same compromise as the 
overall MPI spec. It doesn't say anything about how faults are 
detected since that is a very low-level network matter. However, it 
presents a self-consistent high-level specification that allows 
applications to react to any such errors. Furthermore, it is clearly 
useful. It is a red herring to worry about which low-level events 
will cause which high-level notifications. The only relevant thing is 
the probability of unrecoverable errors. Users do not want 
their applications to abort randomly more often than 
once every few days or weeks. If the rate is higher, those unrecoverable 
failures must be converted into recoverable failures by the MPI 
library and given to the application via the fault notification API. 
This is the entire function of the fault notification API: to allow 
MPI to convert unrecoverable system failures (currently they're all 
unrecoverable) into recoverable failures. This makes it possible for 
customers to buy systems that fail relatively frequently while making 
them usable by making their applications fault tolerant. Thus, the 
fault notification API is both self-consistent and useful, passing 
both tests of the MPI spec.

In contrast, the checkpointing API is useful but not a self-consistent 
API. Its semantics require details (i.e., interactions with the 
checkpointer) that are too low-level to be specified in the MPI spec. 
As a result, it needs additional mechanisms that allow individual MPI 
implementations to provide the information that cannot be detailed in 
the MPI spec.

Thus, these two APIs are not at all similar unless you wish to argue 
that (1) the MPI spec is ill-defined because it doesn't specify the 
network topology or (2) the semantics of being notified of a 
fault are ill-defined. If you wish to argue the latter, I would love 
to see examples because they would need to be fixed before this API 
is ready to go before the forum.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov

At 06:04 AM 2/16/2009, Supalov, Alexander wrote:
>Thanks. I think that since the notification API does not provide any 
>guarantee as to which kinds of faults are treated how, the whole thing 
>becomes a negotiation between the MPI implementation and the 
>underlying networking layers. Moreover, it becomes a negotiation of 
>sorts between the application and the MPI implementation, because 
>the application cannot know up front which faults will be treated in which way.
>
>This, in my mind, is very comparable to, if not worse than, the 
>negotiation between the MPI_prepare_for_checkpoint & 
>MPI_Restart_after_checkpoint implementation on one hand, and the 
>checkpointer involved on the other hand.
>
>Frankly, I don't see any difference here, or, if any, one in favor 
>of the checkpointing interface.
>
>Anyway, thanks for the clarification.
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org 
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
>Sent: Friday, February 13, 2009 7:18 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
>At 10:10 AM 2/13/2009, Supalov, Alexander wrote:
> >Thanks. Could you please clarify to me, if possible, using some
> >practically relevant example, how fault notification for a set of
> >undefined fault types that may vary from MPI implementation to
> >implementation differs from the equally abstract
> >MPI_Checkpoint/MPI_Restart that semantically clearly prepare the MPI
> >implementation at hand for the checkpoint action done by the
> >checkpointing system involved, and then semantically clearly recover
> >the MPI part of the program after the system restore?
>
>Simple. As you've pointed out, the checkpointing API is well defined
>from the application's point of view. However, its semantics are weak
>from the checkpointer's point of view. Seen from this angle, it is
>not clear what the checkpointer can expect from the MPI library and
>the whole thing devolves into a negotiation between individual
>checkpointers and individual MPI libraries on a variety of specific
>system configurations. In contrast, the fault notification API only
>has an application view, which is in fact well-defined. The weakness
>of the fault notification API is what you've already described, that
>it provides no guarantees about the quality of the implementation in
>a way that is more significant than for other portions of MPI, such
>as network details for MPI_Send/MPI_Recv.
>
>Greg Bronevetsky
>Post-Doctoral Researcher
>1028 Building 451
>Lawrence Livermore National Lab
>(925) 424-5756
>bronevetsky1 at llnl.gov
>
>_______________________________________________
>mpi3-ft mailing list
>mpi3-ft at lists.mpi-forum.org
>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
