[Mpi3-ft] Communicator Virtualization as a step forward
Greg Bronevetsky
bronevetsky1 at llnl.gov
Mon Feb 16 10:58:16 CST 2009
MPI never places conditions on how the MPI implementation does its
job. It never says whether MPI uses static or dynamic routing or how
much performance degrades as a result of the application using a
particular communication pattern, which depends closely on the
physical network topology. These issues are simply too low-level for
the MPI spec to define, although they matter very much to application
developers. What the spec does instead is define a set of semantics for
MPI_Send, MPI_Recv, etc. that apply regardless of all those details,
and accept the compromise: it is an internally self-consistent spec
that doesn't fully define all the relevant facts about the system. The
key thing is that it is a useful compromise.
Thus, we have two tests for a candidate API: self-consistent
specification and usefulness of the chosen abstraction level.
The fault notification API makes exactly the same compromise as the
overall MPI spec. It doesn't say anything about how faults are
detected since that is a very low-level network matter. However, it
presents a self-consistent high-level specification that allows
applications to react to any such errors. Furthermore, it is clearly
useful. It is a red herring to worry about which low-level events
will cause which high-level notifications. The only relevant thing is
the probability of unrecoverable errors. Developers do not want
their applications to abort randomly more often than once every few
days or weeks. If aborts are more frequent than that, those
unrecoverable failures must be converted into recoverable failures by
the MPI library and reported to the application via the fault
notification API.
This is the entire function of the fault notification API: to allow
MPI to convert unrecoverable system failures (currently they're all
unrecoverable) into recoverable failures. This makes it possible for
customers to buy systems that fail relatively frequently while making
them usable by making their applications fault tolerant. Thus, the
fault notification API is both self-consistent and useful, passing
both tests of the MPI spec.
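To make the conversion concrete, here is a toy model in Python. It is
only a sketch: ToyLibrary, set_fault_handler, UnrecoverableFailure and
the low_level_fault argument are hypothetical names made up for
illustration and are not part of MPI or of the proposed API. It shows
that the application sees the same high-level notification (which rank
was affected) regardless of the low-level cause, and that without a
registered handler the same fault is unrecoverable.

```python
# Toy model only: none of these names belong to MPI or to the proposed API.

class UnrecoverableFailure(Exception):
    """Stands in for the whole job aborting."""


class ToyLibrary:
    """Hypothetical stand-in for an MPI library with fault notification."""

    def __init__(self):
        self._handler = None

    def set_fault_handler(self, handler):
        # The application registers a callback that receives the failed rank.
        self._handler = handler

    def send(self, dest, payload, low_level_fault=None):
        # low_level_fault simulates whatever the network layer detected.
        if low_level_fault is None:
            return "delivered"
        if self._handler is None:
            # No notification path: the failure is unrecoverable.
            raise UnrecoverableFailure(f"abort: {low_level_fault}")
        # The application is told only which rank is affected,
        # not which low-level event caused the problem.
        self._handler(dest)
        return "failed"


def run(lib):
    """Send three messages, two of which hit different low-level faults."""
    lost = []
    lib.set_fault_handler(lost.append)
    results = [
        lib.send(1, "a"),
        lib.send(2, "b", low_level_fault="link timeout"),
        lib.send(3, "c", low_level_fault="node crash"),
    ]
    return results, lost
```

Calling run(ToyLibrary()) records ranks 2 and 3 as lost while the
program keeps running. Calling send with a simulated fault before any
handler is registered raises UnrecoverableFailure instead.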
In contrast, the checkpointing API is useful but not self-consistent.
Its semantics depend on details (i.e., interactions with the
checkpointer) that are too low-level to be specified in the MPI spec.
As a result, it needs additional mechanisms that allow individual MPI
implementations to provide the information that cannot be detailed in
the MPI spec.
Thus, these two APIs are not at all similar unless you wish to argue
that (1) the MPI spec is ill-defined because it doesn't specify the
network topology or that (2) the semantics of being notified of a
fault are ill-defined. If you wish to argue the latter, I would love
to see examples because they would need to be fixed before this API
is ready to go before the forum.
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov
At 06:04 AM 2/16/2009, Supalov, Alexander wrote:
>Thanks. I think that since the notification API does not provide any
>guarantee as to what kind of faults is treated how, the whole thing
>becomes a negotiation between the MPI implementation and the
>underlying networking layers. Moreover, it becomes a negotiation of
>sorts between the application and the MPI implementation, because
>the application cannot know upfront what faults will be treated what way.
>
>This, in my mind, is very comparable to, if not worse than, the
>negotiation between the MPI_prepare_for_checkpoint &
>MPI_Restart_after_chekpoint implementation on one hand, and the
>checkpointer involved on the other hand.
>
>Frankly, I don't see any difference here, or, if any, one in favor
>of the checkpointing interface.
>
>Anyway, thanks for clarification.
>
>-----Original Message-----
>From: mpi3-ft-bounces at lists.mpi-forum.org
>[mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
>Sent: Friday, February 13, 2009 7:18 PM
>To: MPI 3.0 Fault Tolerance and Dynamic Process Control working
>Group; MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward
>
>At 10:10 AM 2/13/2009, Supalov, Alexander wrote:
> >Thanks. Could you please clarify to me, if possible, using some
> >practically relevant example, how fault notification for a set of
> >undefined fault types that may vary from MPI implementation to
> >implementation differs from the equally abstract
> >MPI_Checkpoint/MPI_Restart that semantically clearly prepare the MPI
> >implementation at hand for the checkpoint action done by the
> >checkpointing system involved, and then semantically clearly recover
> >the MPI part of the program after the system restore?
>
>Simple. As you've pointed out, the checkpointing API is well defined
>from the application's point of view. However, its semantics are weak
>from the checkpointer's point of view. Seen from this angle, it is
>not clear what the checkpointer can expect from the MPI library and
>the whole thing devolves into a negotiation between individual
>checkpointers and individual MPI libraries on a variety of specific
>system configurations. In contrast, the fault notification API only
>has an application view, which is in fact well-defined. The weakness
>of the fault notification API is what you've already described, that
>it provides no guarantees about the quality of the implementation in
>a way that is more significant than for other portions of MPI, such
>as network details for MPI_Send/MPI_Recv.
>
>_______________________________________________
>mpi3-ft mailing list
>mpi3-ft at lists.mpi-forum.org
>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft