[Mpi3-ft] Communicator Virtualization as a step forward

Supalov, Alexander alexander.supalov at intel.com
Mon Feb 16 08:04:07 CST 2009


Thanks. I think that since the notification API does not provide any guarantee as to which kinds of faults are treated in which way, the whole thing becomes a negotiation between the MPI implementation and the underlying networking layers. Moreover, it becomes a negotiation of sorts between the application and the MPI implementation, because the application cannot know up front which faults will be treated in which way.

This is, in my mind, very comparable to, if not worse than, the negotiation between the MPI_prepare_for_checkpoint & MPI_Restart_after_checkpoint implementation on one hand, and the checkpointer involved on the other hand.

Frankly, I don't see any difference here; if there is one, it is in favor of the checkpointing interface.

Anyway, thanks for the clarification.

-----Original Message-----
From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Greg Bronevetsky
Sent: Friday, February 13, 2009 7:18 PM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Communicator Virtualization as a step forward

At 10:10 AM 2/13/2009, Supalov, Alexander wrote:
>Thanks. Could you please clarify to me, if possible, using some
>practically relevant example, how fault notification for a set of
>undefined fault types that may vary from MPI implementation to
>implementation differs from the equally abstract
>MPI_Checkpoint/MPI_Restart that semantically clearly prepare the MPI
>implementation at hand for the checkpoint action done by the
>checkpointing system involved, and then semantically clearly recover
>the MPI part of the program after the system restore?

Simple. As you've pointed out, the checkpointing API is well defined
from the application's point of view. However, its semantics are weak
from the checkpointer's point of view. Seen from this angle, it is
not clear what the checkpointer can expect from the MPI library and
the whole thing devolves into a negotiation between individual
checkpointers and individual MPI libraries on a variety of specific
system configurations. In contrast, the fault notification API only
has an application view, which is in fact well-defined. The weakness
of the fault notification API is what you've already described, that
it provides no guarantees about the quality of the implementation in
a way that is more significant than for other portions of MPI, such
as network details for MPI_Send/MPI_Recv.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov
