[Mpi3-ft] Communicator Virtualization as a step forward
Greg Bronevetsky
bronevetsky1 at llnl.gov
Fri Feb 13 12:17:55 CST 2009
At 10:10 AM 2/13/2009, Supalov, Alexander wrote:
>Thanks. Could you please clarify to me, if possible, using some
>practically relevant example, how fault notification for a set of
>undefined fault types that may vary from MPI implementation to
>implementation differs from the equally abstract
>MPI_Checkpoint/MPI_Restart that semantically clearly prepare the MPI
>implementation at hand for the checkpoint action done by the
>checkpointing system involved, and then semantically clearly recover
>the MPI part of the program after the system restore?
Simple. As you've pointed out, the checkpointing API is well defined
from the application's point of view. However, its semantics are weak
from the checkpointer's point of view: it is not clear what the
checkpointer can expect from the MPI library, so the whole thing
devolves into a negotiation between individual checkpointers and
individual MPI libraries on a variety of specific system
configurations. In contrast, the fault notification API has only an
application view, and that view is well defined. Its weakness is the
one you've already described: it provides no guarantees about the
quality of the implementation, and that gap matters more here than it
does for other portions of MPI, such as the network details behind
MPI_Send/MPI_Recv.
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov