[Mpi3-ft] Communicator Virtualization as a step forward

Fri Feb 13 12:17:55 CST 2009

At 10:10 AM 2/13/2009, Supalov, Alexander wrote:
>Thanks. Could you please clarify to me, if possible, using some 
>practically relevant example, how fault notification for a set of 
>undefined fault types that may vary from MPI implementation to 
>implementation differs from the equally abstract 
>MPI_Checkpoint/MPI_Restart that semantically clearly prepare the MPI 
>implementation at hand for the checkpoint action done by the 
>checkpointing system involved, and then semantically clearly recover 
>the MPI part of the program after the system restore?

Simple. As you've pointed out, the checkpointing API is well defined 
from the application's point of view. However, its semantics are weak 
from the checkpointer's point of view. Seen from this angle, it is 
not clear what the checkpointer can expect from the MPI library, and 
the whole thing devolves into a negotiation between individual 
checkpointers and individual MPI libraries on a variety of specific 
system configurations. In contrast, the fault notification API only 
has an application view, which is in fact well-defined. The weakness 
of the fault notification API is the one you've already described: it 
provides no guarantees about the quality of the implementation, and 
that gap matters more here than it does for other portions of MPI, 
such as the network details behind MPI_Send/MPI_Recv.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov