[mpi3-ft] Telecon on 2/1/2008

Greg Bronevetsky bronevetsky1 at llnl.gov
Fri Jan 25 12:05:03 CST 2008


>One topic for the first meeting would be coming up with a clear 
>definition/description of "Fault Tolerance" in the context of MPI. 
>Obviously, we can refine it over email prior to the confcall.

More specifically, I think we need to break the overall task of 
application fault tolerance into application responsibilities, MPI 
responsibilities, and third-party responsibilities. For example, an 
MPI responsibility may be to maintain stability in the face of 
non-catastrophic node and link failures (i.e., most nodes are alive 
and the network is not partitioned). An application responsibility may be to 
survive the failure of one or more nodes through some 
application-specific technique or through checkpointing. A 
third-party responsibility may be checkpointing, which may be 
included as part of the application or embedded into MPI but will 
probably not be specified as part of the MPI spec.

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov 


