[mpi3-ft] Telecon on 2/1/2008
Greg Bronevetsky
bronevetsky1 at llnl.gov
Fri Jan 25 12:05:03 CST 2008
>One topic for the first meeting would be coming up with a clear
>definition/description of "Fault Tolerance" in the context of MPI.
>Obviously, we can refine it over email prior to the confcall.
More specifically, I think we need to break the overall task of
application fault tolerance into application responsibilities, MPI
responsibilities and third-party responsibilities. For example, an
MPI responsibility may be to maintain stability in the face of
non-catastrophic node and link failures (i.e., most nodes are alive
and there is no network partition). An application responsibility may be to
survive the failure of one or more nodes through some
application-specific technique or through checkpointing. A
third-party responsibility may be checkpointing, which may be
included as part of the application or embedded into MPI but will
probably not be specified as part of the MPI spec.
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov