[Mpi3-ft] FT Levels of Support
Greg Bronevetsky
bronevetsky1 at llnl.gov
Mon Oct 27 14:36:05 CDT 2008
> - How does an application signal to the MPI implementation that it
>wishes to enable the fault tolerance features of MPI (vs.
>MPI_ERRORS_RETURN or MPI_ERRORS_FATAL)
> - What is the state of MPI after a process failure with FT enabled?
> - What is the state of communicators?
These are the major foundational questions, and ones that Bronis de
Supinski has been poking me about. We need to work out a simple
tiered set of levels of support, where the basic level is sane, the
medium level is functional, and the highest level allows for complex
services like the ones we've been talking about.
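
To make the tiering concrete, here is a rough sketch of how an
application might ask for a level and learn what it actually got,
loosely modeled on the required/provided pattern of MPI_Init_thread.
The ft_level enum and ft_negotiate are hypothetical names of my own
for illustration; nothing like them exists in MPI today.

```c
#include <assert.h>

/* Hypothetical ordering of the three proposed levels, weakest first. */
enum ft_level {
    FT_LEVEL_FATAL       = 0,  /* error returned, later calls undefined */
    FT_LEVEL_FAIL_ATOMIC = 1,  /* failed calls leave state unchanged    */
    FT_LEVEL_INTERACTIVE = 2   /* fail-atomic plus extra recovery APIs  */
};

/* Grant the lower of what the application requires and what the
   implementation supports. Illustrative only, not a real MPI call. */
static enum ft_level ft_negotiate(enum ft_level required,
                                  enum ft_level supported)
{
    assert(required >= FT_LEVEL_FATAL && required <= FT_LEVEL_INTERACTIVE);
    assert(supported >= FT_LEVEL_FATAL && supported <= FT_LEVEL_INTERACTIVE);
    return required < supported ? required : supported;
}
```

An application that needs at least fail-atomic behavior would compare
the granted level against FT_LEVEL_FAIL_ATOMIC and fall back to
something like checkpoint/restart if it got less.
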
MPI_ERRORS_FATAL
If an MPI error happens, MPI is required to return an error
to the application, but the outcome of subsequent calls to MPI,
including MPI_Finalize, is undefined.
MPI_ERRORS_FAIL_ATOMIC
MPI returns an error to the application on any internal
error. Each subsequent call to an MPI routine will either succeed or
fail. If it succeeds, the effect is as in a normal execution. If
it fails, the state of the non-failed portions of the application
and the MPI library is not changed. In other words, the application
may keep using the still-functional portions of MPI, and if it tries
to do something that no longer works, it simply gets an error.
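
As a toy illustration of the fail-atomic property (not real MPI code),
the sketch below models a library whose send either takes full effect
or returns an error and touches nothing. The toy_lib structure, the
toy_* functions and the error codes are all made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NPROCS 4

enum toy_status { TOY_SUCCESS = 0, TOY_ERR_RANK = 1, TOY_ERR_PROC_FAILED = 2 };

/* Stand-in for library state: which peers are alive and how many
   messages have been delivered to each. Not a real MPI structure. */
struct toy_lib {
    bool alive[TOY_NPROCS];
    int  delivered[TOY_NPROCS];
};

static void toy_init(struct toy_lib *lib)
{
    for (int i = 0; i < TOY_NPROCS; i++) {
        lib->alive[i] = true;
        lib->delivered[i] = 0;
    }
}

/* Simulate the failure of one process. */
static void toy_kill(struct toy_lib *lib, int rank)
{
    assert(rank >= 0 && rank < TOY_NPROCS);
    lib->alive[rank] = false;
}

/* Fail-atomic send: every check happens before any state is modified,
   so an error return leaves the library exactly as it was. */
static enum toy_status toy_send(struct toy_lib *lib, int dest)
{
    if (dest < 0 || dest >= TOY_NPROCS)
        return TOY_ERR_RANK;
    if (!lib->alive[dest])
        return TOY_ERR_PROC_FAILED;
    lib->delivered[dest]++;   /* the only mutation, on the success path */
    return TOY_SUCCESS;
}
```

After toy_kill(&lib, 2), a send to rank 2 returns TOY_ERR_PROC_FAILED
while sends to the surviving ranks keep working as before, which is
the behavior this level is meant to guarantee.
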
MPI_ERRORS_INTERACTIVE
Same as above, except that in addition to getting an error,
the application can interact with MPI via some kind of interface
(events or something else) to get more info about what happened and
how to avoid running into it in the future. This level of support
will provide the application with additional APIs that will help it
deal with the failure, including communicator management and
piggybacking. If we think that providing all the various support APIs
at the same time will be too much of a pain for library vendors, we
may want this level to provide additional flags that inform the
application about which specific functionality is being provided.
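
The flags idea could look something like the bitmask sketch below,
where the implementation reports which optional pieces it provides and
the application checks for what it needs. The three FT_FEATURE_* names
are my guesses based on the APIs mentioned above (events, communicator
management, piggybacking); they are hypothetical, as are the helper
functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical feature bits for the interactive level. */
enum ft_feature {
    FT_FEATURE_EVENTS    = 1 << 0,  /* failure event/notification interface */
    FT_FEATURE_COMM_MGMT = 1 << 1,  /* communicator management after faults */
    FT_FEATURE_PIGGYBACK = 1 << 2,  /* piggybacking data on messages        */
    FT_FEATURE_ALL       = (1 << 3) - 1
};

/* True if every bit in 'required' is also set in 'provided'. */
static bool ft_has_all(unsigned provided, unsigned required)
{
    return (provided & required) == required;
}

/* Return the subset of 'required' that the implementation lacks. */
static unsigned ft_missing(unsigned provided, unsigned required)
{
    /* Reject bits that are not defined feature flags. */
    assert((required & ~(unsigned)FT_FEATURE_ALL) == 0);
    return required & ~provided;
}
```

An application would test ft_has_all once at startup and either use
the richer APIs or drop back to plain fail-atomic handling, so vendors
could ship the optional pieces incrementally.
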
Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov