[Mpi3-ft] FT Levels of Support

Greg Bronevetsky bronevetsky1 at llnl.gov
Mon Oct 27 14:36:05 CDT 2008

>  - How does an application signal to the MPI implementation that it
>wishes to enable the fault tolerance features of MPI (vs.
>  - What is the state of MPI after a process failure with FT enabled?
>  - What is the state of communicators?
These are the major foundational questions and ones that Bronis de 
Supinski has been poking me about. We need to work out a simple 
tiered set of levels of support where the basic level is sane, the 
medium level is functional and the highest level allows for complex 
services like the ones we've been talking about.

         Basic level: If an MPI error happens, MPI is required to return 
an error to the application, but the outcome of subsequent calls to 
MPI, including MPI_Finalize, is undefined.

         Medium level: MPI returns an error to the application on any 
internal error. Each subsequent call to an MPI routine will either 
succeed or fail. If it succeeds, the effect is the same as in a normal 
execution. If it fails, the state of the non-failed portions of the 
application and the MPI library is not changed. In other words, the 
application may keep using the still-functional portions of MPI, and 
if it tries to do something that no longer works, it only gets an error.
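
To make the "succeed, or fail without changing state" contract concrete, here is a minimal sketch in plain C. It is a toy model, not MPI: the toy_comm structure, the toy_* functions and the TOY_* status codes are all hypothetical names invented for illustration, and a real implementation would have far more state to protect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_RANKS 8

/* Hypothetical status codes; not taken from any MPI header. */
enum toy_status { TOY_SUCCESS = 0, TOY_ERR_PROC_FAILED = 1 };

/* Hypothetical stand-in for a communicator: which peers are alive and
   how many messages have been delivered to each of them. */
struct toy_comm {
    int  size;
    bool alive[TOY_MAX_RANKS];
    int  delivered[TOY_MAX_RANKS];
};

void toy_comm_init(struct toy_comm *c, int size)
{
    assert(c != NULL && size > 0 && size <= TOY_MAX_RANKS);
    c->size = size;
    for (int r = 0; r < size; r++) {
        c->alive[r] = true;
        c->delivered[r] = 0;
    }
}

/* Simulate the failure of one peer. */
void toy_kill(struct toy_comm *c, int rank)
{
    assert(c != NULL && rank >= 0 && rank < c->size);
    c->alive[rank] = false;
}

/* A send either succeeds with its normal effect, or fails and leaves
   every field of the communicator exactly as it was. */
enum toy_status toy_send(struct toy_comm *c, int dest)
{
    assert(c != NULL && dest >= 0 && dest < c->size);
    if (!c->alive[dest])
        return TOY_ERR_PROC_FAILED;   /* nothing was modified */
    c->delivered[dest]++;
    return TOY_SUCCESS;
}
```

In this model, after toy_kill(&c, 2) a toy_send to rank 2 returns TOY_ERR_PROC_FAILED, while sends to the surviving ranks keep working as before.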

         Highest level: Same as the medium level, except that in 
addition to getting an error, the application can interact with MPI 
via some kind of interface (events or something else) to get more 
information about what happened and how to avoid running into it in 
the future. This level of support will provide the application with 
additional APIs that help it deal with the failure, including 
communicator management and piggybacking. We may want this level to 
provide additional flags that tell the application which specific 
functionality is being provided, if we think that requiring all of the 
support APIs at once will be too much of a pain for library vendors.
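
One way the per-feature flags could look is a simple bitmask that the application checks before relying on an optional API. The sketch below is plain C and purely illustrative: the TOY_FT_* flag names and both helper functions are hypothetical, and nothing here is an existing or proposed MPI interface.

```c
#include <stdbool.h>

/* Hypothetical feature flags, one bit per optional support API. */
enum toy_ft_feature {
    TOY_FT_EVENTS    = 1 << 0,  /* event/query interface for failure info */
    TOY_FT_COMM_MGMT = 1 << 1,  /* communicator management after a failure */
    TOY_FT_PIGGYBACK = 1 << 2   /* piggybacking data on messages */
};

/* True if every feature in `wanted` is present in `provided`. */
bool toy_ft_has(unsigned provided, unsigned wanted)
{
    return (provided & wanted) == wanted;
}

/* The subset of `wanted` that the library does not provide. */
unsigned toy_ft_missing(unsigned provided, unsigned wanted)
{
    return wanted & ~provided;
}
```

An application that needs events and communicator management would test toy_ft_has(provided, TOY_FT_EVENTS | TOY_FT_COMM_MGMT) once at startup and fall back to simpler behavior if toy_ft_missing reports a gap.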

Greg Bronevetsky
Post-Doctoral Researcher
1028 Building 451
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky1 at llnl.gov 
