[Mpi3-ft] FT Levels of Support
schulzm at llnl.gov
Wed Oct 29 17:39:26 CDT 2008
Aren't the second (MPI_ERRORS_FAIL_ATOMIC) and third (MPI_ERRORS_INTERACTIVE)
options pretty much the same? The extra APIs that you can call in the third
case need to be there in all cases; otherwise, a code that might want to use
them when the MPI implements the third option would not link. Under the first
two options these APIs would return "not implemented", i.e., they return an
error, which is the same behavior already required by option 2.
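
To make the linking argument concrete, here is a minimal sketch in C of an
API that is always present but only does real work at the third level. All
names in it (ft_level_t, ft_library_t, ft_get_failure_info,
FT_ERR_NOT_IMPLEMENTED) are hypothetical and invented for illustration; none
of them are part of MPI or of any proposal in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical support levels, loosely following the three tiers
 * discussed in this thread. These are NOT real MPI constants. */
typedef enum {
    FT_LEVEL_BASIC = 1,
    FT_LEVEL_FAIL_ATOMIC = 2,
    FT_LEVEL_INTERACTIVE = 3
} ft_level_t;

/* Hypothetical return codes, not real MPI error classes. */
enum { FT_SUCCESS = 0, FT_ERR_NOT_IMPLEMENTED = 1 };

/* Toy stand-in for the state of an MPI library. */
typedef struct {
    ft_level_t level;       /* level of FT support this library provides */
    int last_failed_rank;   /* placeholder for real failure bookkeeping */
} ft_library_t;

/* Toy stand-in for the extra information the third level could expose. */
typedef struct {
    int failed_rank;
} ft_failure_info_t;

/* The symbol exists at every level, so applications always link.
 * Below the interactive level it returns an error and leaves *out
 * untouched, which matches the "fail without changing state" rule
 * of the second level. */
int ft_get_failure_info(const ft_library_t *lib, ft_failure_info_t *out)
{
    assert(lib != NULL);
    assert(out != NULL);

    if (lib->level < FT_LEVEL_INTERACTIVE) {
        return FT_ERR_NOT_IMPLEMENTED;
    }

    out->failed_rank = lib->last_failed_rank;
    return FT_SUCCESS;
}
```

An application would call ft_get_failure_info() unconditionally and fall
back to its own recovery path whenever it gets FT_ERR_NOT_IMPLEMENTED. From
the application's point of view that is just one more call that failed
cleanly, which is why the second and third options look so similar to me.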
In addition, I don't think it is a good idea to put things like the
piggybacking in such an optional set of functionality. Piggybacking
is a fundamental set of calls that will be used beyond just providing
FT. Hence, it should also be implemented even if the MPI does not
provide advanced FT mechanisms.
At 12:36 PM 10/27/2008, Greg Bronevetsky wrote:
>> - How does an application signal to the MPI implementation that it
>>wishes to enable the fault tolerance features of MPI (vs.
>>MPI_ERRORS_RETURN or MPI_ERRORS_FATAL)
>> - What is the state of MPI after a process failure with FT enabled?
>> - What is the state of communicators?
>These are the major foundational questions and ones that Bronis de
>Supinski has been poking me about. We need to work out a simple
>tiered set of levels of support where the basic level is sane, the
>medium level is functional and the highest level allows for complex
>services like the ones we've been talking about.
> 1. If an MPI error happens, MPI is required to return an error
>    to the application but the outcome of subsequent calls to MPI,
>    including MPI_Finalize, is undefined.
>
> 2. MPI_ERRORS_FAIL_ATOMIC: MPI returns an error to the application
>    on any internal error. Subsequent calls to each MPI routine will
>    either succeed or fail. If they succeed, the effect is as in a
>    normal execution. If they fail, the state of the non-failed
>    portions of the application and the MPI library is not changed.
>    In other words, the application may keep using the still
>    functional portions of MPI and if it tries to do something that
>    doesn't work anymore, it only gets an error.
>
> 3. MPI_ERRORS_INTERACTIVE: Same as above, except that in addition to
>    getting an error, the application can interact with MPI via some
>    kind of interface (events or something else) to get more info
>    about what happened and how to avoid running into it in the
>    future. This level of support will provide the application with
>    additional APIs that will help it deal with the failure,
>    including communicator management and piggybacking. We may want
>    this level to provide additional flags that inform the
>    application about which specific functionality is being provided
>    if we think that providing all the various support APIs at the
>    same time will be too much of a pain for library vendors.
>1028 Building 451
>Lawrence Livermore National Lab
>bronevetsky1 at llnl.gov
Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulz6
CASC @ Lawrence Livermore National Laboratory, Livermore, USA