[Mpi3-ft] Defining the state of MPI after an error
buntinas at mcs.anl.gov
Wed Sep 22 12:42:49 CDT 2010
On Sep 22, 2010, at 12:04 PM, Terry Dontje wrote:
> Darius Buntinas wrote:
> Ok I need a clarification here because I feel that I might be misinterpreting something. So is the CANNOT_CONTINUE error class only returned by MPI after a previous error condition has been returned that has caused problems? For example let's say we did an MPI_Bcast that resulted in a return of MPI_ERR_OP and for whatever reason the MPI library is borked. So the next call to MPI would return the CANNOT_CONTINUE error class?
Yes, I believe that's the behavior we talked about at the forum.
> So is this an escape hatch for an implementation that does not support any type of fault tolerance to explicitly notify the user they shouldn't proceed any further? I really wonder how many implementations will do such.
Well, I wouldn't say it's an escape hatch, since an implementation that doesn't support any fault tolerance needn't ever return CANNOT_CONTINUE. Because we still haven't defined what happens after an error, operation after an error is still undefined, and the implementation is still free to do anything, including returning MPI_SUCCESS.
However, a high-quality implementation would return CANNOT_CONTINUE on subsequent MPI calls when it knows that something is borked beyond repair.
Note that there's still a lot of middle ground between "no errors" and "totally borked". We're giving the implementation an error to return if it finds itself totally borked. This is why I agree with Josh's statement that this proposal is a minor change.