[Mpi3-ft] Defining the state of MPI after an error

Wed Sep 22 12:42:49 CDT 2010

On Sep 22, 2010, at 12:04 PM, Terry Dontje wrote:

> Darius Buntinas wrote:
> OK, I need a clarification here because I feel that I might be misinterpreting something.  So is the CANNOT_CONTINUE error class only returned by MPI after a previous error condition has been returned that has caused problems?  For example, let's say we did an MPI_Bcast that resulted in a return of MPI_ERR_OP and, for whatever reason, the MPI library is borked.  Would the next call to MPI then return the CANNOT_CONTINUE error class?

Yes, I believe that's the behavior we talked about at the forum.

> So is this an escape hatch for an implementation that does not support any type of fault tolerance to explicitly notify the user that they shouldn't proceed any further?  I really wonder how many implementations will do so.
> 

Well, I wouldn't say it's an escape hatch, since an implementation that doesn't support any fault tolerance needn't ever return CANNOT_CONTINUE.  Because we still haven't defined what happens after an error, operation after an error is still undefined, so the implementation is still free to do anything, including returning MPI_SUCCESS.

However, a high-quality implementation would return CANNOT_CONTINUE on subsequent MPI calls when it knows that something is borked beyond repair.

Note that there's still a lot of middle ground between "no errors" and "totally borked".  We're giving the implementation an error to return if it finds itself totally borked.  This is why I agree with Josh's statement that this proposal is a minor change.
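
To make the intended behavior concrete, here is a minimal sketch in plain C that simulates it without using MPI at all.  Everything in it is hypothetical and chosen only for illustration: the toy_lib state, the toy_call function, and the TOY_* error classes stand in for an MPI implementation, an MPI call, and error classes such as MPI_ERR_OP and the proposed CANNOT_CONTINUE.  It is not how any real implementation does this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error classes for this sketch; these are not MPI constants. */
enum toy_err {
    TOY_SUCCESS = 0,
    TOY_ERR_OP,              /* stands in for an ordinary error such as MPI_ERR_OP */
    TOY_ERR_CANNOT_CONTINUE  /* stands in for the proposed CANNOT_CONTINUE class */
};

/* Toy library state: 'broken' is set once an error has left the
 * library beyond repair. */
struct toy_lib {
    bool broken;
};

/* Simulated library call.
 * 'fail'  injects an error into this call.
 * 'fatal' says whether that error leaves the library unusable.
 * A call made after a fatal error reports TOY_ERR_CANNOT_CONTINUE. */
enum toy_err toy_call(struct toy_lib *lib, bool fail, bool fatal)
{
    assert(lib != NULL);

    if (lib->broken)
        return TOY_ERR_CANNOT_CONTINUE;

    if (fail) {
        if (fatal)
            lib->broken = true;   /* remember that we cannot recover */
        return TOY_ERR_OP;        /* the failing call reports the original error */
    }

    return TOY_SUCCESS;
}
```

In this sketch, a call that fails with fatal set returns TOY_ERR_OP itself, and only the calls after it return TOY_ERR_CANNOT_CONTINUE.  A failure with fatal cleared is the middle ground: the error is reported, but later calls can still succeed.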

-d