[Mpi3-ft] The state of MPI is undefined

Darius Buntinas buntinas at mcs.anl.gov
Mon Jun 13 11:36:15 CDT 2011


I'm not sure I like the wording:

  If the MPI implementation can continue operating after process failure
  then it must return an appropriate error class (e.g.,
  MPI_ERR_RANK_FAIL_STOP) and provide the additional semantics defined
  in Chapter 17.

I don't think we can say that "if MPI returns error code X" ==> "MPI state is defined," because if some state-corrupting error occurs, the state of MPI is undefined and the implementation might still return error code X.  Rather, I think we should use the form "if error FOO occurs" ==> "MPI state will be BAR."  So, for example, we can say "if an unrecoverable communication error occurs, the MPI implementation will return an MPI_ERR_COMMUNICATION for all pending communication operations and blah, blah, blah."

A blanket statement we can make about errors, then, would be something like "Unless otherwise noted in the standard, after an error is detected, the state of MPI is undefined."  
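
To make the distinction concrete, here is a toy C model of the two forms. It is not MPI code and does not use mpi.h; every type, constant, and function name in it is made up for illustration. It assumes one condition (an unrecoverable communication error) whose post-error state the standard spells out, and one state-corrupting condition that falls under the blanket rule.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical failure conditions; none of these names come from MPI. */
typedef enum {
    COND_NONE = 0,          /* no error occurred */
    COND_COMM_FAILURE,      /* unrecoverable communication error */
    COND_STATE_CORRUPTION,  /* some state-corrupting error */
    COND_COUNT
} ft_cond_t;

/* Hypothetical error codes an implementation might hand back. */
typedef enum { CODE_SUCCESS = 0, CODE_ERR_COMMUNICATION } ft_code_t;

/* Hypothetical description of the library state after a call. */
typedef enum { STATE_OK, STATE_SPECIFIED, STATE_UNDEFINED } ft_state_t;

/* Conditions whose post-error state the (imagined) standard text spells out.
   Entries not listed default to false, which encodes the blanket rule. */
static const bool state_is_specified[COND_COUNT] = {
    [COND_COMM_FAILURE] = true,
};

/* What the caller sees. In this model a corrupted implementation reports
   the same code as a clean communication failure. */
ft_code_t code_returned(ft_cond_t cond)
{
    return cond == COND_NONE ? CODE_SUCCESS : CODE_ERR_COMMUNICATION;
}

/* What the standard can promise, keyed on the condition that occurred. */
ft_state_t state_after(ft_cond_t cond)
{
    assert((int)cond >= 0 && (int)cond < COND_COUNT);
    if (cond == COND_NONE)
        return STATE_OK;
    return state_is_specified[cond] ? STATE_SPECIFIED : STATE_UNDEFINED;
}
```

In this model code_returned() gives the same value for COND_COMM_FAILURE and COND_STATE_CORRUPTION while state_after() differs, so a caller cannot infer the state from the code alone.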

Another thought: we may get resistance from the Forum about the cost of checking for faults.  Maybe we discussed this already; I can't remember.  Do we want to make fault tolerance (FT) optional, and say "If the MPI implementation supports FT, then the behavior will be blah"?

-d