[mpiwg-ft] Aborting When Error Handlers are Non-Uniform

Bland, Wesley wesley.bland at intel.com
Thu Feb 22 11:32:27 CST 2018


Jim Dinan pointed out an important use case for error handlers that we're not solving in our current proposals (#1 & #3) that we might want to consider:

Because error handlers are not set uniformly on all processes in a communicator, it's possible that some processes set MPI_ERRORS_ARE_FATAL while others set MPI_ERRORS_RETURN. This would allow a process that has set FATAL to kill a process that has set RETURN. The place where this might be particularly bad is a connect/accept app where the client has the default FATAL handler and the server has set RETURN: a bad client can kill a good server.

A solution to this would be to say that the error handlers signal to all connected processes (or to the processes in the communicator, in the case of MPI_ERRORS_ABORT) that they want to abort, but that each MPI process consults its own error handler before deciding whether to abort itself.
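
To make the difference concrete, here is a rough sketch in plain C (no MPI calls) that models the two behaviors: an abort that takes down every connected process, and an abort signal that each process checks against its own handler. All of the names here (handler_t, proc_t, abort_unconditional, abort_consulting_handlers) are made up for illustration and are not part of MPI or of either proposal; this is a toy model of the semantics, not an implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for MPI_ERRORS_ARE_FATAL and MPI_ERRORS_RETURN. */
typedef enum { HANDLER_FATAL, HANDLER_RETURN } handler_t;

/* Hypothetical model of one connected process. */
typedef struct {
    handler_t handler; /* error handler this process set locally */
    bool alive;        /* false once the process has aborted */
} proc_t;

/* Behavior the concern is about: if the originating process has a fatal
 * handler, every connected process is taken down, whatever it set itself. */
void abort_unconditional(proc_t *procs, size_t n, size_t origin)
{
    assert(origin < n);
    if (procs[origin].handler != HANDLER_FATAL)
        return; /* the origin just gets an error code back; nothing is signaled */
    for (size_t i = 0; i < n; i++)
        procs[i].alive = false;
}

/* Suggested behavior: the abort request reaches every connected process,
 * but each one consults its own handler before deciding to abort. */
void abort_consulting_handlers(proc_t *procs, size_t n, size_t origin)
{
    assert(origin < n);
    if (procs[origin].handler != HANDLER_FATAL)
        return;
    for (size_t i = 0; i < n; i++) {
        if (procs[i].handler == HANDLER_FATAL)
            procs[i].alive = false;
        /* HANDLER_RETURN: the process stays alive; in a real library it
         * would presumably see an error code on the affected communicator. */
    }
}
```

With a client at index 0 that has the fatal handler and a server at index 1 that has set the returning handler, abort_unconditional leaves both dead, while abort_consulting_handlers takes down only the client.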

The current spec seems a bit fuzzy on this, but I believe our proposal actively rules out this per-process behavior by couching the definition of these error handlers in the definition of MPI_ABORT. We could solve this by changing the specification of these error handlers to be something more like what I have above. This, of course, would mean that we'd withdraw the proposal from a vote next week if we wanted to go that route.

