[Mpi3-ft] Distinguishing errors from failures

Tue Jul 16 15:23:06 CDT 2013

Hi FT WG,

I am doing my best to socialize the FT proposal at Intel and gathered a
piece of feedback to bring back to the WG.

There was a concern that any time the user registers an error handler,
fault tolerance could be "switched on" because MPI_Comm_set_errhandler()
does not distinguish between error classes.  The assumption was that, when
switched on, there would be space/time costs associated with fault
tolerance.  How does the current proposal determine when fault tolerance
should be enabled?

One suggested mechanism was to add a function, MPI_Comm_set_faulthandler(),
that allows the programmer to distinguish between errors and failures.
This would allow the runtime to determine when fault tolerance was
desired.  As I understand it, the current approach relies on the
implementation switching fault tolerance on or off when the job is launched.
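
To make the suggestion concrete, here is a minimal sketch in C of how a
separate registration call could give the runtime that signal.  Everything
prefixed toy_ is a hypothetical stand-in written for illustration, not MPI
API, and MPI_Comm_set_faulthandler() is only the suggested name; it is not
part of MPI or, as far as I know, of the current proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handler signature; real MPI handlers take different arguments. */
typedef void (*toy_handler_fn)(int error_code);

/* Toy stand-in for a communicator; NOT a real MPI type. */
typedef struct {
    toy_handler_fn error_handler;  /* ordinary errors, e.g. bad arguments */
    toy_handler_fn fault_handler;  /* process failures only */
} toy_comm;

/* Models MPI_Comm_set_errhandler(): registering it says nothing about
 * whether the application wants fault tolerance. */
void toy_comm_set_errhandler(toy_comm *comm, toy_handler_fn fn) {
    assert(comm != NULL);
    comm->error_handler = fn;
}

/* Models the suggested (hypothetical) MPI_Comm_set_faulthandler(). */
void toy_comm_set_faulthandler(toy_comm *comm, toy_handler_fn fn) {
    assert(comm != NULL);
    comm->fault_handler = fn;
}

/* Assumed policy: the runtime pays fault-tolerance costs only when a
 * fault handler has been registered on the communicator. */
bool toy_comm_ft_enabled(const toy_comm *comm) {
    assert(comm != NULL);
    return comm->fault_handler != NULL;
}

/* Example handler that ignores the code it is given. */
void toy_ignore(int error_code) {
    (void)error_code;
}
```

In this model, calling toy_comm_set_errhandler() alone leaves
toy_comm_ft_enabled() false, so an application that only wants ordinary
error reporting would not incur the cost.  A later call to
toy_comm_set_faulthandler() flips it to true.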

 ~Jim.