[Mpi3-ft] Distinguishing errors from failures
james.dinan at gmail.com
Tue Jul 16 15:23:06 CDT 2013
Hi FT WG,
I am doing my best to socialize the FT proposal at Intel and have gathered
a piece of feedback to bring back to the WG.
There was a concern that any time the user registers an error handler,
fault tolerance could be "switched on" because MPI_Comm_set_errhandler()
does not distinguish between error classes. The assumption was that fault
tolerance, once switched on, would carry space/time costs. How does the
current proposal determine when fault tolerance
should be enabled?
One suggested mechanism was to add a function, MPI_Comm_set_faulthandler(),
that allows the programmer to distinguish between errors and failures.
This would allow the runtime to determine when fault tolerance is
desired. I think the current approach relies on the implementation
switching fault tolerance on or off when the job is launched.
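
To make the suggestion concrete, here is a rough toy model in plain C of how a separate fault-handler registration could act as the opt-in signal. It is only a sketch under stated assumptions: it does not use MPI at all, every name in it (toy_comm, toy_comm_set_errhandler, toy_comm_set_faulthandler, toy_comm_raise, TOY_ERR_PROC_FAILED, ft_enabled) is hypothetical, and it says nothing about how a real implementation would enable or pay for fault tolerance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names are MPI API. */
typedef enum {
    TOY_ERR_ARG,          /* an ordinary error, e.g. a bad argument */
    TOY_ERR_PROC_FAILED   /* a process failure */
} toy_errclass;

typedef void (*toy_handler)(toy_errclass cls, int *counter);

typedef struct {
    toy_handler errhandler;    /* called for ordinary errors */
    toy_handler faulthandler;  /* called for process failures only */
    bool ft_enabled;           /* assumption: stands in for "runtime pays FT costs" */
} toy_comm;

void toy_comm_init(toy_comm *c)
{
    assert(c != NULL);
    c->errhandler = NULL;
    c->faulthandler = NULL;
    c->ft_enabled = false;
}

/* Registering an error handler alone does not switch fault tolerance on. */
void toy_comm_set_errhandler(toy_comm *c, toy_handler h)
{
    assert(c != NULL);
    c->errhandler = h;
}

/* Registering a fault handler is the explicit opt-in. */
void toy_comm_set_faulthandler(toy_comm *c, toy_handler h)
{
    assert(c != NULL);
    c->faulthandler = h;
    c->ft_enabled = (h != NULL);
}

/* Route a condition to the matching handler.
 * Returns true if a handler was registered and called. */
bool toy_comm_raise(toy_comm *c, toy_errclass cls, int *counter)
{
    assert(c != NULL);
    toy_handler h = (cls == TOY_ERR_PROC_FAILED) ? c->faulthandler
                                                 : c->errhandler;
    if (h == NULL) {
        return false;
    }
    h(cls, counter);
    return true;
}

/* Example handler: counts how many times it was invoked. */
void toy_count_call(toy_errclass cls, int *counter)
{
    (void)cls;
    ++*counter;
}
```

In this model, a program that only calls toy_comm_set_errhandler() leaves ft_enabled false, so the runtime would have no reason to pay fault-tolerance costs. Only a call to toy_comm_set_faulthandler() flips it.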