[Mpi3-ft] Defining the state of MPI after an error

Terry Dontje terry.dontje at oracle.com
Wed Sep 22 12:59:32 CDT 2010

Darius Buntinas wrote:
> On Sep 22, 2010, at 12:04 PM, Terry Dontje wrote:
>> Darius Buntinas wrote:
>> Ok I need a clarification here because I feel that I might be misinterpreting something.  So is the CANNOT_CONTINUE error class only returned by MPI after a previous error condition has been returned that has caused problems?  For example let's say we did an MPI_Bcast that resulted in a return of MPI_ERR_OP and for whatever reason the MPI library is borked.  So the next call to MPI would return the CANNOT_CONTINUE error class?
> Yes, I believe that's the behavior we talked about at the forum.
>> So is this an escape hatch for an implementation that does not support any type of fault tolerance to explicitly notify the user they shouldn't proceed any further?  I really wonder how many implementations will do so.  
> Well, I wouldn't say it's an escape hatch, since an implementation that doesn't support any fault tolerance needn't ever return CANNOT_CONTINUE.  Because we still haven't defined what happens after an error, operation after an error is still undefined, and the implementation is still free to do anything, including returning MPI_SUCCESS.
> However, a high quality implementation would return CANNOT_CONTINUE on subsequent MPI calls when it knows that something is borked beyond repair.
> Note that there's still a lot of middle ground between "no errors" and "totally borked".  We're giving the implementation an error to return if it finds itself totally borked.  This is why I agree with Josh's statement that this proposal is a minor change.
That's funny, because I was thinking a high quality implementation would 
never return CANNOT_CONTINUE, but rather more distinct error codes that let 
an application recover.  I would think very few errors would actually 
completely obliterate an MPI library's internal structures.  At least 
in the implementations I've seen, that seems to be the case.
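
To make the behavior under discussion concrete, here is a minimal sketch in C 
of a toy model, not real MPI code. Every name in it (toy_lib, toy_call, 
toy_may_continue, and the TOY_* constants) is hypothetical and invented for 
illustration; TOY_ERR_CANNOT_CONTINUE only stands in for the proposed 
CANNOT_CONTINUE class. It assumes the behavior described above: a call that 
fails reports its own specific error class, and only later calls report 
CANNOT_CONTINUE, and only if the failure left the library beyond repair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error classes for this sketch; these are not MPI constants. */
enum toy_err {
    TOY_SUCCESS = 0,
    TOY_ERR_OP,              /* stands in for a specific class such as MPI_ERR_OP */
    TOY_ERR_CANNOT_CONTINUE  /* stands in for the proposed CANNOT_CONTINUE class */
};

/* Toy library state: 'broken' means internal structures are beyond repair. */
struct toy_lib {
    bool broken;
};

/*
 * Simulate one library call.
 * 'fault' is the error this call runs into (TOY_SUCCESS for none).
 * 'fatal' says whether that error also wrecks the library's internal state.
 */
enum toy_err toy_call(struct toy_lib *lib, enum toy_err fault, bool fatal)
{
    assert(lib != NULL);
    if (lib->broken)
        return TOY_ERR_CANNOT_CONTINUE;  /* all later calls report this class */
    if (fault != TOY_SUCCESS && fatal)
        lib->broken = true;              /* this call still returns its own class */
    return fault;
}

/* Application-side policy: keep using the library unless it says it cannot go on. */
bool toy_may_continue(enum toy_err rc)
{
    return rc != TOY_ERR_CANNOT_CONTINUE;
}
```

In this model a library that rarely sets 'broken' behaves the way I describe 
above: the application mostly sees distinct error classes it can act on, and 
TOY_ERR_CANNOT_CONTINUE shows up only in the "totally borked" case.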

> -d

Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje at oracle.com
