[Mpi3-ft] Defining the state of MPI after an error

Terry Dontje terry.dontje at oracle.com
Wed Sep 22 12:59:32 CDT 2010


Darius Buntinas wrote:
> On Sep 22, 2010, at 12:04 PM, Terry Dontje wrote:
>
>   
>> Darius Buntinas wrote:
>> Ok I need a clarification here because I feel that I might be misinterpreting something.  So is the CANNOT_CONTINUE error class only returned by MPI after a previous error condition has been returned that has caused problems?  For example let's say we did an MPI_Bcast that resulted in a return of MPI_ERR_OP and for whatever reason the MPI library is borked.  So the next call to MPI would return the CANNOT_CONTINUE error class?
>>     
>
> Yes, I believe that's the behavior we talked about at the forum.
>
>   
>> So is this an escape hatch for an implementation that does not support any type of fault tolerance to explicitly notify the user they shouldn't proceed any further?  I really wonder how many implementations will do such.  
>>
>>     
>
> Well, I wouldn't say it's an escape hatch, since an implementation that doesn't support any fault tolerance needn't ever return CANNOT_CONTINUE.  Because we still haven't defined what happens after an error, operation after an error is still undefined, so the implementation is still free to do anything, including returning MPI_SUCCESS.
>
> However, a high quality implementation would return CANNOT_CONTINUE on subsequent MPI calls when it knows that something is borked beyond repair.
>
> Note that there's still a lot of middle ground between "no errors" and "totally borked".  We're giving the implementation an error to return if it finds itself totally borked.  This is why I agree with Josh's statement that this proposal is a minor change.
>
>   
That's funny, because I was thinking a high quality implementation would
never return CANNOT_CONTINUE, but rather more distinct error codes that
let an application recover.  I would think very few errors would actually
completely obliterate an MPI library's internal structures.  At least in
the implementations I've seen, that seems to be the case.
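
To make the two positions concrete, here is a minimal sketch in C.  It is
a toy model, not MPI: every name in it (toy_lib, toy_call, TOY_ERR_OP,
TOY_ERR_CANNOT_CONTINUE) is hypothetical, and nothing in it comes from the
proposal text or from a real implementation.  It only assumes the behavior
described above: the call that hits the fault returns its own specific
error class, and only later calls return the cannot-continue class, and
only if the fault actually damaged internal state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error classes for this sketch; not real MPI constants. */
enum toy_err {
    TOY_SUCCESS = 0,
    TOY_ERR_OP,              /* stands in for a specific class such as MPI_ERR_OP */
    TOY_ERR_CANNOT_CONTINUE  /* stands in for the proposed CANNOT_CONTINUE class */
};

/* Hypothetical library state: just a flag for "borked beyond repair". */
struct toy_lib {
    bool broken;
};

/*
 * Toy stand-in for any library call.
 *   fault: the error this call runs into (TOY_SUCCESS if none).
 *   fatal: whether that fault damages the library's internal state.
 * Returns the specific error for the failing call itself, and
 * TOY_ERR_CANNOT_CONTINUE for every call made after a fatal fault.
 */
enum toy_err toy_call(struct toy_lib *lib, enum toy_err fault, bool fatal)
{
    assert(lib != NULL);

    if (lib->broken)
        return TOY_ERR_CANNOT_CONTINUE;  /* state already unusable */

    if (fault != TOY_SUCCESS && fatal)
        lib->broken = true;              /* remember for later calls */

    return fault;                        /* specific class, or success */
}
```

With a non-fatal fault the caller sees TOY_ERR_OP once and can keep going,
which is the case I would expect to be the common one.  With a fatal fault
the caller still sees TOY_ERR_OP on that call, and TOY_ERR_CANNOT_CONTINUE
on every call after it.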

--td
> -d


-- 
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje at oracle.com
