[Mpi3-ft] Defining the state of MPI after an error
Terry Dontje
terry.dontje at oracle.com
Wed Sep 22 12:59:32 CDT 2010
Darius Buntinas wrote:
> On Sep 22, 2010, at 12:04 PM, Terry Dontje wrote:
>
>
>> Darius Buntinas wrote:
>> OK, I need a clarification here because I feel that I might be misinterpreting something. So is the CANNOT_CONTINUE error class only returned by MPI after a previous error condition has been returned that has caused problems? For example, let's say we did an MPI_Bcast that resulted in a return of MPI_ERR_OP, and for whatever reason the MPI library is borked. Would the next call to MPI then return the CANNOT_CONTINUE error class?
>>
>
> Yes, I believe that's the behavior we talked about at the forum.
>
>
>> So is this an escape hatch for an implementation that does not support any type of fault tolerance to explicitly notify the user they shouldn't proceed any further? I really wonder how many implementations will do so.
>>
>>
>
> Well, I wouldn't say it's an escape hatch, since an implementation that doesn't support any fault tolerance needn't ever return CANNOT_CONTINUE. Because we still haven't defined what happens after an error, operation after an error is still undefined, and the implementation is still free to do anything, including returning MPI_SUCCESS.
>
> However, a high-quality implementation would return CANNOT_CONTINUE on subsequent MPI calls when it knows that something is borked beyond repair.
>
> Note that there's still a lot of middle ground between "no errors" and "totally borked". We're giving the implementation an error to return if it finds itself totally borked. This is why I agree with Josh's statement that this proposal is a minor change.
>
>
That's funny, because I was thinking a high-quality implementation would
never return CANNOT_CONTINUE, but rather more distinct error codes that let
an application recover. I would think very few errors would actually
completely obliterate an MPI library's internal structures. At least in
the implementations I've seen, that seems to be the case.
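
To make the two cases concrete, here is a minimal toy model in C. It is only a sketch, not MPI code: the names toy_err, toy_lib and toy_call are hypothetical and exist only for this illustration, and the TOY_* constants merely stand in for MPI_ERR_OP and the proposed CANNOT_CONTINUE class. It assumes the library can tell whether an error has left it unusable. A recoverable error is reported with its own code and later calls proceed normally; an unrecoverable one is also reported with its own code, and only the calls after it return the CANNOT_CONTINUE stand-in.

```c
#include <stdbool.h>

/* Hypothetical error classes for this sketch only. They mimic the names
 * used in this thread and are NOT taken from mpi.h. */
enum toy_err {
    TOY_SUCCESS = 0,
    TOY_ERR_OP,               /* stands in for MPI_ERR_OP */
    TOY_ERR_CANNOT_CONTINUE   /* stands in for the proposed CANNOT_CONTINUE */
};

/* Toy library state: has an earlier error damaged it beyond repair? */
struct toy_lib {
    bool broken;
};

/* Model of a single library call.
 *   fault: the error this call itself runs into (TOY_SUCCESS if none)
 *   fatal: whether that error leaves the library unusable */
enum toy_err toy_call(struct toy_lib *lib, enum toy_err fault, bool fatal)
{
    if (lib->broken)
        return TOY_ERR_CANNOT_CONTINUE;  /* already known to be broken */
    if (fault != TOY_SUCCESS && fatal)
        lib->broken = true;              /* remember it for later calls */
    return fault;                        /* report the specific error first */
}
```

In this model, a call that hits a non-fatal TOY_ERR_OP returns TOY_ERR_OP and the next call can still succeed. A call that hits a fatal TOY_ERR_OP also returns TOY_ERR_OP, but every call after it returns TOY_ERR_CANNOT_CONTINUE.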
--td
> -d
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>
--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.dontje at oracle.com