[Mpi3-ft] The state of MPI is undefined
Josh Hursey
jjhursey at open-mpi.org
Mon Jun 13 10:34:39 CDT 2011
I included the text from 8.3 so there would be more context in the
email regarding what the standard currently says. I thought the plan
was to have the modification/clarification early on then leave the
text in 8.3 pretty much as is. If the 8.3 text needs to be adjusted
then we can do that. The hope was that, by having an earlier
clarification, we could avoid duplicating much of the text throughout
the document.
-- Josh
On Mon, Jun 13, 2011 at 11:31 AM, Howard Pritchard <howardp at cray.com> wrote:
> Hi Josh,
>
> I think your proposed change should be fine. Did you intend to
> include a proposed change to 8.3 here as well? If so, it's
> missing.
>
> Howard
>
> Josh Hursey wrote:
>> Here is a suggested paragraph that should probably go in a modified
>> version of the existing section 2.8 wording below.
>>
>> What do folks think about this? Do we need more, less, or something different?
>>
>> Thanks,
>> Josh
>>
>>
>> Discussion:
>> ------------------------------------
>> So MPI talks about a couple of high-level kinds of erroneous behavior:
>> - Section 2.8 "an erroneous MPI call" meaning an MPI call of bad form
>> with regard to arguments and matching rules.
>> For example, sending an integer and receiving a float, since
>> datatypes are not used in message matching.
>> - Section 2.8 "reliable message transmission" MPI must mask any
>> instability in the networking stack.
>> - Section 2.8 "processor failures" - nothing defined (no clarification)
>> - Section 8.3 "error detected" could be an erroneous MPI call or
>> internal error.
>>
>> We would like to say that if the MPI implementation returns
>> MPI_ERR_RANK_FAIL_STOP then it is required to provide the semantics
>> from Chapter 17. At the point where it can no longer provide those
>> semantics, it must return some other error.
>>
>>
>> Suggest:
>> ------------------------------------
>> MPI provides the user with reliable message transmission.
>> A message sent is always received correctly, and the user does not
>> need to check for transmission errors, time-outs, or other error
>> conditions.
>> In other words, MPI does not provide mechanisms for dealing with
>> failures in the communication system.
>> If the MPI implementation is built on an unreliable underlying
>> mechanism, then it is the job of the implementor of the MPI subsystem
>> to insulate the user from this unreliability, or to reflect
>> unrecoverable errors as failures.
>> Whenever possible, such failures will be reflected as errors in the
>> relevant communication call.
>> [REMOVE: Similarly, MPI itself provides no mechanisms for handling
>> processor failures.]
>>
>> MPI does not provide the user with transparent process recovery upon
>> process failure.
>> Once a process fails, MPI does not guarantee that the job can continue
>> or, if the job can continue, that the process can be recovered.
>> If the MPI implementation can continue operating after process failure
>> then it must return an appropriate error class (e.g.,
>> MPI_ERR_RANK_FAIL_STOP) and provide the additional semantics defined
>> in Chapter 17.
>> The MPI implementation documentation will provide information on the
>> possible effect of each supported class of errors.
>>
>>
>> Section 2.8 (Existing)
>> ------------------------------------
>> MPI provides the user with reliable message transmission.
>> A message sent is always received correctly, and the user does not
>> need to check for transmission errors, time-outs, or other error
>> conditions.
>> In other words, MPI does not provide mechanisms for dealing with
>> failures in the communication system.
>> If the MPI implementation is built on an unreliable underlying
>> mechanism, then it is the job of the implementor of the MPI subsystem
>> to insulate the user from this unreliability, or to reflect
>> unrecoverable errors as failures.
>> Whenever possible, such failures will be reflected as errors in the
>> relevant communication call.
>> Similarly, MPI itself provides no mechanisms for handling processor failures.
>>
>> ...
>>
>> This document does not specify the state of a computation after an
>> erroneous MPI call has occurred.
>> The desired behavior is that a relevant error code be returned, and
>> the effect of the error be localized to the greatest possible extent.
>> E.g., it is highly desirable that an erroneous receive call will not
>> cause any part of the receiver’s memory to be overwritten, beyond the
>> area specified for receiving the message.
>>
>> Implementations may go beyond this document in supporting in a
>> meaningful manner MPI calls that are defined here to be erroneous.
>> For example, MPI specifies strict type matching rules between matching
>> send and receive operations: it is erroneous to send a floating point
>> variable and receive an integer.
>> Implementations may go beyond these type matching rules, and provide
>> automatic type conversion in such situations.
>> It will be helpful to generate warnings for such non-conforming behavior.
>>
>>
>> Section 8.3 (Existing)
>> ------------------------------------
>> After an error is detected, the state of MPI is undefined.
>> That is, using a user-defined error handler, or MPI_ERRORS_RETURN,
>> does not necessarily allow the user to continue to use MPI after an
>> error is detected.
>> The purpose of these error handlers is to allow a user to issue
>> user-defined error messages and to take actions unrelated to MPI (such
>> as flushing I/O buffers) before a program exits.
>> An MPI implementation is free to allow MPI to continue after an error
>> but is not required to do so.
>>
>> Advice to implementors.
>> A good quality implementation will, to the greatest possible extent,
>> circumscribe the impact of an error, so that normal processing can
>> continue after an error handler was invoked.
>> The implementation documentation will provide information on the
>> possible effect of each class of errors.
>> (End of advice to implementors.)
>>
>>
>>
>> On Wed, Jun 8, 2011 at 2:35 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
>>> Per our conversation today, we wanted to have a paragraph clearly
>>> defining what the MPI standard means by 'After an error is detected,
>>> the state of MPI is undefined.', since the state is defined for some
>>> classes of errors. The paragraph would clarify further references of
>>> this nature in the MPI standard.
>>>
>>> Note that this is slightly different than when the program (code) is
>>> erroneous due to misuse of the MPI standard interfaces.
>>>
>>> A few places in the text to look:
>>> - Section 2.8: Error Handling - Paragraph 1 and 6
>>> - Section 8.3: Error Handling - Paragraphs 6 and 7.
>>> - Section 13.7: I/O Error Handling - Advice to users
>>>
>>> If the MPI implementation returns an error of MPI_ERR_RANK_FAIL_STOP
>>> then it must provide the semantics defined in Chapter 17. We are not,
>>> at this time, defining the semantic behavior of the MPI standard after
>>> returning other errors.
>>>
>>> Any suggestions on possible wording?
>>> Something like "The state of the computation after an error has
>>> occurred may be undefined. A high-quality implementation will continue
>>> afterwards. If the implementation returns an error and the semantics
>>> after the error are defined in the standard (e.g.,
>>> MPI_ERR_RANK_FAIL_STOP in Chapter 17), then the implementation must
>>> provide the specified semantics."
>>>
>>> Any suggestions on where to put the wording?
>>> It was suggested that we change/update paragraphs 6 and 7 in Section
>>> 8.3 appropriately.
>>>
>>>
>>> Thoughts,
>>> Josh
>>>
>>> --
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>>
>>
>>
>>
>
>
> --
> Howard Pritchard
> Software Engineering
> Cray, Inc.
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>
>
--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
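
The rule discussed in this thread (the state of MPI is defined after an
error only when the returned error class has semantics specified in the
standard, e.g. the proposed MPI_ERR_RANK_FAIL_STOP and Chapter 17) can be
sketched from the application's side roughly as below. This is only an
illustrative sketch, not wording or code from the proposal. It deliberately
does not include mpi.h: every name prefixed with sketch_ or SKETCH_ is a
hypothetical stand-in, because MPI_ERR_RANK_FAIL_STOP is a proposed error
class in this discussion. In a real program the class would presumably be
obtained by calling MPI_Error_class on a return code, with MPI_ERRORS_RETURN
set on the communicator.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for MPI error classes (assumption: a real program
 * would use the values from mpi.h instead of this enum). */
enum sketch_err_class {
    SKETCH_CLASS_SUCCESS,        /* stands in for MPI_SUCCESS */
    SKETCH_CLASS_RANK_FAIL_STOP, /* stands in for the proposed MPI_ERR_RANK_FAIL_STOP */
    SKETCH_CLASS_OTHER           /* any other error class */
};

/* What the application may do after an MPI call returns. */
enum sketch_action {
    SKETCH_CONTINUE,          /* no error: keep using MPI normally */
    SKETCH_RECOVER_PER_CH17,  /* error with defined semantics: MPI stays usable */
    SKETCH_CLEANUP_AND_EXIT   /* state of MPI undefined: only non-MPI cleanup */
};

/* True when the state of MPI is defined after a call returned this class,
 * following the wording suggested in the thread. */
bool sketch_mpi_state_defined(enum sketch_err_class cls)
{
    return cls == SKETCH_CLASS_SUCCESS || cls == SKETCH_CLASS_RANK_FAIL_STOP;
}

/* Map the returned error class to the action a portable program can take. */
enum sketch_action sketch_after_call(enum sketch_err_class cls)
{
    switch (cls) {
    case SKETCH_CLASS_SUCCESS:
        return SKETCH_CONTINUE;
    case SKETCH_CLASS_RANK_FAIL_STOP:
        /* The implementation promised the Chapter 17 semantics. */
        return SKETCH_RECOVER_PER_CH17;
    default:
        /* Section 8.3 (existing): take actions unrelated to MPI, such as
         * flushing I/O buffers, and then exit. */
        return SKETCH_CLEANUP_AND_EXIT;
    }
}
```

An implementation remains free to allow MPI to continue after other errors,
so the default branch reflects only what a portable program can rely on.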