[Mpi3-ft] The state of MPI is undefined

Howard Pritchard howardp at cray.com
Mon Jun 13 10:31:11 CDT 2011


Hi Josh,

I think your proposed change should be fine.  Did you intend to
include a proposed change to 8.3 here as well?  If so, it's
missing.

Howard

Josh Hursey wrote:
> Here is a suggested paragraph that should probably go in a modified
> version of the existing section 2.8 wording below.
> 
> What do folks think about this? Do we need more, less, or something different?
> 
> Thanks,
> Josh
> 
> 
> Discussion:
> ------------------------------------
> So MPI talks about several high-level kinds of erroneous behavior:
>  - Section 2.8 "an erroneous MPI call" meaning an MPI call of bad form
> with regard to arguments and matching rules.
> For example, sending an integer and receiving a float, since
> datatypes are not used in message matching.
>  - Section 2.8 "reliable message transmission" MPI must mask any
> instability in the networking stack.
>  - Section 2.8 "processor failures" - nothing defined (no clarification)
>  - Section 8.3 "error detected" could be an erroneous MPI call or
> internal error.
> 
> We would like to say that if the MPI implementation returns
> MPI_ERR_RANK_FAIL_STOP then it is required to provide the semantics
> from Chapter 17. Once it can no longer provide those semantics,
> it must return some other error.
> 
> 
> Suggest:
> ------------------------------------
> MPI provides the user with reliable message transmission.
> A message sent is always received correctly, and the user does not
> need to check for transmission errors, time-outs, or other error
> conditions.
> In other words, MPI does not provide mechanisms for dealing with
> failures in the communication system.
> If the MPI implementation is built on an unreliable underlying
> mechanism, then it is the job of the implementor of the MPI subsystem
> to insulate the user from this unreliability, or to reflect
> unrecoverable errors as failures.
> Whenever possible, such failures will be reflected as errors in the
> relevant communication call.
> [REMOVE: Similarly, MPI itself provides no mechanisms for handling
> processor failures.]
> 
> MPI does not provide the user with transparent process recovery upon
> process failure.
> Once a process fails, MPI does not guarantee that the job can continue
> or, if the job can continue, that the process can be recovered.
> If the MPI implementation can continue operating after process failure
> then it must return an appropriate error class (e.g.,
> MPI_ERR_RANK_FAIL_STOP) and provide the additional semantics defined
> in Chapter 17.
> The MPI implementation documentation will provide information on the
> possible effect of each supported class of errors.
> 
> 
> Section 2.8 (Existing)
> ------------------------------------
> MPI provides the user with reliable message transmission.
> A message sent is always received correctly, and the user does not
> need to check for transmission errors, time-outs, or other error
> conditions.
> In other words, MPI does not provide mechanisms for dealing with
> failures in the communication system.
> If the MPI implementation is built on an unreliable underlying
> mechanism, then it is the job of the implementor of the MPI subsystem
> to insulate the user from this unreliability, or to reflect
> unrecoverable errors as failures.
> Whenever possible, such failures will be reflected as errors in the
> relevant communication call.
> Similarly, MPI itself provides no mechanisms for handling processor failures.
> 
> ...
> 
> This document does not specify the state of a computation after an
> erroneous MPI call has occurred.
> The desired behavior is that a relevant error code be returned, and
> the effect of the error be localized to the greatest possible extent.
> E.g., it is highly desirable that an erroneous receive call will not
> cause any part of the receiver’s memory to be overwritten, beyond the
> area specified for receiving the message.
> 
> Implementations may go beyond this document in supporting in a
> meaningful manner MPI calls that are defined here to be erroneous.
> For example, MPI specifies strict type matching rules between matching
> send and receive operations: it is erroneous to send a floating point
> variable and receive an integer.
> Implementations may go beyond these type matching rules, and provide
> automatic type conversion in such situations.
> It will be helpful to generate warnings for such non-conforming behavior.
> 
> 
> Section 8.3 (Existing)
> ------------------------------------
> After an error is detected, the state of MPI is undefined.
> That is, using a user-defined error handler, or MPI_ERRORS_RETURN,
> does not necessarily allow the user to continue to use MPI after an
> error is detected.
> The purpose of these error handlers is to allow a user to issue
> user-defined error messages and to take actions unrelated to MPI (such
> as flushing I/O buffers) before a program exits.
> An MPI implementation is free to allow MPI to continue after an error
> but is not required to do so.
> 
> Advice to implementors.
> A good quality implementation will, to the greatest possible extent,
> circumscribe the impact of an error, so that normal processing can
> continue after an error handler was invoked.
> The implementation documentation will provide information on the
> possible effect of each class of errors.
> (End of advice to implementors.)
> 
> 
> 
> On Wed, Jun 8, 2011 at 2:35 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
>> Per our conversation today, we wanted to have a paragraph clearly
>> defining what the MPI standard means by 'After an error is detected,
>> the state of MPI is undefined', since the state is in fact defined for
>> some classes of errors. The paragraph would clarify further references
>> of this nature in the MPI standard.
>>
>> Note that this is slightly different than when the program (code) is
>> erroneous due to misuse of the MPI standard interfaces.
>>
>> A few places in the text to look:
>>  - Section 2.8: Error Handling - Paragraphs 1 and 6
>>  - Section 8.3: Error Handling - Paragraphs 6 and 7.
>>  - Section 13.7: I/O Error Handling - Advice to users
>>
>> If the MPI implementation returns an error of MPI_ERR_RANK_FAIL_STOP
>> then it must provide the semantics defined in Chapter 17. We are not,
>> at this time, defining the semantic behavior of the MPI standard after
>> returning other errors.
>>
>> Any suggestions on possible wording?
>> Something like "The state of the computation after an error has
>> occurred may be undefined. A high-quality implementation will continue
>> afterwards. If the implementation returns an error and the semantics
>> after the error are defined in the standard (e.g.,
>> MPI_ERR_RANK_FAIL_STOP in Chapter 17), then the implementation must
>> provide the specified semantics."
>>
>> Any suggestions on where to put the wording?
>> It was suggested that we change/update paragraphs 6 and 7 in Section
>> 8.3 appropriately.
>>
>>
>> Thoughts,
>> Josh
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
> 
> 
> 
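To make the rule under discussion concrete, here is a minimal sketch in C of how an application might decide whether it can keep using MPI after an error, under the wording proposed in this thread. It deliberately does not include mpi.h: the DEMO_ names are hypothetical stand-ins, and DEMO_ERR_RANK_FAIL_STOP mirrors the proposed MPI_ERR_RANK_FAIL_STOP class, which is a proposal here and not an existing MPI constant. In a real program the error class would presumably come from MPI_Error_class, with MPI_ERRORS_RETURN set on the communicator.

```c
/* Hypothetical stand-ins for MPI error classes, so the sketch builds
 * without an MPI library. None of these names are real MPI constants. */
enum demo_error_class {
    DEMO_SUCCESS = 0,
    DEMO_ERR_TYPE,            /* an erroneous call, e.g. a type mismatch */
    DEMO_ERR_OTHER,           /* any other error class */
    DEMO_ERR_RANK_FAIL_STOP   /* proposed process-failure class */
};

/* What the application may assume about MPI after a call returns. */
enum demo_mpi_state {
    DEMO_STATE_NORMAL,     /* no error: MPI is usable as usual */
    DEMO_STATE_DEFINED,    /* error, but post-error semantics are specified */
    DEMO_STATE_UNDEFINED   /* error: the state of MPI is undefined */
};

/* Encode the proposed rule: only the fail-stop class carries defined
 * semantics (the proposed Chapter 17); every other error leaves the
 * state undefined as far as the standard is concerned. */
static enum demo_mpi_state demo_state_after(enum demo_error_class ec)
{
    switch (ec) {
    case DEMO_SUCCESS:
        return DEMO_STATE_NORMAL;
    case DEMO_ERR_RANK_FAIL_STOP:
        return DEMO_STATE_DEFINED;
    default:
        return DEMO_STATE_UNDEFINED;
    }
}

/* Nonzero if a portable application may keep calling MPI. */
static int demo_may_continue(enum demo_error_class ec)
{
    return demo_state_after(ec) != DEMO_STATE_UNDEFINED;
}
```

A caller would check demo_may_continue() after each failed call and, when it returns zero, limit itself to actions unrelated to MPI (such as flushing I/O buffers) before exiting. A particular implementation may allow more than this, but a portable program cannot rely on that.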


-- 
Howard Pritchard
Software Engineering
Cray, Inc.


