[Mpi3-ft] The state of MPI is undefined

Josh Hursey jjhursey at open-mpi.org
Mon Jun 13 10:08:25 CDT 2011


Here is a suggested paragraph that should probably go into a modified
version of the existing Section 2.8 wording (quoted below).

What do folks think about this? Do we need more, less, or something different?

Thanks,
Josh


Discussion:
------------------------------------
So MPI talks about a few kinds of high-level erroneous behavior:
 - Section 2.8 "an erroneous MPI call" meaning an MPI call of bad form
with regard to arguments and matching rules.
For example, sending an integer and receiving a float, since
datatypes are not used in message matching.
 - Section 2.8 "reliable message transmission" MPI must mask any
instability in the networking stack.
 - Section 2.8 "processor failures" - nothing defined (no clarification)
 - Section 8.3 "error detected" could be an erroneous MPI call or
internal error.

We would like to say that if the MPI implementation returns
MPI_ERR_RANK_FAIL_STOP then it is required to provide the semantics
from Chapter 17. Once it can no longer provide those semantics,
it must return some other error.
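
As a rough illustration of that rule from the implementation's side, here
is a minimal sketch in plain C. The enum values and the function name are
hypothetical stand-ins invented for this example; they are not taken from
mpi.h or from any existing implementation, and the two boolean inputs are
an assumption about what an implementation would know internally.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for error classes; these are NOT values from mpi.h.
 * SKETCH_ERR_RANK_FAIL_STOP mirrors the proposed MPI_ERR_RANK_FAIL_STOP. */
enum sketch_err_class {
    SKETCH_SUCCESS,
    SKETCH_ERR_RANK_FAIL_STOP,
    SKETCH_ERR_OTHER
};

/* Implementation-side rule from the discussion: report the fail-stop class
 * only while the Chapter 17 semantics can still be provided; otherwise
 * report some other error class. */
enum sketch_err_class
sketch_class_to_report(bool process_failed, bool can_provide_ch17_semantics)
{
    if (!process_failed) {
        return SKETCH_SUCCESS;
    }
    return can_provide_ch17_semantics ? SKETCH_ERR_RANK_FAIL_STOP
                                      : SKETCH_ERR_OTHER;
}
```

The point of the sketch is only that returning the fail-stop class acts as
a promise: once the promise cannot be kept, a different class has to be
reported.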


Suggest:
------------------------------------
MPI provides the user with reliable message transmission.
A message sent is always received correctly, and the user does not
need to check for transmission errors, time-outs, or other error
conditions.
In other words, MPI does not provide mechanisms for dealing with
failures in the communication system.
If the MPI implementation is built on an unreliable underlying
mechanism, then it is the job of the implementor of the MPI subsystem
to insulate the user from this unreliability, or to reflect
unrecoverable errors as failures.
Whenever possible, such failures will be reflected as errors in the
relevant communication call.
[REMOVE: Similarly, MPI itself provides no mechanisms for handling
processor failures.]

MPI does not provide the user with transparent process recovery upon
process failure.
Once a process fails, MPI does not guarantee that the job can continue
or, if the job can continue, that the process can be recovered.
If the MPI implementation can continue operating after process failure
then it must return an appropriate error class (e.g.,
MPI_ERR_RANK_FAIL_STOP) and provide the additional semantics defined
in Chapter 17.
The MPI implementation documentation will provide information on the
possible effect of each supported class of errors.


Section 2.8 (Existing)
------------------------------------
MPI provides the user with reliable message transmission.
A message sent is always received correctly, and the user does not
need to check for transmission errors, time-outs, or other error
conditions.
In other words, MPI does not provide mechanisms for dealing with
failures in the communication system.
If the MPI implementation is built on an unreliable underlying
mechanism, then it is the job of the implementor of the MPI subsystem
to insulate the user from this unreliability, or to reflect
unrecoverable errors as failures.
Whenever possible, such failures will be reflected as errors in the
relevant communication call.
Similarly, MPI itself provides no mechanisms for handling processor failures.

...

This document does not specify the state of a computation after an
erroneous MPI call has occurred.
The desired behavior is that a relevant error code be returned, and
the effect of the error be localized to the greatest possible extent.
E.g., it is highly desirable that an erroneous receive call will not
cause any part of the receiver’s memory to be overwritten, beyond the
area specified for receiving the message.

Implementations may go beyond this document in supporting in a
meaningful manner MPI calls that are defined here to be erroneous.
For example, MPI specifies strict type matching rules between matching
send and receive operations: it is erroneous to send a floating point
variable and receive an integer.
Implementations may go beyond these type matching rules, and provide
automatic type conversion in such situations.
It will be helpful to generate warnings for such non-conforming behavior.


Section 8.3 (Existing)
------------------------------------
After an error is detected, the state of MPI is undefined.
That is, using a user-defined error handler, or MPI_ERRORS_RETURN,
does not necessarily allow the user to continue to use MPI after an
error is detected.
The purpose of these error handlers is to allow a user to issue
user-defined error messages and to take actions unrelated to MPI (such
as flushing I/O buffers) before a program exits.
An MPI implementation is free to allow MPI to continue after an error
but is not required to do so.

Advice to implementors.
A good quality implementation will, to the greatest possible extent,
circumscribe the impact of an error, so that normal processing can
continue after an error handler was invoked.
The implementation documentation will provide information on the
possible effect of each class of errors.
(End of advice to implementors.)
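
Read together with the proposed Section 2.8 wording, the text above
suggests that a conservative application can rely on continuing to use MPI
only after error classes whose post-error semantics the standard defines.
A minimal sketch of that decision in plain C follows. All names here are
hypothetical and do not come from mpi.h; in a real program the error code
would come from an MPI call made under MPI_ERRORS_RETURN and would be
mapped to a class with MPI_Error_class.

```c
/* Hypothetical stand-ins, not values from mpi.h.
 * APP_ERR_RANK_FAIL_STOP mirrors the proposed MPI_ERR_RANK_FAIL_STOP. */
enum app_err_class {
    APP_SUCCESS,
    APP_ERR_RANK_FAIL_STOP,
    APP_ERR_OTHER
};

/* What a conservative application does after an MPI call returns. */
enum app_action {
    APP_CONTINUE,          /* no error: keep using MPI normally */
    APP_RECOVER_PER_CH17,  /* semantics after the error are defined */
    APP_CLEANUP_AND_EXIT   /* state of MPI is undefined: do non-MPI cleanup
                              (e.g., flush I/O buffers), then exit */
};

enum app_action app_action_after(enum app_err_class cls)
{
    switch (cls) {
    case APP_SUCCESS:
        return APP_CONTINUE;
    case APP_ERR_RANK_FAIL_STOP:
        return APP_RECOVER_PER_CH17;
    default:
        /* Any other class: the standard makes no promise, so assume the
         * worst unless the implementation documents otherwise. */
        return APP_CLEANUP_AND_EXIT;
    }
}
```

An implementation remains free to allow continuation after other errors,
so the default branch is a portability choice rather than a requirement.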



On Wed, Jun 8, 2011 at 2:35 PM, Josh Hursey <jjhursey at open-mpi.org> wrote:
> Per our conversation today, we wanted to have a paragraph clearly
> defining what the MPI standard means by 'After an error is detected,
> the state of MPI is undefined.'. Since it is defined for some classes
> of errors. The paragraph would clarify further references of this
> nature in the MPI standard.
>
> Note that this is slightly different than when the program (code) is
> erroneous due to misuse of the MPI standard interfaces.
>
> A few places in the text to look:
>  - Section 2.8: Error Handling - Paragraph 1 and 6
>  - Section 8.3: Error Handling - Paragraphs 6 and 7.
>  - Section 13.7: I/O Error Handling - Advice to users
>
> If the MPI implementation returns an error of MPI_ERR_RANK_FAIL_STOP
> then it must provide the semantics defined in Chapter 17. We are not,
> at this time, defining the semantic behavior of the MPI standard after
> returning other errors.
>
> Any suggestions on possible wording?
> Something like "The state of the computation after an error has
> occurred may be undefined. A high-quality implementation will continue
> afterwards. IF the implementation returns an error and the semantics
> after the error are defined in the standard (e.g.,
> MPI_ERR_RANK_FAIL_STOP in Chapter 17), then the implementation must
> provide the specified semantics."
>
> Any suggestions on where to put the wording?
> It was suggested that we change/update paragraphs 6 and 7 in Section
> 8.3 appropriately.
>
>
> Thoughts,
> Josh
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



