[Mpi3-ft] Defining the state of MPI after an error

Bronis R. de Supinski bronis at llnl.gov
Wed Sep 22 11:27:59 CDT 2010


> This proposal is not a minor change.

I am not certain to which proposal you are referring.

> Please do not make this hole in the standard and assume you can later 
> add language to standardize everything that comes through the hole.

Again, I am not certain to what you are referring. My opinion
is that the current status is a hole in the standard. Leaving
it simply as "undefined" means that no one can rely on anything
other than "It won't work" and they have no leverage to ask
what the actual state is.

> If the standard is to introduce the notion of a recoverable error it 
> must be as part of a full description of what "recovery" means.

I have not introduced the notion of a recoverable error in
what I suggested. I guess you must be referring to Josh's
proposal although it is still unclear. I have merely stated
that we should require documentation in some situations.

> I think it is dangerous and ultimately useless to have implementors 
> mark a failure as "recoverable" when the post error state of the 
> distributed MPI has gone from "fully standards compliant" to "mostly 
> standards compliant, read my user doc read my legal disclaimer, cross 
> your fingers".

My suggestion is not "mostly standards compliant". If you read
it that way then you misconstrue what is accepted practice in
a wide range of existing standards.

> See comment below for why I do not think the new hole is needed to 
> allow people to do implementation specific recoverability.

Obviously, we disagree. I can only say that my position is based
on accepted practice in many standards while yours is based on
what you feel should be the case.

> There is not even anything to prevent an implementation from deciding 
> to add a function MPXX_WHAT_STILL_WORKS(err_code, answer) and 
> documenting 5 or 5000 enumerated values for "answer" ranging from 

I agree there is not. However, you are under no obligation to
document it. "Implementation defined" implies that obligation.
Nothing more.

> IBM would probably return TAKE_A_CHANCE_IF_YOU_LIKE because I cannot 
> imagine how we would promise exactly what will work and what will not 
> but in practice most things will still work as expected.

So, ultimately, your position is that the state is undefined
in the relevant situations. I find that not very useful.

> Dick Treumann  -  MPI Team
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846         Fax (845) 433-8363
> mpi3-ft-bounces at lists.mpi-forum.org wrote on 09/21/2010 04:54:08 PM:
>> Re: [Mpi3-ft] Defining the state of MPI after an error
>> Bronis R. de Supinski
>> to:
>> MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
>> 09/21/2010 04:59 PM
>> Sent by:
>> mpi3-ft-bounces at lists.mpi-forum.org
>> Please respond to "Bronis R. de Supinski", "MPI 3.0 Fault Tolerance
>> and Dynamic Process Control working Group"
>> Dick:
>> Re:
>>> The current MPI standard does not say the MPI implementation is totally
>>> broken once there is an error.  Saying MPI state is undefined after an
>>> error simply says that the detailed semantic of the MPI standard can no
>>> longer be promised. In other words, after an error you leave behind the
>>> security of a portable standard semantic.  You are operating at your own
>>> risk. You do not need to read more than that into it.
>> Perhaps my problem with this position is that I come from the
>> background of language definitions for compilers. When you
>> read "undefined" in the OpenMP specification then you are
>> being told that things are broken and the implementation does not
>> need to do anything or even tell you what they actually do (and
>> I believe the same is true for the C and C++ standards). An
>> alternative is "implementation defined", which requires the
>> implementer to document what they actually do. Without that,
>> you cannot even rely on actions with a specific implementation
>> (unless you believe "My tests so far have not failed so I am OK").
> When a standard says behavior is "undefined" in some situation, it 
> cannot mean behavior is "broken". It cannot mean the implementor is 
> prohibited from making it still work. It cannot mean the implementor is 
> prohibited from making certain things work and documenting them. Any 
> statement like this in a standard would be definition of behavior and 
> the behavior would no longer be "undefined".

When a standard says behavior is "undefined", it means that ANY
behavior is valid AND the implementer has no obligation to
document how their implementation behaves. In practice, this
means the user must assume that things are broken. You can
argue that you document things differently but that does not
change the meaning of the standard.

> The only thing a standard can logically mean by "undefined" is that the 
> STANDARD no longer mandates the definition.

I guess we agree here.

> Bronis says:
>> I strongly feel "undefined" should be reserved for situations that
>> mean "your program is irrevocably broken and the implementer does
>> not need to worry about what happens to it after encountering them."
> I would say this as:
> I strongly feel "undefined" should be reserved for situations that mean 
> "The standard no longer guarantees your program is not irrevocably 
> broken. The implementer is not required by the standard to worry about 
> what happens to it after encountering them. An implementation is free to 
> provide any "better" behavior that may be of value but users cannot 
> assume another implementation provides similar behavior so cannot assume 
> standards defined portability."

Except you are not obligated to document it. I suppose you think
that "I can document it if I like so what's the difference"
but it is significant in practice.

> I do not see how the use of the word "undefined" in a standard can be 
> interpreted as a prohibition of any behavior an implementation might 
> offer.

At no point did I state that it prohibits the implementation from
offering any behavior. I stated that it imposes no obligation for
documentation and that any sane user assumes that things are
broken when they are not even assured that documentation will
be available. I base this position on experience.

