[Mpi3-ft] Defining the state of MPI after an error

Wed Sep 22 15:04:54 CDT 2010

An incident that trashes MPI internal data structures will seldom be 
recognizable as a trigger for a non-SUCCESS return code on a particular call. If 
you do an MPI_Recv and the buffer pointer happens to drop the data all 
over MPI internal state, there is about zero chance the MPI_Recv call will 
be able to detect that.  It will just return MPI_SUCCESS and depending on 
what you trashed, things may run fine or may break in unpredictable ways.
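
A toy model of that failure mode, as a sketch only. The ToyLibrary class, its 16-byte arena, and the "MPISTATE" marker are all made up for illustration and have nothing to do with any real MPI implementation. The point is just that an unchecked copy has no way to notice it landed on library bookkeeping, so the call still reports success.

```python
SUCCESS = 0  # stand-in for MPI_SUCCESS in this toy model


class ToyLibrary:
    """Toy model: user buffer and library bookkeeping share one arena."""

    def __init__(self):
        self.arena = bytearray(16)
        # Bytes 0..7 are the user's receive buffer; bytes 8..15 hold
        # (hypothetical) library-internal state.
        self.arena[8:16] = b"MPISTATE"

    def recv(self, offset, payload):
        # Like a C library handed a raw pointer, this copy does not know
        # where the user's buffer ends, so it cannot detect an overrun.
        self.arena[offset:offset + len(payload)] = payload
        return SUCCESS

    def state_intact(self):
        return bytes(self.arena[8:16]) == b"MPISTATE"


lib = ToyLibrary()
rc_ok = lib.recv(0, b"12345678")    # fits in the user buffer
ok_after_first = lib.state_intact()
rc_bad = lib.recv(4, b"12345678")   # bad "pointer": spills into bytes 8..11
print(rc_ok, ok_after_first, rc_bad, lib.state_intact())  # 0 True 0 False
```

Both calls return the same code; only the second one damaged the internal state, and nothing in the return value says so.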

I have absolutely no problem with a community agreement to try some 
prototyping of ideas that will be proposed for the standard if they prove 
out.

Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363

From:
"Bronevetsky, Greg" <bronevetsky1 at llnl.gov>
To:
"MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" 
<mpi3-ft at lists.mpi-forum.org>
Date:
09/22/2010 03:37 PM
Subject:
Re: [Mpi3-ft] Defining the state of MPI after an error
Sent by:
mpi3-ft-bounces at lists.mpi-forum.org

One candidate for CANNOT_CONTINUE would be a data corruption in MPI memory 
or some data structure inconsistency due to a bug. This could have zero 
effect or could corrupt application results or system state. It would be 
exceedingly difficult for MPI to do anything meaningful here and continued 
operation is potentially very dangerous. As such, I would consider this to 
be a bad enough error to return CANNOT_CONTINUE. 

I think the point of this proposal is not that CANNOT_CONTINUE is going to 
be a common error but to lay the groundwork for a more useful error 
reporting scheme. Today we’re quite sure that CANNOT_CONTINUE will be the 
worst thing that an MPI implementation will want to return but we’re not 
really sure about what the other errors will look like. For example, 
RANK_DEAD and LINK_DEAD sound like plausible error messages but we won’t 
know until individual implementations have had a chance to experiment 
with them. This proposal allows such experimentation to happen within the same 
basic error reporting framework.

Having said that, I’m not completely convinced that we need to include 
this in the spec yet or whether this can be more like a community 
agreement until we understand the problem better.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
http://greg.bronevetsky.com 

From: mpi3-ft-bounces at lists.mpi-forum.org [mailto:mpi3-ft-bounces at lists.mpi-forum.org] On Behalf Of Richard Treumann
Sent: Wednesday, September 22, 2010 11:17 AM
To: MPI 3.0 Fault Tolerance and Dynamic Process Control working Group
Subject: Re: [Mpi3-ft] Defining the state of MPI after an error

Darius 

I can imagine a few errors that I know will be harmless to MPI state. I 
can make sure nobody can do any harm by passing an invalid communicator to 
MPI_COMM_SIZE. 

I cannot think of a detectable error that would return and leave that 
thread of that process so totally broken that nothing in MPI will work 
from then on. In a collective, there may be processes in which the thread 
that called the collective never returns and that thread of the process is no 
longer usable because it is hung.  Other threads using other communicators 
in the process with a hung thread may work perfectly. 

Except for the very few cases where I know there was no damage (like a bad 
comm on MPI_COMM_SIZE) the situation, 99.99% of the time, will be that 
everything still works but sometimes the outcome is a surprise to the 
user.  Say you  do: 

1  MPI_Barrier (on world) 
2  MPI_Barrier (on world) 
3  other stuff 
4  MPI_Barrier (on world) 
5  if (my rank is even) 
6      sendrecv(with odd neighbor) 
7  else 
8      sendrecv(with even neighbor) 

but get back an error at  all even numbered ranks from the line 1 barrier 
call. The line 2 MPI_Barrier may still "work" but the line 2 barrier at 
even numbered ranks will match the line 1 barrier at odd ranks. Even ranks 
will begin "other stuff" and odd ranks will sit in the line 2 barrier 
until even ranks  finish "other stuff" and reach the line 4 barrier. The 
odd ranks now get through their line 2 barrier and begin other stuff. 

If "other stuff" involves communication among the even ranks and 
communication among the odd ranks, that will work too. The even ranks will 
all send/recv among themselves; later, the odd ranks will all send/recv 
among themselves. 

The even ranks will reach line 6 and hang there because the odd tasks are 
still stuck at line 4.   

In this entire example, libmpi has continued working "correctly" but the 
behavior you get from correct behavior is not what you planned.   
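
The matching described above can be checked with a small simulation. This is a model, not MPI: it assumes the failed line 1 call returns immediately at even ranks without taking part in the barrier, treats "other stuff" as purely local work, and assumes an even number of ranks. The simulate function and its parameter names are made up for this sketch.

```python
def simulate(nranks=4, even_ranks_fail_line1=True):
    """Return, per rank, the source line it is blocked at (None if done)."""

    def program(rank):
        # (source line, operation) pairs taken from the 8-line example.
        exchange_line = 6 if rank % 2 == 0 else 8
        return [(1, "barrier"), (2, "barrier"), (3, "other"),
                (4, "barrier"), (exchange_line, "sendrecv")]

    progs = [program(r) for r in range(nranks)]
    pc = [0] * nranks  # index of the next operation for each rank

    if even_ranks_fail_line1:
        # Assumption: the failed call returns at once and takes no part
        # in the barrier, so even ranks move straight on to line 2.
        for r in range(0, nranks, 2):
            pc[r] = 1

    def op(r):
        return progs[r][pc[r]][1] if pc[r] < len(progs[r]) else "done"

    progressed = True
    while progressed:
        progressed = False
        for r in range(nranks):
            if op(r) == "other":        # local work always completes
                pc[r] += 1
                progressed = True
        if all(op(r) == "barrier" for r in range(nranks)):
            # A barrier completes once every rank is inside some barrier
            # call; the library cannot tell which source line made it.
            for r in range(nranks):
                pc[r] += 1
            progressed = True
        for r in range(0, nranks, 2):   # neighbor pairs (even, even + 1)
            if op(r) == "sendrecv" and op(r + 1) == "sendrecv":
                pc[r] += 1
                pc[r + 1] += 1
                progressed = True

    return [progs[r][pc[r]][0] if pc[r] < len(progs[r]) else None
            for r in range(nranks)]


print(simulate())                             # [6, 4, 6, 4]
print(simulate(even_ranks_fail_line1=False))  # [None, None, None, None]
```

With the error, even ranks end up blocked at line 6 and odd ranks at line 4, even though every individual barrier in the model behaved correctly.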

The situation of MPI state being totally trashed by an error that returns 
a return code barely exists.  The case where it is subtly discombobulated 
is the norm. 

Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363

From: 
Darius Buntinas <buntinas at mcs.anl.gov> 
To: 
"MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" 
<mpi3-ft at lists.mpi-forum.org> 
Date: 
09/22/2010 12:24 PM 
Subject: 
Re: [Mpi3-ft] Defining the state of MPI after an error 
Sent by: 
mpi3-ft-bounces at lists.mpi-forum.org

OK, I (think I) see what you guys are saying, so maybe we should look at 
it this way.  The CANNOT_CONTINUE proposal should not define the operation 
of the MPI implementation after errors other than CANNOT_CONTINUE. 
Instead, it defines that after the implementation gives a CANNOT_CONTINUE 
error, the app knows that the implementation is fatally wedged, and that 
the user should definitely not expect correct operation after this.  I.e., 
we're not labeling other errors as "recoverable," we're just marking 
CANNOT_CONTINUE as "unrecoverable."

Note that an implementation can still be standard compliant even if it 
never returns a CANNOT_CONTINUE error even when it is fatally wedged 
(because operation after any other error is still undefined).

This just defines a way for the implementation to let the user know that 
it has given up.  So that if the implementation provides best-effort 
functionality after an error, and the user has "read the disclaimer" and 
is comfortable with proceeding, this is a way to differentiate between an 
error as a result of a failure that hosed everything, and one that may 
allow things to continue.
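
One way an application might act on that distinction, as a rough sketch. The Rc enum and the next_step function are hypothetical names invented here; only CANNOT_CONTINUE comes from the proposal, and the returned strings are placeholders for whatever the app actually does.

```python
from enum import Enum


class Rc(Enum):
    # Hypothetical return codes, not taken from any MPI header.
    SUCCESS = 0
    ERR_OTHER = 1        # stands in for any error except CANNOT_CONTINUE
    CANNOT_CONTINUE = 2


def next_step(rc, user_accepts_best_effort):
    """Decide what the application does after an MPI call returns rc."""
    if rc is Rc.SUCCESS:
        return "continue"
    if rc is Rc.CANNOT_CONTINUE:
        # The implementation has said it has given up.
        return "stop using MPI"
    # Any other error: the standard still leaves the state undefined.
    # Going on is a bet on implementation-specific best effort.
    if user_accepts_best_effort:
        return "continue at own risk"
    return "stop using MPI"


print(next_step(Rc.ERR_OTHER, user_accepts_best_effort=True))
print(next_step(Rc.CANNOT_CONTINUE, user_accepts_best_effort=True))
```

Because an implementation may never return CANNOT_CONTINUE even when it is wedged, "continue at own risk" is never a guarantee in this sketch, only the absence of an explicit give-up signal.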

We still would like to define what happens to a bcast after a process in 
the communicator fails.  But we leave that for future proposals.

Does this make sense?
-d

On Sep 22, 2010, at 8:43 AM, Terry Dontje wrote:

> Richard Treumann wrote:
>> 
>> This proposal is not a minor change. 
>> 
>> Please do not make this hole in the standard and assume you can later 
>> add language to standardize everything that comes through the hole. 
>> 
>> If the standard is to introduce the notion of a recoverable error it 
>> must be as part of a full description of what "recovery" means. 
>> 
>> I think it is dangerous and ultimately useless to have implementors 
>> mark a failure as "recoverable" when the post-error state of the 
>> distributed MPI has gone from "fully standards compliant" to "mostly 
>> standards compliant, read my user doc, read my legal disclaimer, cross 
>> your fingers". 
>> 
>> See comment below for why I do not think the new hole is needed to 
>> allow people to do implementation-specific recoverability. 
>> 
>> There is not even anything to prevent an implementation from deciding 
>> to add a function MPXX_WHAT_STILL_WORKS(err_code, answer) and documenting 
>> 5 or 5000 enumerated values for "answer" ranging from NOTHING through 
>> TAKE_A_CHANCE_IF_YOU_LIKE to EVERYTHING. 
>> 
>> IBM would probably return TAKE_A_CHANCE_IF_YOU_LIKE because I cannot 
>> imagine how we would promise exactly what will work and what will not, but 
>> in practice most things will still work as expected. 
>> 
> I think I agree with Dick on the above.  Another way of putting the 
> disagreement is that Josh's proposal is too general in that not all 
> error codes can be completely marked as "MPI state is broken" or not.  When 
> Sun implemented fault tolerant client/server we came up with a new error 
> class that, when returned, gave the user the understanding that a condition 
> occurred on a communicator that has rendered the communicator useless and 
> one should clean it up before continuing on.  The point is there was a 
> concrete understanding of the error and what could be done to recover.  As 
> opposed to a general class that says everything is borked or not, which 
> essentially doesn't give you much because you'll end up eventually having 
> to define a more specific class of error IMO.
> 
> --td
>> 
>> 
>> 
>> Dick Treumann  -  MPI Team 
>> IBM Systems & Technology Group
>> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
>> Tele (845) 433-7846         Fax (845) 433-8363
>> 
>> 
>> mpi3-ft-bounces at lists.mpi-forum.org wrote on 09/21/2010 04:54:08 PM:
>> 
>> > Re: [Mpi3-ft] Defining the state of MPI after an error 
>> > 
>> > Bronis R. de Supinski 
>> > 
>> > to: 
>> > 
>> > MPI 3.0 Fault Tolerance and Dynamic Process Control working Group 
>> > 
>> > 09/21/2010 04:59 PM 
>> > 
>> > Sent by: 
>> > 
>> > mpi3-ft-bounces at lists.mpi-forum.org 
>> > 
>> > Please respond to "Bronis R. de Supinski", "MPI 3.0 Fault Tolerance 
>> > and Dynamic Process Control working Group" 
>> > 
>> > 
>> > Dick:
>> > 
>> > Re:
>> > > The current MPI standard does not say the MPI implementation is
>> > > totally broken once there is an error.  Saying MPI state is undefined
>> > > after an error simply says that the detailed semantic of the MPI
>> > > standard can no longer be promised. In other words, after an error you
>> > > leave behind the security of a portable standard semantic.  You are
>> > > operating at your own risk. You do not need to read more than that
>> > > into it.
>> > 
>> > Perhaps my problem with this position is that I come from the
>> > background of language definitions for compilers. When you
>> > read "undefined" in the OpenMP specification then you are
>> > being told that things are broken and the implementation does
>> > not need to do anything or even tell you what they actually do (and
>> > I believe the same is true for the C and C++ standards). An
>> > alternative is "implementation defined", which requires the
>> > implementer to document what they actually do. Without that,
>> > you cannot even rely on actions with a specific implementation
>> > (unless you believe "My tests so far have not failed so I am OK").
>> 
>> 
>> When a standard says behavior is "undefined" in some situation, it 
>> cannot mean behavior is "broken". It cannot mean the implementor is 
>> prohibited from making it still work. It cannot mean the implementor is 
>> prohibited from making certain things work and documenting them. Any 
>> statement like this in a standard would be a definition of behavior and 
>> the behavior would no longer be "undefined". 
>> 
>> The only thing a standard can logically mean by "undefined" is that the 
>> STANDARD no longer mandates the definition. 
>> 
>> Bronis says: 
>> 
>> > 
>> > I strongly feel "undefined" should be reserved for situations that
>> > mean "your program is irrevocably broken and the implementer does
>> > not need to worry about what happens to it after encountering them." 
>> 
>> I would say this as: 
>> 
>> I strongly feel "undefined" should be reserved for situations that mean 
>> "The standard no longer guarantees your program is not irrevocably broken. 
>> The implementer is not required by the standard to worry about what 
>> happens to it after encountering them. An implementation is free to 
>> provide any "better" behavior that may be of value but users cannot assume 
>> another implementation provides similar behavior so cannot assume 
>> standards-defined portability." 
>> 
>> I do not see how the use of the word "undefined" in a standard can be 
>> interpreted as a prohibition of any behavior an implementation might 
>> offer. 
>> 
>> 
>> 
>> 
>> 
>> 
>> _______________________________________________
>> mpi3-ft mailing list
>> 
>> mpi3-ft at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
>> 
>> 
>> 
> 
> 
> -- 
> Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
> Oracle - Performance Technologies
> 95 Network Drive, Burlington, MA 01803
> Email terry.dontje at oracle.com
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

_______________________________________________
mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
