<br><font size=2 face="sans-serif">Josh & the FT Team</font>

<br>

<br><font size=2 face="sans-serif">Thanks for this decision to suspend

this ticket for now.</font>

<br>

<br><font size=2 face="sans-serif">I am not in any position to judge whether

the idea can be part of an integrated approach to Fault Tolerance.  If

it, or something similar, comes up in the context of an FT chapter and

is well defined in that context, I would not expect to have any objection.</font>

<br>

<br><font size=2 face="sans-serif">I expect the discussion of any FT chapter

brought to the Forum as a whole to be difficult but I hope we are generally

able to see it as a whole that can be accepted as ready for a standards

document. </font>

<br>

<br><font size=2 face="sans-serif">If not that, I hope it can be recognized

as broadly promising but not mature enough to standardize, and moved to

the Journal of R&D as a proposal that people can implement and refine

before a later decision to standardize.</font>

<br>

<br><font size=2 face="sans-serif">           

   Dick </font>

<br>

<br><font size=2 face="sans-serif">Dick Treumann  -  MPI Team

          <br>

IBM Systems & Technology Group<br>

Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601<br>

Tele (845) 433-7846         Fax (845) 433-8363<br>

</font>

<br>

<br>

<br>

<table width=100%>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">From:</font>

<td><font size=1 face="sans-serif">Joshua Hursey <jjhursey@open-mpi.org></font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">To:</font>

<td><font size=1 face="sans-serif">"MPI 3.0 Fault Tolerance and Dynamic

Process Control working Group" <mpi3-ft@lists.mpi-forum.org></font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Date:</font>

<td><font size=1 face="sans-serif">09/29/2010 01:30 PM</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Subject:</font>

<td><font size=1 face="sans-serif">Re: [Mpi3-ft] Defining the state of

MPI after an error</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Sent by:</font>

<td><font size=1 face="sans-serif">mpi3-ft-bounces@lists.mpi-forum.org</font></table>

<br>

<hr noshade>

<br>

<br>

<br><tt><font size=2>It was the feeling of the group during the teleconf

that this proposal should be suspended for the time being. I drafted a

'proposal resolution' rationale distilled from all of the discussion so

far and included it with the proposed text on the wiki. If you feel that

something is missing or misrepresented in the resolution text, please let

me know. This resolution is the text that we will point the supporters

of this idea to when they ask about its status.<br>

<br>

The text can be found at the following link:<br>

  </font></tt><a href="https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/err_cannot_continue#ProposalResolution"><tt><font size=2>https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/err_cannot_continue#ProposalResolution</font></tt></a><tt><font size=2><br>

<br>

-- Josh<br>

<br>

On Sep 23, 2010, at 3:54 PM, Joshua Hursey wrote:<br>

<br>

> <br>

> On Sep 23, 2010, at 10:40 AM, Richard Treumann wrote:<br>

> <br>

>> <br>

>> A few quick observations:<br>

>> <br>

>> 0)<br>

>> The constant is MPI_ERRORS_ARE_FATAL, not MPI_ERRORS_ABORT<br>

> <br>

> Fixed. Thanks<br>

> <br>

>> <br>

>> 1)<br>

>> The MPI standard only mandates one return code, MPI_SUCCESS. All

other return codes are implementation specific and non-portable.  For

portability, MPI documents  error classes and a query function that

is passed an implementation defined return code and returns the class.<br>

>> <br>

>> Assume I allow tags between 0 and 2**15. As an MPI implementor,

I am free to use return code 215 for a negative tag and 399 for one that

is above 2**15.  The error message I print for 215 and the error message

I print for 399 can be different. If the user calls MPI_ERROR_CLASS() with

either 215 or 399 I give back the class MPI_ERR_TAG.  The user

who checks the RC of a call to see if it is == MPI_ERR_TAG has written

non-portable code.<br>

>> <br>

>> If I decide return codes 215 and 399 must be in class MPI_ERR_CANNOT_CONTINUE

they can no longer be in class MPI_ERR_TAG.<br>

> <br>

> I understand, that is why MPI_ERR_CANNOT_CONTINUE is an error class.

We are not defining a new error code, but a new error class.<br>
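To make the code/class distinction concrete, here is a minimal sketch in plain C that does not use a real MPI library. Every toy_ and TOY_ name is hypothetical: toy_error_class stands in for MPI_ERROR_CLASS, the codes 215 and 399 are the ones from the example above, and the numeric class values are made up.<br>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portable error classes (stand-ins for MPI_SUCCESS, MPI_ERR_TAG, ...). */
enum toy_class { TOY_SUCCESS = 0, TOY_ERR_TAG = 4, TOY_ERR_OTHER = 15 };

/* Implementation-defined return codes, as in the 215 / 399 example. */
enum { TOY_CODE_TAG_NEGATIVE = 215, TOY_CODE_TAG_TOO_LARGE = 399 };

#define TOY_TAG_UB (1 << 15) /* assumed largest legal tag */

/* Stand-in for a send call that validates its tag argument. */
int toy_send(int tag) {
    if (tag < 0) return TOY_CODE_TAG_NEGATIVE;
    if (tag > TOY_TAG_UB) return TOY_CODE_TAG_TOO_LARGE;
    return TOY_SUCCESS;
}

/* Stand-in for MPI_ERROR_CLASS: maps an implementation code to its class. */
int toy_error_class(int code, int *errclass) {
    assert(errclass != NULL);
    switch (code) {
    case TOY_SUCCESS:
        *errclass = TOY_SUCCESS;
        break;
    case TOY_ERR_TAG:            /* a class is itself a valid code */
    case TOY_CODE_TAG_NEGATIVE:
    case TOY_CODE_TAG_TOO_LARGE:
        *errclass = TOY_ERR_TAG;
        break;
    default:
        *errclass = TOY_ERR_OTHER;
        break;
    }
    return TOY_SUCCESS;
}

/* Portable check: compare the class, never the raw return code. */
int is_tag_error(int rc) {
    int errclass;
    toy_error_class(rc, &errclass);
    return errclass == TOY_ERR_TAG;
}
```

In this sketch, comparing the raw return code of toy_send(-1) against TOY_ERR_TAG fails, because the code is 215, while is_tag_error reports a tag error for both 215 and 399.<br>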

> <br>

>> <br>

>> <br>

>> 2)<br>

>> The MPI standard avoids mandating specific error checks.  It

identifies a lot of errors and in many cases, says what error class that

error is in.  It does not say an implementation MUST detect the error.

I would not violate the standard by skipping the check of whether MPI is

initialized. My customers may demand it but the standard does not.  You

are introducing a mandate for one specific sort of error.<br>

> <br>

> Yes that is covered in the opening paragraph of 8.3. We are not mandating

that an implementation must detect errors, but that if it does detect an

error it must return an error code in an error class. So no change in this

regard from the existing standard. This proposal adds that if the MPI implementation

decides that it cannot continue after it detects the original error then

it must return an error code in the MPI_ERR_CANNOT_CONTINUE error class

on subsequent MPI calls. So I guess you could say that it must detect that

it is itself in a self-defined erroneous state in order to return MPI_ERR_CANNOT_CONTINUE.<br>

> <br>

> On a related note, the standard language does not specify how an implementation

goes about implementing the MPI_ERR_CANNOT_CONTINUE condition. The  prototype

will likely check this value upon entry to every function call. A snazzy

implementation may replace all the MPI function calls by function pointers

to a dummy function that only returns MPI_ERR_CANNOT_CONTINUE. Resetting

these function calls may occur in the error handler at the time of the original

error. This technique would not require a check each time one enters the

library. One could imagine other techniques that would equally avoid the

overhead of checking this global variable.<br>
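The function-pointer technique described above could look roughly like the following sketch. It is plain C with no real MPI library behind it, and all names and numeric values are hypothetical. The trigger (a negative tag that is treated as corrupting internal state) is contrived purely to have something that flips the library into the cannot-continue state.<br>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class values; the number chosen for CANNOT_CONTINUE is made up. */
enum { TOY_SUCCESS = 0, TOY_ERR_INTERN = 16, TOY_ERR_CANNOT_CONTINUE = 99 };

typedef int (*toy_send_fn)(int tag);

static int toy_send_real(int tag);
static int toy_send_dead(int tag);

/* One-entry dispatch table; a full library would hold one pointer per call. */
static toy_send_fn toy_send_impl = toy_send_real;

/* Run once, at the time of the original error: reroute later calls. */
static void toy_enter_cannot_continue(void) {
    toy_send_impl = toy_send_dead;
}

/* Normal path. It contains no "am I still usable?" check. */
static int toy_send_real(int tag) {
    if (tag < 0) {
        /* Contrived assumption: this error leaves the library unusable. */
        toy_enter_cannot_continue();
        return TOY_ERR_INTERN; /* the original error is still reported */
    }
    return TOY_SUCCESS;
}

/* Dummy installed after the original error. */
static int toy_send_dead(int tag) {
    (void)tag;
    return TOY_ERR_CANNOT_CONTINUE;
}

/* Public entry point: a single indirect call, no flag test. */
int toy_send(int tag) {
    assert(toy_send_impl != NULL);
    return toy_send_impl(tag);
}
```

The cost of entering the erroneous state is paid once, when the pointer is swapped. The simpler alternative of testing a global flag at the top of every call pays a small cost on every entry instead.<br>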

> <br>

>> <br>

>> 3)<br>

>> I am convinced that the intent of the standard is to require MPI_ERROR_CLASS,

MPI_ERROR_STRING and MPI_ABORT to work after an ERRORS_RETURN. If this

is insufficiently clear, it should probably be addressed in a stand alone

ticket.  (it is certainly possible for an error (detected or not)

to trash internal state and for that to make one of these three unusable

but that applies to every MPI call. The standard does not say MPI_Send

must work even if state was scrambled by a wild store).  I do not

know if anybody assumed MPI_INITIALIZED and MPI_FINALIZED must work after

an error. I see no harm in requiring it.<br>

> <br>

> It is insufficiently clear (to me at least) from the current standard

if these functions are able to be used after an error. During the plenary

session last week, it was mentioned by a number of people that this clarification

should be part of the formal proposal for this new error class. We can

probably pull it out into a separate ticket, but they need to be clarified

as part of this ticket anyway.<br>

> <br>

>> <br>

>> 5)<br>

>> Finally - I do not see that the ticket does anything useful.  In

particular, it does not provide any portability improvements I can see.<br>

>> <br>

>> The MPI implementation could offer a TIMID vs ADVENTUROUS switch

(environment variable)<br>

>> <br>

>> TIMID - MPI query functions like MPI_COMM_SIZE and MPI_ALLOC_MEM

do not trigger CANNOT_CONTINUE but every other error does.<br>

>> <br>

>> ADVENTUROUS - no error triggers CANNOT_CONTINUE.<br>

>> <br>

>> The default would probably need to be TIMID because if the default

were ADVENTUROUS, it would open the implementor to an accusation of failing

to protect the customer. There can be no such accusation now because the

standard does not imply the implementation should protect the customer.<br>

>> <br>

>> I have no clue from the ticket what would be a reasonable or portable

middle ground.    I see the proposal as harmful because any attempt

to use it will produce an illusion of portability when implementors try

to find a middle ground without guidance from the standard.<br>
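As an illustration of the TIMID / ADVENTUROUS switch sketched in point 5, the policy could be written as below. This is a hypothetical sketch: no MPI implementation is known to offer such a switch, and the type and function names are invented for the example.<br>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_TIMID, TOY_ADVENTUROUS } toy_mode;

/* Parse the value of a (hypothetical) environment variable.
 * Unset or unrecognized values fall back to TIMID, the safer default. */
toy_mode toy_parse_mode(const char *value) {
    if (value != NULL && strcmp(value, "ADVENTUROUS") == 0)
        return TOY_ADVENTUROUS;
    return TOY_TIMID;
}

/* Decide whether a detected error moves the library into the
 * cannot-continue state. in_query_call is nonzero for calls that the
 * TIMID mode exempts (the query functions mentioned in the text). */
int toy_error_is_terminal(toy_mode mode, int in_query_call) {
    assert(mode == TOY_TIMID || mode == TOY_ADVENTUROUS);
    if (mode == TOY_ADVENTUROUS)
        return 0;              /* no error triggers CANNOT_CONTINUE */
    return !in_query_call;     /* TIMID: every non-query error does */
}
```

A library might initialize the mode once at startup with something like toy_parse_mode(getenv("TOY_MPI_ERROR_MODE")), where the variable name is again an assumption. The sketch covers only the two extremes, which is the point being made: the ticket gives no guidance on what a portable middle ground would be.<br>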

> <br>

> This thread has belabored the point that it is difficult to standardize

the middle ground. The expected behavior after an error is returned will

likely depend on the error class, MPI call, usage scenario, and, likely,

how the call is implemented. For this reason it is difficult to standardize

the behavior after any specific error class. If an implementation finds

some scenarios where it wishes to be adventurous and continue working then

it is allowed to do so, as long as it documents this behavior.<br>

> <br>

> But what should the MPI implementation do if it finds that it cannot

continue working properly any more? It should probably return some type

of error to the user letting them know that it has stopped functioning.

The current standard says nothing about what should be returned in this

scenario. This proposal adds the MPI_ERR_CANNOT_CONTINUE error class to

fill this gap.<br>

> <br>

> With this new error class the user can now know that they should expect

MPI_ERR_CANNOT_CONTINUE from all future calls unless the MPI library is

attempting to recover. So if they detect a different error class or MPI_SUCCESS

then they know the implementation is up to something and to proceed carefully

in an implementation specific manner. With the current standard, one implementation

may use MPI_ERR_OTHER as the return class for future calls where another

implementation may use MPI_ERR_UNKNOWN and still another MPI_ERR_INTERN.

So an application is unable to determine if the MPI_ERR_{OTHER|UNKNOWN|INTERN}

was because of their use of the interface, or because of the library ceasing

to function properly. Instead of mandating that one of the existing error

classes fill this role, we introduced the new error class. Now the application

can use this as a standardized red flag marking the limits of the ability

of the MPI library to function correctly.<br>
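From the application side, the "red flag" usage described above might be sketched as follows. The class values are hypothetical (MPI_ERR_CANNOT_CONTINUE exists only in this proposal), and the array of classes stands in for what the application would obtain by passing each return code through MPI_ERROR_CLASS.<br>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class values; the number for CANNOT_CONTINUE is made up. */
enum { TOY_SUCCESS = 0, TOY_ERR_OTHER = 15, TOY_ERR_CANNOT_CONTINUE = 99 };

typedef enum { APP_PROCEED, APP_HANDLE_ERROR, APP_STOP_USING_MPI } app_action;

/* Map the error class of a completed call to what the application does next. */
app_action app_classify(int errclass) {
    switch (errclass) {
    case TOY_SUCCESS:
        return APP_PROCEED;
    case TOY_ERR_CANNOT_CONTINUE:
        return APP_STOP_USING_MPI;  /* the library has declared itself unusable */
    default:
        return APP_HANDLE_ERROR;    /* an ordinary error in some other class */
    }
}

/* Scan the classes returned by successive calls and return the index of
 * the first call that raised the red flag, or n if none did. */
size_t app_first_red_flag(const int *classes, size_t n) {
    assert(classes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (app_classify(classes[i]) == APP_STOP_USING_MPI)
            return i;
    }
    return n;
}
```

Without a dedicated class, the default branch would have to cover both "my call was wrong" and "the library has stopped working", which is the ambiguity among MPI_ERR_OTHER, MPI_ERR_UNKNOWN and MPI_ERR_INTERN described above.<br>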

> <br>

> <br>

> I acknowledge that this is a small advancement of the standard, but

it provides at least some bound on the undefined behavior. So now there

is a middle ground to speak of, instead of just everything else after the

error.<br>

> <br>

> -- Josh<br>

> <br>

> <br>

>> <br>

>> Dick Treumann  -  MPI Team<br>

>> IBM Systems & Technology Group<br>

>> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601<br>

>> Tele (845) 433-7846         Fax (845) 433-8363<br>

>> <br>

>> <br>

>> <br>

>> From:   Joshua Hursey <jjhursey@open-mpi.org><br>

>> To:     "MPI 3.0 Fault Tolerance and Dynamic Process

Control working Group" <mpi3-ft@lists.mpi-forum.org><br>

>> Date:   09/23/2010 08:57 AM<br>

>> Subject:        Re: [Mpi3-ft] Defining the

state of MPI after an error<br>

>> Sent by:        mpi3-ft-bounces@lists.mpi-forum.org<br>

>> <br>

>> ________________________________<br>

>> <br>

>> <br>

>> <br>

>> (Bringing a lot of points together in a single response)<br>

>> <br>

>> The ticket that we are discussing is linked below (also part of

the very first email in this thread):<br>

>> </font></tt><a href="https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/err_cannot_continue"><tt><font size=2>https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/err_cannot_continue</font></tt></a><tt><font size=2><br>

>> <br>

>> < snip ><br>

>> <br>

>> I deleted the discussion because only the ticket counts now.<br>

>> <br>

>> <br>


> <br>

> ------------------------------------<br>

> Joshua Hursey<br>

> Postdoctoral Research Associate<br>

> Oak Ridge National Laboratory<br>

> </font></tt><a href=http://www.cs.indiana.edu/~jjhursey><tt><font size=2>http://www.cs.indiana.edu/~jjhursey</font></tt></a><tt><font size=2><br>

> <br>

> <br>

> _______________________________________________<br>

> mpi3-ft mailing list<br>

> mpi3-ft@lists.mpi-forum.org<br>

> </font></tt><a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft"><tt><font size=2>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</font></tt></a><tt><font size=2><br>

> <br>

<br>

------------------------------------<br>

Joshua Hursey<br>

Postdoctoral Research Associate<br>

Oak Ridge National Laboratory<br>

</font></tt><a href=http://www.cs.indiana.edu/~jjhursey><tt><font size=2>http://www.cs.indiana.edu/~jjhursey</font></tt></a><tt><font size=2><br>

<br>

<br>


</font></tt>

<br>

<br>