<br><font size=2 face="sans-serif">We are kind of going in circles because
the context and rationale for </font><tt><font size=2>CANNOT_CONTINUE</font></tt><font size=2 face="sans-serif">
are still too ambiguous.</font>
<br>
<br><font size=2 face="sans-serif">My argument is against adding it into
the standard first and figuring out later what it means. </font>
<br>
<br><font size=2 face="sans-serif">I will wait for the ticket. If the ticket
gives a full and convincing specification of what the implementor and the
user are to do with it, I will make my judgement based on the whole description.
</font>
<br>
<br><font size=2 face="sans-serif">If the ticket says "Put this minor
change in today and we will decide later what it means," I must lobby the
Forum to reject the ticket.</font>
<br>
<br><font size=2 face="sans-serif">Note</font>
<br><font size=2 face="sans-serif">1) All current errors detected
by an MPI application map to an existing error class. An error cannot map
to two error classes, so if some user error handler is presently checking
for MPI_ERR_OP after a non-SUCCESS return from MPI_Reduce, and the implementation
moves the return code for passing a bad OP from class MPI_ERR_OP to MPI_ERR_CANNOT_CONTINUE,
it has just broken that user code.</font>
<br><font size=2 face="sans-serif">2) Mandating that every MPI call after
an MPI_ERR_CANNOT_CONTINUE must return MPI_ERR_CANNOT_CONTINUE will require
that every MPI call check a global flag (resulting in overhead and
possible displacement of other data from cache).</font>
<br>
<br>
<br><font size=2 face="sans-serif">Dick Treumann - MPI Team
<br>
IBM Systems & Technology Group<br>
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601<br>
Tele (845) 433-7846 Fax (845) 433-8363<br>
</font>
<br>
<br>
<br>
<table width=100%>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">From:</font>
<td><font size=1 face="sans-serif">Darius Buntinas <buntinas@mcs.anl.gov></font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">To:</font>
<td><font size=1 face="sans-serif">"MPI 3.0 Fault Tolerance and Dynamic
Process Control working Group" <mpi3-ft@lists.mpi-forum.org></font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">Date:</font>
<td><font size=1 face="sans-serif">09/22/2010 05:47 PM</font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">Subject:</font>
<td><font size=1 face="sans-serif">Re: [Mpi3-ft] Defining the state of
MPI after an error</font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">Sent by:</font>
<td><font size=1 face="sans-serif">mpi3-ft-bounces@lists.mpi-forum.org</font></table>
<br>
<hr noshade>
<br>
<br>
<br><tt><font size=2><br>
On Sep 22, 2010, at 2:29 PM, Richard Treumann wrote:<br>
<br>
> <br>
> You lost me there - in part, I am saying it is useless because there
are almost zero cases in which it would be appropriate. How does
that make it "a minor change"? <br>
<br>
Well I figure we're just adding an error class that the implementation
can return to the user if it gives up and can't continue. That's
minor. Whether or not it's useful is another story :-)<br>
<br>
> Can you provide me the precise text you would add to the standard?
Exactly how does the CANNOT_CONTINUE work? Under what conditions
does an MPI process see a CANNOT_CONTINUE and what does it mean? <br>
<br>
I don't know yet. It might be something as simple as adding an entry
to the error class table with a description like:<br>
<br>
Process can no longer perform any MPI operations. If
an MPI operation <br>
returns this error class, all subsequent calls to MPI functions
will <br>
return this error class.<br>
<br>
> Please look at the example again. The point was that there is
nothing there that would justify a CANNOT_CONTINUE and MPI is still working
correctly. Despite that, the behavior is a mess from the algorithm viewpoint
after the error. <br>
<br>
Since we haven't defined what happens in a failed collective yet, consider
an implementation that will not continue after a failed collective. The
odd numbered processes that did not immediately return from barrier with
an error will continue with the barrier protocol (say it's recursive doubling).
Some of the odd processes will need to send messages to some of the
even processes. Upon receiving these messages, the even processes
will respond with an I_QUIT message, or perhaps the connection is closed,
so the odd processes will get a communication error when trying to send
the message. In either case, the odd processes will notice that something's
wrong with the other processes, and return an error. The second barrier
will return a CANNOT_CONTINUE on all of the processes.<br>
<br>
OK, what if the odd processes can't determine that the even processes can't
continue? The odd processes would hang in the first barrier, and
the even numbered processes would get a CANNOT_CONTINUE from the second
barrier.<br>
<br>
So we either get a hang, or everyone gets a CANNOT_CONTINUE but we avoided
the discombobulated scenario.<br>
<br>
-d<br>
<br>
<br>
<br>
_______________________________________________<br>
mpi3-ft mailing list<br>
mpi3-ft@lists.mpi-forum.org<br>
</font></tt><a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft"><tt><font size=2>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</font></tt></a><tt><font size=2><br>
</font></tt>
<br>
<br>