<br><font size=2 face="sans-serif">We are kind of going in circles because
the context and rationale for </font><tt><font size=2>CANNOT_CONTINUE</font></tt><font size=2 face="sans-serif">
are still too ambiguous.</font>
<br>
<br><font size=2 face="sans-serif">My argument is against adding it into
the standard first and figuring out later what it means. </font>
<br>
<br><font size=2 face="sans-serif">I will wait for the ticket. If the ticket
gives a full and convincing specification of what the implementor and the
user are to do with it, I will make my judgement based on the whole description.
</font>
<br>
<br><font size=2 face="sans-serif">If the ticket says "Put this minor
change in today and we will decide later what it means," I must lobby the
Forum to reject the ticket.</font>
<br>
<br><font size=2 face="sans-serif">Note</font>
<br><font size=2 face="sans-serif">1) All current errors detected
by an MPI application map to an existing error class. An error cannot map
to two error classes, so if some user error handler is presently checking
for MPI_ERR_OP after a non-SUCCESS return from MPI_Reduce, and the implementation
moves the return code for passing a bad OP from class MPI_ERR_OP to MPI_ERR_CANNOT_CONTINUE,
it has just broken that user code.</font>
<br><font size=2 face="sans-serif">2) Mandating that every MPI call after
an MPI_ERR_CANNOT_CONTINUE must return MPI_ERR_CANNOT_CONTINUE will require
that every MPI call check a global flag (resulting in overhead and
possible displacement of other data from cache).</font>
<br>
<br>
<br><font size=2 face="sans-serif">Dick Treumann - MPI Team
<br>
IBM Systems & Technology Group<br>
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601<br>
Tele (845) 433-7846 Fax (845) 433-8363<br>
</font>
<br>
<br>
<br>
<table width=100%>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">From:</font>
<td><font size=1 face="sans-serif">Darius Buntinas <buntinas@mcs.anl.gov></font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">To:</font>
<td><font size=1 face="sans-serif">"MPI 3.0 Fault Tolerance and Dynamic
Process Control working Group" <mpi3-ft@lists.mpi-forum.org></font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">Date:</font>
<td><font size=1 face="sans-serif">09/22/2010 05:47 PM</font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">Subject:</font>
<td><font size=1 face="sans-serif">Re: [Mpi3-ft] Defining the state of
MPI after an error</font>
<tr valign=top>
<td><font size=1 color=#5f5f5f face="sans-serif">Sent by:</font>
<td><font size=1 face="sans-serif">mpi3-ft-bounces@lists.mpi-forum.org</font></table>
<br>
<hr noshade>
<br>
<br>
<br><tt><font size=2><br>
On Sep 22, 2010, at 2:29 PM, Richard Treumann wrote:<br>
<br>
> <br>
> You lost me there - in part, I am saying it is useless because there
are almost zero cases in which it would be appropriate. How does
that make it "a minor change"? <br>
<br>
Well I figure we're just adding an error class that the implementation
can return to the user if it gives up and can't continue. That's
minor. Whether or not it's useful is another story :-)<br>
<br>
> Can you provide me the precise text you would add to the standard?
Exactly how does the CANNOT_CONTINUE work? Under what conditions
does an MPI process see a CANNOT_CONTINUE and what does it mean? <br>
<br>
I don't know yet. It might be something as simple as adding an entry
to the error class table with a description like:<br>
<br>
Process can no longer perform any MPI operations. If
an MPI operation <br>
returns this error class, all subsequent calls to MPI functions
will <br>
return this error class.<br>
<br>
> Please look at the example again. The point was that there is
nothing there that would justify a CANNOT_CONTINUE and MPI is still working
correctly. Despite that, the behavior is a mess from the algorithm viewpoint
after the error. <br>
<br>
Since we haven't defined what happens in a failed collective yet, consider
an implementation that will not continue after a failed collective. The
odd numbered processes that did not immediately return from barrier with
an error will continue with the barrier protocol (say it's recursive doubling).
Some of the odd processes will need to send messages to some of the
even processes. Upon receiving these messages, the even processes
will respond with an I_QUIT message, or perhaps the connection is closed,
so the odd processes will get a communication error when trying to send
the message. In either case, the odd processes will notice that something's
wrong with the other processes, and return an error. The second barrier
will return a CANNOT_CONTINUE on all of the processes.<br>
<br>
OK, what if the odd processes can't determine that the even processes can't
continue? The odd processes would hang in the first barrier, and
the even numbered processes would get a CANNOT_CONTINUE from the second
barrier.<br>
<br>
So we either get a hang, or everyone gets a CANNOT_CONTINUE but we avoided
the discombobulated scenario.<br>
<br>
-d<br>
<br>
<br>
<br>
_______________________________________________<br>
mpi3-ft mailing list<br>
mpi3-ft@lists.mpi-forum.org<br>
</font></tt><a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft"><tt><font size=2>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft</font></tt></a><tt><font size=2><br>
</font></tt>
<br>
<br>