[Mpi3-ft] Defining the state of MPI after an error
treumann at us.ibm.com
Wed Sep 22 17:55:04 CDT 2010
We are kind of going in circles because the context and rationale for
CANNOT_CONTINUE are still too ambiguous.
My argument is against adding it into the standard first and figuring out
later what it means.
I will wait for the ticket. If the ticket gives a full and convincing
specification of what the implementor and the user are to do with it, I
will make my judgement based on the whole description.
If the ticket says "Put this minor change in today and we will decide
later what it means", I must lobby the Forum to reject the ticket.
1) All current errors detected by an MPI application map to an existing
error class. An error cannot map to two error classes, so if some user
error handler is presently checking for MPI_ERR_OP after a non-SUCCESS
return from MPI_Reduce, and the implementation moves the return code for
passing a bad OP from class MPI_ERR_OP to MPI_ERR_CANNOT_CONTINUE, it has
just broken user code.
2) Mandating that every MPI call after an MPI_ERR_CANNOT_CONTINUE must
return MPI_ERR_CANNOT_CONTINUE will require that every MPI call check a
global flag (resulting in overhead and possible displacement of other data
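
To make point 1 concrete, here is a toy Python model, not MPI code. The numeric class values and the functions reduce_with_bad_op and user_handler are hypothetical stand-ins. Only the names MPI_ERR_OP and MPI_Reduce come from the standard, and MPI_ERR_CANNOT_CONTINUE is the proposed class under discussion. In a real application the check would be made on the class obtained from MPI_Error_class.

```python
# Toy model only: class values are arbitrary stand-ins, not those of any MPI library.
MPI_SUCCESS = 0
MPI_ERR_OP = 1                # existing class for an invalid reduction operation
MPI_ERR_CANNOT_CONTINUE = 2   # proposed class under discussion


def reduce_with_bad_op(remapped):
    """Stand-in for MPI_Reduce called with an invalid op; returns exactly one class."""
    return MPI_ERR_CANNOT_CONTINUE if remapped else MPI_ERR_OP


def user_handler(error_class):
    """Existing user code that only knows how to recover from a bad op."""
    if error_class == MPI_ERR_OP:
        return "recovered: retry with a valid op"
    return "unrecognized error: abort"


# Same user mistake, same handler; only the implementation's mapping changes.
before = user_handler(reduce_with_bad_op(remapped=False))
after = user_handler(reduce_with_bad_op(remapped=True))
print(before)
print(after)
```

The handler itself is unchanged between the two calls. The recovery path stops being taken only because the error moved to a different class.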
Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846 Fax (845) 433-8363
From: Darius Buntinas <buntinas at mcs.anl.gov>
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working Group"
<mpi3-ft at lists.mpi-forum.org>
Date: 09/22/2010 05:47 PM
Subject: Re: [Mpi3-ft] Defining the state of MPI after an error
Sent by: mpi3-ft-bounces at lists.mpi-forum.org
On Sep 22, 2010, at 2:29 PM, Richard Treumann wrote:
> You lost me there - in part, I am saying it is useless because there are
> almost zero cases in which it would be appropriate. How does that make it
> "a minor change"?
Well I figure we're just adding an error class that the implementation can
return to the user if it gives up and can't continue. That's minor.
Whether or not it's useful is another story :-)
> Can you provide me the precise text you would add to the standard?
> Exactly how does the CANNOT_CONTINUE work? Under what conditions does an
> MPI process see a CANNOT_CONTINUE and what does it mean?
I don't know yet. It might be something as simple as adding an entry to
the error class table with a description like:
    Process can no longer perform any MPI operations. If an MPI operation
    returns this error class, all subsequent calls to MPI functions will
    return this error class.
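
Read literally, that description amounts to a latch: once set, it is consulted on every later call and never cleared. A toy Python sketch of that reading follows. The ToyMPI class, its call method, and the numeric values are hypothetical stand-ins for illustration, not part of any MPI library or of the proposal's text.

```python
# Toy model only: a stand-in for an MPI library, not a real implementation.
MPI_SUCCESS = 0
MPI_ERR_CANNOT_CONTINUE = 2   # proposed class; the value is an arbitrary stand-in


class ToyMPI:
    def __init__(self):
        self.cannot_continue = False   # per-process "given up" flag

    def call(self, operation_fails_fatally=False):
        """Stand-in for any MPI function entry point."""
        if self.cannot_continue:           # consulted on every single call
            return MPI_ERR_CANNOT_CONTINUE
        if operation_fails_fatally:
            self.cannot_continue = True    # latch: set once, never cleared
            return MPI_ERR_CANNOT_CONTINUE
        return MPI_SUCCESS


mpi = ToyMPI()
results = [
    mpi.call(),                               # works normally
    mpi.call(operation_fails_fatally=True),   # implementation gives up here
    mpi.call(),                               # every later call reports the same class
    mpi.call(),
]
print(results)
```

The flag test at the top of call is the per-call check that a mandate of this kind would imply for every entry point.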
> Please look at the example again. The point was that there is nothing
> there that would justify a CANNOT_CONTINUE and MPI is still working
> correctly. Despite that, the behavior is a mess from the algorithm
> viewpoint after the error.
Since we haven't defined what happens in a failed collective yet, consider
an implementation that will not continue after a failed collective. The
odd numbered processes that did not immediately return from barrier with
an error will continue with the barrier protocol (say it's recursive
doubling). Some of the odd processes will need to send messages to some
of the even processes. Upon receiving these messages, the even processes
will respond with an I_QUIT message, or perhaps the connection is closed,
so the odd processes will get a communication error when trying to send
the message. In either case, the odd processes will notice that
something's wrong with the other processes, and return an error. The
second barrier will return a CANNOT_CONTINUE on all of the processes.
OK, what if the odd processes can't determine that the even processes
can't continue? The odd processes would hang in the first barrier, and
the even numbered processes would get a CANNOT_CONTINUE from the second
barrier.
So we either get a hang, or everyone gets a CANNOT_CONTINUE, but we avoid
the discombobulated scenario.
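
The two outcomes can be walked through with a toy Python model. The example being discussed is not quoted in this message, so the setup here is an assumption: the even ranks return an error from the first barrier at once and refuse to continue, the process count is a power of two, and the barrier uses recursive doubling. The function names and the outcome strings are hypothetical.

```python
# Toy model only: ranks are integers and "messages" are set lookups, not real MPI.
def first_barrier(nprocs, can_detect_quit):
    """Per-rank outcome of barrier 1 when the even ranks fail at once and quit."""
    quit_ranks = {r for r in range(nprocs) if r % 2 == 0}
    outcome = {r: "error" for r in quit_ranks}
    rounds = nprocs.bit_length() - 1        # log2(nprocs), assuming a power of two
    for rank in range(1, nprocs, 2):
        outcome[rank] = "ok"
        for k in range(rounds):
            partner = rank ^ (1 << k)       # recursive-doubling partner in round k
            if partner in quit_ranks:
                # An I_QUIT reply or a closed connection yields an error;
                # with no way to notice, the rank waits forever.
                outcome[rank] = "error" if can_detect_quit else "hang"
                break
    return outcome


def second_barrier(first):
    """Ranks that saw an error have given up; hung ranks never get this far."""
    result = {"error": "CANNOT_CONTINUE", "hang": "never reached", "ok": "enters barrier"}
    return {rank: result[state] for rank, state in first.items()}


detectable = second_barrier(first_barrier(4, can_detect_quit=True))
undetectable = second_barrier(first_barrier(4, can_detect_quit=False))
print(detectable)
print(undetectable)
```

In this model every odd rank meets an even partner in round 0, so either all ranks end at CANNOT_CONTINUE or the odd ranks hang in the first barrier.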