[Mpi3-ft] Defining the state of MPI after an error

Wed Sep 22 16:46:35 CDT 2010

On Sep 22, 2010, at 2:29 PM, Richard Treumann wrote:

> 
> You lost me there - in part, I am saying it is useless because there are almost zero cases in which it would be appropriate.  How does that make it "a minor change"? 

Well, I figure we're just adding an error class that the implementation can return to the user if it gives up and can't continue.  That's minor.  Whether or not it's useful is another story :-)

> Can you provide me the precise text you would add to the standard? Exactly how does the CANNOT_CONTINUE work?  Under what conditions does an MPI process see a CANNOT_CONTINUE and what does it mean? 

I don't know yet.  It might be something as simple as adding an entry to the error class table with a description like:

    Process can no longer perform any MPI operations.  If an MPI operation 
    returns this error class, all subsequent calls to MPI functions will 
    return this error class.
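
To make the "sticky" part of that description concrete, here is a rough sketch in Python.  It is a toy model, not MPI: the class ToyProcess, its call() method, and the constant ERR_CANNOT_CONTINUE are names made up for illustration, and nothing like them exists in the standard today.

```python
MPI_SUCCESS = 0
ERR_CANNOT_CONTINUE = 1  # hypothetical error class, not part of any MPI standard


class ToyProcess:
    """Toy stand-in for one MPI process (illustration only)."""

    def __init__(self):
        self.gave_up = False  # set once the implementation gives up

    def call(self, implementation_gives_up=False):
        """Model any MPI function call.

        implementation_gives_up is a hypothetical switch meaning "during
        this call the implementation decided it cannot continue".
        """
        if self.gave_up or implementation_gives_up:
            # Once returned, the error class is returned by every later call.
            self.gave_up = True
            return ERR_CANNOT_CONTINUE
        return MPI_SUCCESS


p = ToyProcess()
first = p.call()                               # normal operation
second = p.call(implementation_gives_up=True)  # implementation gives up here
third = p.call()                               # any later call gets the same class
```

The only point is the last call: once the class has been returned, nothing the process does later can succeed.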

> Please look at the example again.  The point was that there is nothing there that would justify a CANNOT_CONTINUE and MPI is still working correctly. Despite that, the behavior is a mess from the algorithm viewpoint after the error. 

Since we haven't defined what happens in a failed collective yet, consider an implementation that will not continue after a failed collective.  The odd-numbered processes that did not immediately return from the barrier with an error will continue with the barrier protocol (say it's recursive doubling).  Some of the odd processes will need to send messages to some of the even processes.  Upon receiving these messages, the even processes will respond with an I_QUIT message, or perhaps the connection is closed, so the odd processes will get a communication error when trying to send the message.  In either case, the odd processes will notice that something's wrong with the other processes, and return an error.  The second barrier will return a CANNOT_CONTINUE on all of the processes.

OK, what if the odd processes can't determine that the even processes can't continue?  The odd processes would hang in the first barrier, and the even numbered processes would get a CANNOT_CONTINUE from the second barrier.

So we either get a hang or everyone gets a CANNOT_CONTINUE, but either way we avoid the discombobulated scenario.
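
The two cases can be written down as a small Python model.  This is only a sketch under my own assumptions: the even ranks are the ones whose first barrier failed immediately, the first recursive-doubling partner of a rank is rank XOR 1, and the function simulate() and the Outcome names are invented for illustration rather than taken from any implementation.

```python
from enum import Enum


class Outcome(Enum):
    ERROR = "error"                      # barrier returned an ordinary error
    CANNOT_CONTINUE = "cannot_continue"  # hypothetical new error class
    HANG = "hang"                        # call never returns
    NOT_REACHED = "not_reached"          # process never gets to this call


def simulate(nprocs, failure_detectable):
    """Return {rank: (first_barrier, second_barrier)} for the toy scenario.

    failure_detectable models whether an odd rank can tell that its even
    partner has given up (I_QUIT reply or a closed connection).
    """
    results = {}
    for rank in range(nprocs):
        if rank % 2 == 0:
            # Even ranks: the first barrier fails right away and the
            # implementation gives up, so the second barrier cannot run.
            results[rank] = (Outcome.ERROR, Outcome.CANNOT_CONTINUE)
            continue
        partner = rank ^ 1  # first-round partner of an odd rank is even
        assert partner % 2 == 0
        if failure_detectable:
            # The odd rank notices the partner is gone, fails the first
            # barrier, and gives up as well.
            results[rank] = (Outcome.ERROR, Outcome.CANNOT_CONTINUE)
        else:
            # The odd rank waits forever for a partner that has quit.
            results[rank] = (Outcome.HANG, Outcome.NOT_REACHED)
    return results


detectable = simulate(4, failure_detectable=True)
undetectable = simulate(4, failure_detectable=False)
```

In the detectable run every rank ends with CANNOT_CONTINUE from the second barrier; in the undetectable run the odd ranks never leave the first barrier.  Neither run has some ranks happily proceeding while others are a collective behind.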

-d