[Mpi3-ft] Defining the state of MPI after an error

Joshua Hursey jjhursey at open-mpi.org
Thu Sep 23 07:54:15 CDT 2010


(Bringing a lot of points together in a single response)

The ticket that we are discussing is linked below (also part of the very first email in this thread):
  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/err_cannot_continue
So we are not talking about some ambiguous 'cloud of smoke', but about the text as it appears there. If there is some specific language that you have a problem with, then we can fix it before it goes before the forum (which is the purpose of this email thread). If you have a problem with the concept of the ticket, then maybe you can advise us on a better rationale or scenario to cover in this proposal.

As for the performance concerns, I am having a really hard time believing that this will impact performance in a substantial way. The vast majority of MPI function calls must already check that the MPI library has been initialized and has not been finalized before calling any lower-level functionality. These two checks (and the other parameter checks) already add a few branches to the calling path. If this proposal is to be seen as a big-hammer approach, then it adds one more branch to the parameter-check block to determine whether the library can or cannot be used. In my opinion the library should already be doing such a check to protect itself from unsafe actions after an error, but the current standard does not require it.
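For concreteness, here is a minimal sketch of what that guard might look like inside an implementation. The state variables and helper name are invented for this sketch (they are not Open MPI internals), and MPI_ERR_CANNOT_CONTINUE is of course the proposed class, not an existing constant:

#include <mpi.h>

/* Hypothetical per-process state inside an implementation; a real library
 * would also have to worry about thread safety. */
static int lib_initialized = 0;
static int lib_finalized   = 0;
static int cannot_continue = 0;   /* set once the library gives up */

/* The entry check most calls effectively perform today, plus the one extra
 * branch this proposal would add. */
static int entry_check(void)
{
    if (!lib_initialized || lib_finalized)
        return MPI_ERR_OTHER;             /* existing init/finalize guard */
    if (cannot_continue)
        return MPI_ERR_CANNOT_CONTINUE;   /* proposed: library is locked out */
    return MPI_SUCCESS;
}

Every public entry point would call such a check before touching any lower-level machinery, which is why the added cost is a single predictable branch.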

As Darius mentioned this is a proposal that defines the upper bound on the space between 'no errors' and 'totally borked'. Where 'no errors' maps to MPI_SUCCESS, and 'totally borked' maps to MPI_ERR_CANNOT_CONTINUE. The MPI implementation can define the behavior in the space between unless the standard mandates some specific behavior in the future. If the implementation only wants to block out some functions (it must document this, as with the current standard) it can use the MPI_ERR_UNSUPPORTED_OPERATION error class.

Again, if the MPI implementation does not return MPI_ERR_CANNOT_CONTINUE on all subsequent operations after returning the original error, then it must document this for application/tool developers. In the current standard there is no standard defined return code that serves this purpose, so it is left to the user to hope for the best. With this proposal the application developer will know that if they use an MPI-3 compliant MPI implementation (and this proposal is included) then:
int ret;
if( MPI_SUCCESS != (ret = MPI_foo()) ) {  /* MPI_foo() stands in for any MPI call */
  if( MPI_Is_continuable() ) {
    /* then we will proceed in an MPI implementation defined manner */
  } else {
    /* All MPI interfaces are blocked from further use */
  }
}
The current standard only allows the 'proceed in an MPI implementation defined manner'. This proposal adds the 'else' case above.
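(One practical note on the example above: since the default error handler on communicators is MPI_ERRORS_ARE_FATAL, an application must first install MPI_ERRORS_RETURN, e.g.

  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

before it will ever see a non-SUCCESS return code to test in the first place.)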

This proposal can be seen as a 'big hammer' type of proposal since it requires either MPI_ERR_CANNOT_CONTINUE or MPI implementation-defined semantics after an error. The basic implementation will lock the application process that received the error out of the MPI library. This is a simple check akin to the initialized/finalized check, with (very) little performance impact. A good quality MPI implementation will go further, as described in the advice that the proposal preserves from the original standard:
--------------
Advice to implementors. A good quality implementation will, to the greatest possible extent, circumscribe the impact of an error, so that normal processing can continue after an error handler was invoked. The implementation documentation will provide information on the possible effect of each class of errors. (End of advice to implementors.)
--------------

In the implementation that I work on (Open MPI), I would advocate that we define sane semantics after returning some error classes under some circumstances. We will document these situations in our user documentation (currently the FAQ), and if enough users like these semantics we might even build up our courage to propose them before the fault tolerance working group and then the full forum. For all those scenarios where we cannot decide on or determine sane semantics after an error class, we need something that we can return to the user letting them know that we do not want to proceed. This proposal introduces the MPI_ERR_CANNOT_CONTINUE error class to serve this purpose.
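To make that concrete, here is a small sketch (not text from the ticket) of how an application might separate the two cases by testing the error class directly instead of using the MPI_Is_continuable() helper from the earlier example; again, MPI_ERR_CANNOT_CONTINUE is the proposed class, not an existing constant:

int ret, eclass;
ret = MPI_foo();                        /* any MPI call, as in the earlier example */
if (ret != MPI_SUCCESS) {
    MPI_Error_class(ret, &eclass);
    if (eclass == MPI_ERR_CANNOT_CONTINUE) {
        /* the library has locked itself out; clean up outside of MPI */
    } else {
        /* implementation-defined semantics; consult the implementation's docs */
    }
}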


This conversation has escalated into a heated debate, which is great since I would rather have this type of discussion in the working group context than in the full forum context. This debate will hopefully make the proposal better (or possibly even kill it). I appreciate everyone's feedback and patience with this thread. Needless to say, this proposal will be a topic for discussion in our next teleconf (on Sept. 29). Hopefully a phone conversation will help us break through some of the confusion. In the meantime we should continue to discuss on the mailing list and hopefully work towards some resolution/clarification.

Cheers,
Josh


On Sep 23, 2010, at 6:37 AM, Graham, Richard L. wrote:

> Thanks for the comments - responses in line
> 
> 
> On 9/23/10 5:45 AM, "Terry Dontje" <terry.dontje at oracle.com> wrote:
> 
> Just saw Dick's email and I guess I will push this rock a little further myself and start attending the FT calls if I can.
> 
> Graham, Richard L. wrote:
> 
> Dick,
>  What are your objections here?  All the current proposal is doing is trying to define a set of consistent return codes; it is not changing anything about MPI.  I am not sure there is sufficient information to act on in all cases with the current error handling in MPI, but I may be wrong on this.  However, there is nothing else that is being proposed at this stage.
> 
> My main objection is that what I have heard discussed does not seem to add value.
> 
> [rich] Yes, on its own it adds little value (though it is not totally useless), but this is not an end in itself.  Both you and Richard missed the context of the MPI Forum - not sure why you could not pick this up from the ether :-)
> 
>  Also, if no changes are made to current implementations, there is a chance that apps will hang, but this is no different than if users set errors to be returned today - which would need to be done to see the effects of the changes.
>  Now, it is fair to ask if this really adds something.  If there is intent to recover from errors, which is what the FT working group is trying to figure out, then this has a lot of value in that it is the venue for an initial discussion of how to extend the error classes, which is really what Josh has been trying to do.  I believe we need an implementation and some app experiments to find what we have missed.
> 
> If this is being proposed just to bootstrap the discussion, then fine - I think, at least within the group, you've gotten some traction here ;-).
> 
> :-)
> 
>  As for the choice of "MPI_ERR_CANNOT_CONTINUE", how is this any different than malloc returning null?
> The problem as I see it is that there might be very few times a library is completely out of commission, and probably fewer times than that when the library actually knows it is completely out of commission.  Now maybe the point here is that this acts as a bridge between the current state of a library that supports MPI 2.2 and when a library supports all of the FT parts of MPI 3.0.  That is, as I mentioned before, all libraries that have questions as to their FT support should automatically start returning MPI_ERR_CANNOT_CONTINUE after most/every error until they can fully support FT.
> 
> [rich] Yes, this is the point.  For the sort of environment I work in, the network stack returns sufficient information so that the library can know what is going on.  The supporting run-times are also used to detect/notify when process fail, so the library can at least detect state.  Responding does involve some changes to the library.  I don't think it is unreasonable to respond with a "non-fatal" error code when resource requests can't be satisfied.
> 
> OK, to be fair, there are probably libraries (OMPI) that do not necessarily handle the loss of a connection very well in certain cases, and you could possibly mark a global to say the MPI library is borked in that case.  However, if I were going to map out supporting MPI 3.0 FT I'd probably implement the whole thing taking these cases into consideration and returning the appropriate error instead of relying on the MPI_ERR_CANNOT_CONTINUE big hammer.
> 
> [rich] Not sure that I would mark global state in this manner.  But this really depends on how far the implementation wants to go just for this change.  I believe the current proposal is a starting point for both the app and the library to start addressing the issues associated with failure.  I expect that once you get into the guts of the library, and try to get beyond the equivalent of one big lock for threading, a pile of implementation issues will come up, and this is where the real value of this proposal is.
> 
> It tells the user that the library is no longer functional, and leaves it to the app to decide how to respond.  Depending on the implementation, there are "error" scenarios that both the app and the MPI library can survive.  Failure of alloc_mem may be such a function.  An app may also decide that sending data to a specific destination is still OK - I can give several real use cases that were brought to us as we were looking into this that would be just fine with this.  Now the collective operations are another question.
>  So, all this proposal is really doing is starting to revive the FT discussion at the Forum level, as a partial implementation is getting to a state where it can be evaluated.  This is really why it is important to understand the specific shortcomings you see in what is being proposed - just the error propagation issues.
> 
> 
> Ok, so I see two issues off the bat mentioned:
> 
> 1.  Will an MPI library really know, or ever want, to throw MPI_ERR_CANNOT_CONTINUE?
>     IMO, I think the answer should be no, in that an MPI library should throw errors that are
>     specific to whether a communicator or a connection to a rank is operable.  Because I think
>     there are no cases (except for bugs in the library) in which a library is so completely borked
>     that nothing can be done.  I'll admit that today that may not be completely true, but in a 3.0
>     world shouldn't it be?
> 
> [rich] I think that experience will guide us here.   I would tend to agree with you, if the library is aiming to recover from failures, but am reluctant to make a sweeping statement w/o more evidence.  Libraries that do not want to support recovery are the candidates for returning such error codes.
> 
> 2.  The point Dick makes about the checking of a global value to determine if a library
>     is borked could mess up cache performance is also something we should be concerned with.
> [rich] Agreed - I would not implement this in that manner.  That is OK for some prototyping, but I would tend to implement things in a manner such that, when a failure occurs, the library finds out about the error deep down in the library, so that the common case does not take the performance hit.  I can actually see how to do this for pt-2-pt communications, but have not thought about collectives or file ops.
> 
> Rich
> 
> --td
> 
> Thanks,
> Rich
> 
> On 9/22/10 6:55 PM, "Richard Treumann" <treumann at us.ibm.com> wrote:
> 
> 
> We are kind of going in circles because the context and rationale for CANNOT_CONTINUE are still too ambiguous.
> 
> My argument is against adding it into the standard first and figuring out later what it means.
> 
> I will wait for the ticket. If the ticket gives a full and convincing specification of what the implementor and the user are to do with it, I will make my judgement based on the whole description.
> 
> If the ticket says "Put this minor change in today and we will decide later what it means", I must lobby the Forum to reject the ticket.
> 
> Note
> 1) All current errors detected by an MPI application map to an existing error class. An error cannot map to two error classes, so if some user error handler is presently checking for MPI_ERR_OP after a non-SUCCESS return from MPI_Reduce, and the implementation moves the return code for passing a bad OP from class MPI_ERR_OP to MPI_ERR_CANNOT_CONTINUE, it has just broken a user code.
> 2) Mandating that every MPI call after an MPI_ERR_CANNOT_CONTINUE must return MPI_ERR_CANNOT_CONTINUE will require that every MPI call check a global flag (resulting in overhead and possible displacement of other data from cache).
> 
> 
> Dick Treumann  -  MPI Team
> IBM Systems & Technology Group
> Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
> Tele (845) 433-7846         Fax (845) 433-8363
> 
> 
> 
> From: Darius Buntinas <buntinas at mcs.anl.gov>
> To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" <mpi3-ft at lists.mpi-forum.org>
> Date: 09/22/2010 05:47 PM
> Subject: Re: [Mpi3-ft] Defining the state of MPI after an error
> Sent by: mpi3-ft-bounces at lists.mpi-forum.org
> ________________________________
> 
> 
> 
> 
> On Sep 22, 2010, at 2:29 PM, Richard Treumann wrote:
> 
> 
> 
> 
> You lost me there - in part, I am saying it is useless because there are almost zero cases in which it would be appropriate.  How does that make it "a minor change"?
> 
> 
> 
> 
> Well I figure we're just adding an error class that the implementation can return to the user if it gives up and can't continue.  That's minor.  Whether or not it's useful is another story :-)
> 
> 
> 
> 
> Can you provide me the precise text you would add to the standard? Exactly how does the CANNOT_CONTINUE work?  Under what conditions does an MPI process see a CANNOT_CONTINUE and what does it mean?
> 
> 
> 
> 
> I don't know yet.  It might be something as simple as adding an entry to the error class table with a description like:
> 
>    Process can no longer perform any MPI operations.  If an MPI operation
>    returns this error class, all subsequent calls to MPI functions will
>    return this error class.
> 
> 
> 
> 
> Please look at the example again.  The point was that there is nothing there that would justify a CANNOT_CONTINUE and MPI is still working correctly. Despite that, the behavior is a mess from the algorithm viewpoint after the error.
> 
> 
> 
> 
> Since we haven't defined what happens in a failed collective yet, consider an implementation that will not continue after a failed collective.  The odd-numbered processes that did not immediately return from the barrier with an error will continue with the barrier protocol (say it's recursive doubling).  Some of the odd processes will need to send messages to some of the even processes.  Upon receiving these messages, the even processes will respond with an I_QUIT message, or perhaps the connection is closed, so the odd processes will get a communication error when trying to send the message.  In either case, the odd processes will notice that something's wrong with the other processes, and return an error.  The second barrier will return a CANNOT_CONTINUE on all of the processes.
> 
> OK, what if the odd processes can't determine that the even processes can't continue?  The odd processes would hang in the first barrier, and the even numbered processes would get a CANNOT_CONTINUE from the second barrier.
> 
> So we either get a hang, or everyone gets a CANNOT_CONTINUE but we avoided the discombobulated scenario.
> 
> -d
> 
> 
> 
> _______________________________________________
> mpi3-ft mailing list
> mpi3-ft at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft
> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey




