[Mpi3-ft] Defining the state of MPI after an error

Richard Treumann treumann at us.ibm.com
Thu Sep 23 08:24:54 CDT 2010


Each ticket that comes before the Forum for an up or down vote must stand 
on its own merits. Discussions cannot change the standard. That requires a 
ticket to be written and voted on.

I think no Forum member should vote YES on a ticket unless they think it 
would be an improvement to the MPI standard EVEN IF every other pending 
ticket gets defeated. 

If the concept in the ticket is not an improvement without some broader 
new feature (like FT), there is no reason to approve it until the rest of 
the new feature is ready for a vote. At that time it should be folded into 
the broad proposal that provides the justification.

My position at this point is that it is the responsibility of those who 
want to make a "minor" change to the standard to develop a vote-ready 
ticket and then it becomes the responsibility of the entire Forum to 
debate whether that ticket should be approved.

I am willing to read and comment on a draft of a ticket before it is 
entered into the database but I would like to be assured that the authors 
consider it essentially complete and justified independent of any 
reference to what may be in the FT chapter.


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363

From: "Graham, Richard L." <rlgraham at ornl.gov>
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" 
<mpi3-ft at lists.mpi-forum.org>
Date: 09/23/2010 06:39 AM
Subject: Re: [Mpi3-ft] Defining the state of MPI after an error
Sent by: mpi3-ft-bounces at lists.mpi-forum.org

Thanks for the comments - responses in line

On 9/23/10 5:45 AM, "Terry Dontje" <terry.dontje at oracle.com> wrote:

Just saw Dick's email and I guess I will push this rock a little further 
myself and start attending the FT calls if I can.

Graham, Richard L. wrote:

  What are your objections here ?  All the current proposal is doing is 
trying to define a set of consistent return codes, and is not changing 
anything about MPI.  I am not sure there is sufficient information to act 
on in all cases with the current error handling in MPI, but may be wrong 
on this.  However, there is nothing else that is being proposed at this 

My main objection is that what I heard being discussed does not seem to add 

[rich] Yes, on its own it adds little value, though it is not totally 
useless, but this is not an end in itself.  Both you and Richard missed the 
context of the MPI Forum - not sure why you could not pick this up from the 
ether :-)

  Also, if no changes are made to current implementations, there is a 
chance that apps will hang, but this is no different than if users set 
errors return today, which would need to be done to see the effects of 
the changes.
  Now, it is fair to ask if this really adds something.  If there is 
intent to recover from errors, which is what the FT working group is 
trying to figure out, then this has a lot of value in that it is the venue 
for an initial discussion of how to extend the error classes, which is 
really what Josh has been trying to do.  I believe we need an 
implementation and some app experiments to find what we have missed.

If this is being proposed just to bootstrap the discussion, then fine; I 
think, at least for the group, you've gotten some traction here ;-).


  As for the choice of "MPI_ERR_CANNOT_CONTINUE" how is this any different 
than malloc returning null ?
The problem as I see it is there might be very few times a library is 
completely out of commission and probably fewer times than that when the 
library actually knows it is completely out of commission.  Now maybe the 
point here is that this acts as a bridge between the current state of a 
library that supports MPI 2.2 to when a library supports all of the FT 
parts of MPI 3.0.  That is, as I mentioned before, all libraries that have 
questions as to their FT support should automatically start returning 
MPI_ERR_CANNOT_CONTINUE after most/every error until they can fully 
support FT.

[rich] Yes, this is the point.  For the sort of environment I work in, the 
network stack returns sufficient information so that the library can know 
what is going on.  The supporting run-times are also used to detect/notify 
when processes fail, so the library can at least detect state.  Responding 
does involve some changes to the library.  I don't think it is 
unreasonable to respond with a "non-fatal" error code when resource 
requests can't be satisfied.

Ok to be fair, there are probably libraries (OMPI) that do not necessarily 
handle the loss of a connection very well in certain cases and you could 
possibly mark a global to say the MPI library is borked in that case. 
However, if I was going to map out supporting MPI 3.0 FT I'd probably 
implement the whole thing taking into consideration these cases and 
returning the appropriate error instead of relying on the 

[rich] Not sure that I would mark global state in this manner.  But this 
really depends on how far the implementation wants to go just for this 
change.  I believe the current proposal is a starting point for both the 
app and the library to start addressing the issues associated with 
failure.  I expect that once you get into the guts of the library, and try 
to get beyond the equivalent of one big lock for threading, a pile of 
implementation issues will come up, and this is where the real value of 
this proposal is.

 It tells the user that the library is no longer functional, and leaves it 
to the app to decide how to respond.  Depending on the implementation, 
there are "error" scenarios that both the app and the MPI library can 
survive.  Failure of alloc_mem may be such a scenario.  An app may also 
decide that sending data to a specific destination may also be ok - I can 
give several real use cases that were brought to us as we were looking 
into this that would be just fine with this.  Now the collective 
operations are another question.
  So, all this proposal is really doing is starting to revive the FT 
discussion at the Forum level, as a partial implementation is getting to a 
state where it can be evaluated.  This is really why it is important to 
understand the specific shortcomings you see in what is being proposed - 
just the error propagation issues.

Ok, so I see two issues off the bat mentioned:

1.  Will an MPI library really know or ever want to throw 
     IMO, I think the answer should be no, in that an MPI library should
     throw errors that are specific to whether a communicator or a
     connection to a rank is operable, because I think there are no cases
     (except for bugs in the library) in which a library is so completely
     borked that nothing can be done.  I'll admit that may not be
     completely true today, but in a 3.0 world shouldn't it be?

[rich] I think that experience will guide us here.   I would tend to agree 
with you, if the library is aiming to recover from failures, but am 
reluctant to make a sweeping statement w/o more evidence.  Libraries that 
do not want to support recovery are the candidates for returning such 
error codes.

2.  Dick's point that checking a global value to determine whether a
     library is borked could mess up cache performance is also something
     we should be concerned with.

[rich] Agreed - I would not implement this in that manner.  This is ok for 
some prototyping, but I would tend to implement things in such a way that, 
when failure occurs, the library finds out about the error deep down 
in the library, so that the common case does not take the performance 
hit.  I can actually see how to do this for pt-2-pt communications, but 
have not thought about collectives or file ops.




On 9/22/10 6:55 PM, "Richard Treumann" <treumann at us.ibm.com> wrote:

We are kind of going in circles because the context and rationale for 
CANNOT_CONTINUE are still too ambiguous.

My argument is against adding it into the standard first and figuring out 
later what it means.

I will wait for the ticket. If the ticket gives a full and convincing 
specification of what the implementor and the user are to do with it, I 
will make my judgement based on the whole description.

If the ticket says "Put this minor change in today and we will decide 
later what it means", I must lobby the Forum to reject the ticket.

1)  All current errors detected by an MPI application map to an existing 
error class. An error cannot map to two error classes, so if some user 
error handler is presently checking for MPI_ERR_OP after a non-SUCCESS 
return from MPI_Reduce and the implementation moves the return code for 
passing a bad OP from class MPI_ERR_OP to MPI_ERR_CANNOT_CONTINUE, it has 
just broken user code.
2) Mandating that every MPI call after an MPI_ERR_CANNOT_CONTINUE must 
return MPI_ERR_CANNOT_CONTINUE will require that every MPI call check a 
global flag (resulting in overhead and possible displacement of other data 
from cache).
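
As a rough illustration of point 1, here is a toy model in plain Python 
(no MPI involved). The string constants merely stand in for MPI error 
classes, and reduce_with_bad_op and user_handler are hypothetical names 
invented for this sketch. It only shows how a handler keyed on MPI_ERR_OP 
would stop recognizing the error if the class were changed.

```python
# Toy model only: plain Python, no MPI. The strings stand in for MPI error
# classes; reduce_with_bad_op and user_handler are hypothetical names.
MPI_SUCCESS = "MPI_SUCCESS"
MPI_ERR_OP = "MPI_ERR_OP"
MPI_ERR_CANNOT_CONTINUE = "MPI_ERR_CANNOT_CONTINUE"


def reduce_with_bad_op(reclassify):
    """Pretend MPI_Reduce was called with an invalid op; return an error class."""
    return MPI_ERR_CANNOT_CONTINUE if reclassify else MPI_ERR_OP


def user_handler(error_class):
    """Existing user code that only knows how to recover from MPI_ERR_OP."""
    if error_class == MPI_ERR_OP:
        return "retry with a valid op"
    return "unhandled: abort"
```

With reclassify=False the handler takes its recovery path; with 
reclassify=True the same user code falls through to the abort branch.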

Dick Treumann  -  MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363

From: Darius Buntinas <buntinas at mcs.anl.gov>
To: "MPI 3.0 Fault Tolerance and Dynamic Process Control working Group" 
<mpi3-ft at lists.mpi-forum.org>
Date: 09/22/2010 05:47 PM
Subject: Re: [Mpi3-ft] Defining the state of MPI after an error
Sent by: mpi3-ft-bounces at lists.mpi-forum.org

On Sep 22, 2010, at 2:29 PM, Richard Treumann wrote:

You lost me there - in part, I am saying it is useless because there are 
almost zero cases in which it would be appropriate.  How does that make it 
"a minor change"?

Well I figure we're just adding an error class that the implementation can 
return to the user if it gives up and can't continue.  That's minor. 
Whether or not it's useful is another story :-)

Can you provide me the precise text you would add to the standard? Exactly 
how does the CANNOT_CONTINUE work?  Under what conditions does an MPI 
process see a CANNOT_CONTINUE and what does it mean?

I don't know yet.  It might be something as simple as adding an entry to 
the error class table with a description like:

    Process can no longer perform any MPI operations.  If an MPI operation
    returns this error class, all subsequent calls to MPI functions will
    return this error class.
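
One possible reading of that wording, sketched as a toy model in plain 
Python rather than real MPI. ToyLibrary, its call method, and the 
operation_gives_up parameter are hypothetical names used only for 
illustration. The sketch assumes the "sticky" behavior is implemented with 
a single flag tested on entry to every call, which is the per-call check 
that was raised as a concern earlier in the thread; an implementation 
could detect the state some other way.

```python
# Toy model only: plain Python, no MPI. ToyLibrary and its methods are
# hypothetical; the strings stand in for MPI return codes.
MPI_SUCCESS = "MPI_SUCCESS"
MPI_ERR_CANNOT_CONTINUE = "MPI_ERR_CANNOT_CONTINUE"


class ToyLibrary:
    def __init__(self):
        self.cannot_continue = False  # the flag every call must test
        self.flag_checks = 0          # counts how often the flag is tested

    def call(self, operation_gives_up=False):
        """Model one MPI call; operation_gives_up simulates an unrecoverable failure."""
        self.flag_checks += 1
        if self.cannot_continue:
            # Sticky: once set, every later call returns the same class.
            return MPI_ERR_CANNOT_CONTINUE
        if operation_gives_up:
            self.cannot_continue = True
            return MPI_ERR_CANNOT_CONTINUE
        return MPI_SUCCESS
```

In this model a successful call, a failing call, and one further call 
return MPI_SUCCESS, MPI_ERR_CANNOT_CONTINUE, and MPI_ERR_CANNOT_CONTINUE 
in that order, and the flag is tested three times.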

Please look at the example again.  The point was that there is nothing 
there that would justify a CANNOT_CONTINUE and MPI is still working 
correctly. Despite that, the behavior is a mess from the algorithm 
viewpoint after the error.

Since we haven't defined what happens in a failed collective yet, consider 
an implementation that will not continue after a failed collective.  The 
odd numbered processes that did not immediately return from barrier with 
an error will continue with the barrier protocol (say it's recursive 
doubling).  Some of the odd processes will need to send messages to some 
of the even processes.  Upon receiving these messages, the even processes 
will respond with an I_QUIT message, or perhaps the connection is closed, 
so the odd processes will get a communication error when trying to send 
the message.  In either case, the odd processes will notice that 
something's wrong with the other processes, and return an error.  The 
second barrier will return a CANNOT_CONTINUE on all of the processes.
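
The scenario above can be walked through with a toy model in plain Python 
(no MPI, no real networking). The function name, the power-of-two process 
count, and the rank XOR 2**k pairing are assumptions made for this sketch; 
a real barrier may pair processes differently.

```python
# Toy model only: plain Python, no MPI or real networking. The function name
# and the power-of-two process count are assumptions for illustration.
def first_barrier_outcome(n_procs):
    """Return {rank: outcome} for the first barrier in the scenario above.

    Even ranks have already returned an error and quit. Odd ranks run
    recursive doubling: in round k a rank exchanges with rank XOR 2**k.
    """
    assert n_procs >= 2 and (n_procs & (n_procs - 1)) == 0, "power of two assumed"
    quit_ranks = {r for r in range(n_procs) if r % 2 == 0}
    outcome = {r: "error (returned early)" for r in quit_ranks}
    for rank in range(1, n_procs, 2):
        distance = 1
        while distance < n_procs:
            partner = rank ^ distance
            if partner in quit_ranks:
                # I_QUIT reply or closed connection: the exchange fails.
                outcome[rank] = f"error (partner {partner} quit)"
                break
            distance *= 2
        else:
            outcome[rank] = "completed"
    return outcome
```

With this particular pairing, every odd rank meets an even partner in the 
first round, so all ranks end the first barrier with an error.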

OK, what if the odd processes can't determine that the even processes 
can't continue?  The odd processes would hang in the first barrier, and 
the even numbered processes would get a CANNOT_CONTINUE from the second 

So we either get a hang, or everyone gets a CANNOT_CONTINUE but we avoided 
the discombobulated scenario.


mpi3-ft mailing list
mpi3-ft at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

