[Mpi3-ft] Run-Through Stabilization Users Guide
bronevetsky1 at llnl.gov
Mon Feb 7 16:13:36 CST 2011
> I think we should steer clear of even hinting to the application that we
> are going to kill processes so that we can save memory. Now the process
> could fail because MPI tried to allocate memory and ran out - at which
> point the MPI should return an error, and shutdown services - Maybe a good
> use case for CANNOT_CONTINUE. That situation is slightly different than
> proactively killing processes because we don't want to/cannot track their
> state. That sounds like a surefire way to make users flee from the MPI to
> something else.
In that case, how do we deal with network partitions? As I mentioned in my email to Toon, the current semantics effectively force us to kill all the processes on one side of a network partition once the partition is healed. The same is true for any process that is erroneously judged to have failed because of a transient network problem. What do we do then?
> As far as the application is concerned, it only needs to allocate an array
> of MPI_Rank_info objects the size of that returned by MPI_Comm_validate_all.
> Which should be small, but is inclusive of all the known failures
> (recognized and unrecognized).
> We could add a flag to the local MPI_Comm_validate that would return a list
> of the unrecognized failures, instead of all failures. Or, a bit more
> specifically, a 'state' key argument to the function, and it will return
> only those processes that match that state. This would allow the
> application to manage smaller sets of failures without having to iterate
> over previously recognized failures. What do you think about that?
I like both ideas very much. The flag for MPI_Comm_validate could also be applied to MPI_Comm_validate_all, and it avoids the need for two collectives. Also, having MPI return a list would be a big improvement. The set of failed processes will be sparse in the overall set of MPI ranks, so forcing developers to allocate and iterate over arrays will be very expensive. Lists solve this problem and are generally a more scalable way to communicate information about a few ranks out of the set of all possible ranks.
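To make the scalability point concrete, here is a minimal sketch of what a state-filtered validate returning a list might look like. All names here (rank_state_t, rank_info_t, validate_with_state, and the state table standing in for the MPI runtime's failure knowledge) are hypothetical illustrations of the proposal's shape, not part of any MPI standard or implementation:

```c
#include <stdlib.h>

/* Hypothetical rank states, modeled on the recognized/unrecognized
   failure distinction discussed in this thread. */
typedef enum {
    RANK_OK,
    RANK_FAILED_UNRECOGNIZED,
    RANK_FAILED_RECOGNIZED
} rank_state_t;

/* Linked-list node: memory is O(number of failures), not
   O(number of ranks in the communicator). */
typedef struct rank_info {
    int rank;
    rank_state_t state;
    struct rank_info *next;
} rank_info_t;

/* Mock of the proposed state-keyed validate: return only the ranks
   whose state matches 'wanted', as a list sorted by rank.  A real
   MPI_Comm_validate would get this information from the runtime;
   here a caller-supplied state table stands in for it. */
rank_info_t *validate_with_state(const rank_state_t *states, int nranks,
                                 rank_state_t wanted, int *count)
{
    rank_info_t *head = NULL;
    *count = 0;
    /* Walk high-to-low and prepend, so the list comes out rank-ordered. */
    for (int r = nranks - 1; r >= 0; r--) {
        if (states[r] == wanted) {
            rank_info_t *node = malloc(sizeof *node);
            node->rank = r;
            node->state = states[r];
            node->next = head;
            head = node;
            (*count)++;
        }
    }
    return head;
}
```

With this shape, an application recovering from three failures in a large job walks three list nodes rather than scanning an array sized by the whole communicator, which is the scalability argument above.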
> Ah I see what you are saying now. So add one more state in order to the
> Globally cleared really only has meaning at the full communicator level
> (used to determine if collectives are enabled/disabled). So wouldn't it be
> better to have a way to query the communicator to see if it was globally
> cleared or not, in order for the application to decide if collectives can
> be used or not. We talked a while back about adding an attribute to the
> communicator to indicate this, but I don't think it ever made it into the
> proposal. I'll make a note about this, and maybe we can discuss what it
> should look like next week.
Ah, now I see what you're saying! If any rank in a communicator is not globally cleared, then the only remedy is to clear the entire communicator. As such, the only useful API is one that checks the global-cleared status of communicators, not individual ranks. Yeah, that makes sense. Also, I think you should change the terminology from "globally" cleared to something more connected to communicators. For example, a rank may be "locally cleared" and a communicator can be "cleared," but a rank can't be "globally cleared."
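The communicator-level query being discussed could look something like the following. This is a mock model, not the proposal's actual API: mock_comm_t and comm_is_cleared are invented names, and the per-rank flag array stands in for whatever local-clear state the MPI library would track internally (the thread mentions exposing this via a communicator attribute instead):

```c
#include <stdbool.h>

/* Toy model of a communicator carrying per-rank local-clear status.
   In a real implementation this state would live inside the MPI
   library and be queried, e.g., through a communicator attribute. */
typedef struct {
    int nranks;
    bool *locally_cleared;  /* one entry per rank in the communicator */
} mock_comm_t;

/* "Cleared" is a property of the whole communicator: collectives are
   usable only if every rank is locally cleared.  A single uncleared
   rank means the entire communicator must be cleared again. */
bool comm_is_cleared(const mock_comm_t *comm)
{
    for (int r = 0; r < comm->nranks; r++)
        if (!comm->locally_cleared[r])
            return false;
    return true;
}
```

This matches the terminology point above: the query is per-communicator, and individual ranks only ever have a local-clear status.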
Lawrence Livermore National Lab
bronevetsky at llnl.gov