[Mpi3-ft] Run-Through Stabilization Users Guide
jjhursey at open-mpi.org
Tue Feb 1 15:18:39 CST 2011
On Feb 1, 2011, at 11:03 AM, Bronevetsky, Greg wrote:
>> I do see the point though. I wonder if an additional, combination function
>> would be useful for this case. Something like:
>> MPI_COMM_VALIDATE_ALL_FULL_REPORT(comm, incount, outcount, totalcount, rank_infos)
>> comm: communicator
>> incount: size of rank_infos array
>> outcount: number of rank_info entries filled
>> totalcount: total number of failures known
>> rank_infos: array of MPI_Rank_info types
>> This would allow the user to determine if they are getting a report of all
>> the failures (outcount == totalcount), or just a subset because they did
>> not supply a sufficiently allocated buffer (outcount < totalcount). This
>> does force the user to allocate the rank_infos buffer before it may be
>> needed, but if the type of consistency that you cite is needed then maybe
>> this is not a problem.
> This would do the job but would have the memory allocation problem that you describe. Another way to do this would be to provide something like an MPI_Status object to MPI_COMM_VALIDATE_ALL and provide functions to allow the user to further interrogate it. This object would not keep a list of all the failed ranks but instead would correspond to a point in time where some failures occur before this point and others after this point. This would allow the object to stay small.
But then how does the user communicate that they no longer need this information? The MPI implementation would have to save the state associated with every call to MPI_Comm_validate_all in case the user wants the information from one or more invocations. We may get away with making the rule that it is only associated with the last call, but apps might want the ability to essentially do a diff of the lists from one or more previous invocations.
An alternative solution with the above example would be to use MPI_Comm_validate_all to get the count, then use MPI_Comm_validate_all_full_report to get a consistent list. This would remove the inconsistent in-between state, but at the cost of two collectives over the same communicator, which I like less than the memory allocation solution.
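To make the outcount/totalcount semantics of the combined call concrete, here is a minimal sketch in C. It is not an MPI call: `ft_rank_info`, `ft_full_report`, and `ft_report_is_complete` are hypothetical names, and the list of known failures is passed in as a plain array so that the truncation behavior can be shown without an MPI library.

```c
#include <assert.h>

/* Hypothetical stand-in for the proposed MPI_Rank_info type. */
typedef struct {
    int rank;   /* rank within the communicator */
    int state;  /* placeholder for a rank state value */
} ft_rank_info;

/*
 * Mock of the proposed full-report semantics (an assumption, not a real
 * MPI function): copy at most `incount` of the `nknown` known failures
 * into `rank_infos`, then report how many entries were filled
 * (`*outcount`) and how many failures are known in total (`*totalcount`).
 */
static void ft_full_report(const ft_rank_info *known, int nknown,
                           int incount, int *outcount, int *totalcount,
                           ft_rank_info *rank_infos)
{
    assert(nknown >= 0 && incount >= 0);
    int n = (incount < nknown) ? incount : nknown;
    for (int i = 0; i < n; i++) {
        rank_infos[i] = known[i];
    }
    *outcount = n;
    *totalcount = nknown;
}

/* The caller holds the full list only if nothing was truncated. */
static int ft_report_is_complete(int outcount, int totalcount)
{
    return outcount == totalcount;
}
```

With a buffer that is too small, the caller sees outcount < totalcount and knows to retry with at least totalcount entries, which is the single-collective alternative to the count-then-list pattern described above.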
> One larger question about memory: Are we worried that MPI's internal storage to keep track of failed processes will grow out of proportion? In the limit, this is certainly a problem (e.g. half the ranks are dead) but in practice I'm not sure if we care. If MPI is running out of space to store the ranks of dead processes, are we comfortable with the idea of MPI actively killing ranks so that it can compress its internal representation?
I'm not so worried about the memory requirements at the moment (and in practice it is not a big issue right now). I think the more pressing problem is how an MPI implementation is going to track a very large number of processes in general. Once we implementors figure out a way to track membership in a very large communicator (where membership cannot be fully known locally), adding a small amount of extra state to the distributed store to determine if the process is OK/FAILED/NULL should be considered in such a solution. But I don't think the proposed interface requires the MPI implementation to do anything too exhaustive unless the application asks for a full list of a large number of failures.
I think we should steer clear of even hinting to the application that we are going to kill processes so that we can save memory. Now, a process could fail because MPI tried to allocate memory and ran out, at which point MPI should return an error and shut down services. That may be a good use case for CANNOT_CONTINUE. That situation is slightly different than proactively killing processes because we don't want to, or cannot, track their state. That sounds like a surefire way to make users flee from MPI to something else.
As far as the application is concerned, it only needs to allocate an array of MPI_Rank_info objects of the size returned by MPI_Comm_validate_all. That count should be small, but it includes all the known failures (recognized and unrecognized).
We could add a flag to the local MPI_Comm_validate that would return a list of the unrecognized failures, instead of all failures. Or, a bit more specifically, a 'state' key argument to the function, and it will return only those processes that match that state. This would allow the application to manage smaller sets of failures without having to iterate over previously recognized failures. What do you think about that?
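As a rough illustration of the 'state' key idea, the sketch below filters a list of rank entries by a requested state. Everything here is hypothetical (`ft_state_key`, `ft_state_entry`, `ft_filter_by_state`); it only models the filtering, not how a local MPI_Comm_validate would obtain the list.

```c
#include <assert.h>

/* Hypothetical state keys, mirroring the OK/FAILED/NULL states discussed. */
typedef enum { FT_STATE_OK, FT_STATE_FAILED, FT_STATE_NULL } ft_state_key;

/* Hypothetical per-rank entry. */
typedef struct {
    int rank;
    ft_state_key state;
} ft_state_entry;

/*
 * Copy into `out` (capacity `incount`) only the entries whose state
 * matches `key`. Returns the total number of matches, which may exceed
 * `incount` if the output buffer is too small.
 */
static int ft_filter_by_state(const ft_state_entry *all, int n,
                              ft_state_key key,
                              ft_state_entry *out, int incount)
{
    assert(n >= 0 && incount >= 0);
    int matches = 0;
    for (int i = 0; i < n; i++) {
        if (all[i].state != key) {
            continue;
        }
        if (matches < incount) {
            out[matches] = all[i];
        }
        matches++;
    }
    return matches;
}
```

Asking for `FT_STATE_FAILED` would return only the unrecognized failures, so the application does not have to walk over failures it has already recognized.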
>>> Would it be better to explicitly have separate states for failures that have been recognized locally and collectively?
>> I don't think so. I think this starts to become confusing to the user, and
>> muddles the semantics a bit. If global/collective recognition does not
>> imply local recognition, then do we require that the user both locally and
>> globally recognize a failure before they can create a new communicator?
>> What if the communicator has a global recognition of failures, but not
>> locally? In that case collectives will succeed, but only some point-to-
>> point operations. This seems to be adding more work for the application
>> without a clear use case on when it would be required.
> You can separate the two concepts while still ensuring that collective recognition implies local recognition. It would just be a hierarchy of recognition levels (none < local < collective on individual communicators < collective on all communicators), where a given rank's position is always clearly defined.
Ah, I see what you are saying now. So add one more state to the taxonomy, in order:
MPI_RANK_STATE_OK          Normal running state
MPI_RANK_STATE_FAILED      Failed, has not been recognized/cleared
MPI_RANK_STATE_LOCAL_NULL  Failed, has been recognized/cleared locally
MPI_RANK_STATE_NULL        Failed, has been recognized/cleared globally
OK         -> FAILED     : When process failure is detected
FAILED     -> LOCAL_NULL : When locally validated
LOCAL_NULL -> NULL       : When collectively validated
FAILED     -> NULL       : When collectively validated
NULL       -> LOCAL_NULL : A new failure occurs in the communicator (?)
When a new failure occurs in a communicator do all the ranks that were previously in MPI_RANK_STATE_NULL (globally recognized) move to MPI_RANK_STATE_LOCAL_NULL (locally recognized only)? So the transition diagram could have a loop between the LOCAL_NULL and NULL states. Something like:
OK -> FAILED -> LOCAL_NULL <-> NULL or
OK -> FAILED -> NULL <-> LOCAL_NULL
It makes the state transitions a little more complex, but does allow the user to determine if the process has been locally or globally cleared. The question really is, is it useful to know that an individual rank is globally cleared?
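The transitions listed above can be sketched as a small state machine in C. The enum and function names are hypothetical, and the NULL -> LOCAL_NULL edge is the open question from this thread, so it is marked as tentative in the code.

```c
#include <assert.h>

/* Hypothetical mirror of the four rank states in the taxonomy. */
typedef enum { RS_OK, RS_FAILED, RS_LOCAL_NULL, RS_NULL } rank_state;

/* Hypothetical events that drive the transitions for one tracked rank. */
typedef enum {
    EV_FAILURE_DETECTED,     /* this rank was detected as failed */
    EV_LOCAL_VALIDATE,       /* local validation recognized it */
    EV_COLLECTIVE_VALIDATE,  /* collective validation recognized it */
    EV_NEW_FAILURE_IN_COMM   /* some other rank in the communicator failed */
} rank_event;

/* Return the next state; events that do not apply leave the state as is. */
static rank_state next_state(rank_state s, rank_event e)
{
    switch (e) {
    case EV_FAILURE_DETECTED:
        return (s == RS_OK) ? RS_FAILED : s;
    case EV_LOCAL_VALIDATE:
        return (s == RS_FAILED) ? RS_LOCAL_NULL : s;
    case EV_COLLECTIVE_VALIDATE:
        return (s == RS_FAILED || s == RS_LOCAL_NULL) ? RS_NULL : s;
    case EV_NEW_FAILURE_IN_COMM:
        /* Tentative: the "(?)" transition, not settled in the discussion. */
        return (s == RS_NULL) ? RS_LOCAL_NULL : s;
    }
    assert(0 && "unknown event");
    return s;
}
```

Written this way, the loop between LOCAL_NULL and NULL is visible as the pair of `EV_COLLECTIVE_VALIDATE` and `EV_NEW_FAILURE_IN_COMM` cases.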
Globally cleared really only has meaning at the full communicator level (used to determine if collectives are enabled/disabled). So wouldn't it be better to have a way to query the communicator to see whether it was globally cleared, so that the application can decide whether collectives can be used? We talked a while back about adding an attribute to the communicator to indicate this, but I don't think it ever made it into the proposal. I'll make a note about this, and maybe we can discuss what it should look like next week.
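One possible shape for such a communicator-level query is sketched below in C. It assumes, purely for illustration, that the answer can be derived from per-rank states: the communicator counts as globally cleared when no rank is still FAILED or only LOCAL_NULL. The names `comm_rank_state` and `comm_is_globally_cleared` are hypothetical and not part of the proposal.

```c
#include <assert.h>

/* Hypothetical per-rank states, as in the four-state taxonomy. */
typedef enum { CS_OK, CS_FAILED, CS_LOCAL_NULL, CS_NULL } comm_rank_state;

/*
 * Returns 1 if every rank is either running (OK) or globally recognized
 * as failed (NULL), i.e. collectives could be treated as enabled under
 * the assumption stated above; returns 0 otherwise.
 */
static int comm_is_globally_cleared(const comm_rank_state *states, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (states[i] == CS_FAILED || states[i] == CS_LOCAL_NULL) {
            return 0;
        }
    }
    return 1;
}
```

An application would check this once per communicator before choosing between collectives and point-to-point fallbacks, instead of inspecting each rank's state.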
>>> In the broadcast example it would be easier if both examples used the same bcast algorithm. You don't really explain how the algorithm works, so it'll be easier to understand if you don't switch them.
>> I can see that. I think explaining the way that different implementations
>> may cause things to go wonky might be useful. I cleaned up the language a
>> bit to describe the two algorithms and why we discuss them here. I would be
>> fine with dropping one of the bcast illustrations if you all think it is
>> still too confusing.
> I don't think it's a big deal. Our developers are fairly familiar with communication protocols so I doubt that it will really confuse them.
> Greg Bronevetsky
> Lawrence Livermore National Lab
> (925) 424-5756
> bronevetsky at llnl.gov
Postdoctoral Research Associate
Oak Ridge National Laboratory