[Mpi3-ft] Run-Through Stabilization Users Guide

Bronevetsky, Greg bronevetsky1 at llnl.gov
Tue Feb 1 10:03:29 CST 2011

> I do see the point though. I wonder if an additional, combination function
> would be useful for this case. Something like:
> MPI_COMM_VALIDATE_ALL_FULL_REPORT(comm, incount, outcount, totalcount,
> rank_infos)
> comm: communicator
> incount: size of rank_infos array
> outcount: number of rank_info entries filled
> totalcount: total number of failures known
> rank_infos: array of MPI_Rank_info types
> This would allow the user to determine if they are getting a report of all
> the failures (outcount == totalcount), or just a subset because they did
> not supply a sufficiently allocated buffer (outcount < totalcount). This
> does force the user to allocate the rank_infos buffer before it may be
> needed, but if the type of consistency that you cite is needed then maybe
> this is not a problem.
This would do the job, but it would have the memory allocation problem that you describe. Another way to do this would be to have MPI_COMM_VALIDATE_ALL return something like an MPI_Status object and provide functions that let the user interrogate it further. This object would not keep a list of all the failed ranks. Instead, it would correspond to a point in time, such that some failures occurred before this point and others after it. This would allow the object to stay small.
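
To make the contrast concrete, here is a rough sketch of both options in plain C, with no MPI calls. Everything in it is hypothetical and exists only for illustration: failure_log stands in for whatever the library keeps internally, full_report mimics the incount/outcount/totalcount behaviour of the quoted proposal, and validate_point is one possible shape for the small status-like object. It is not a proposed interface.

```c
#include <assert.h>

/* Hypothetical stand-in for the library's internal failure record:
 * ranks are appended in the order their failure is detected. */
#define MAX_FAILED 64
typedef struct {
    int ranks[MAX_FAILED];
    int count;
} failure_log;

/* Hypothetical, simplified analogue of a rank-info entry. */
typedef struct {
    int rank;
} rank_info;

void log_failure(failure_log *log, int rank) {
    assert(log->count < MAX_FAILED);
    log->ranks[log->count++] = rank;
}

/* Buffer-based report: fills at most incount entries and reports the
 * total separately, so outcount < totalcount signals a truncated report. */
void full_report(const failure_log *log, int incount,
                 int *outcount, int *totalcount, rank_info *infos) {
    assert(incount >= 0);
    int n = log->count < incount ? log->count : incount;
    for (int i = 0; i < n; i++) {
        infos[i].rank = log->ranks[i];
    }
    *outcount = n;
    *totalcount = log->count;
}

/* Status-like object: constant size, records only a position in the log. */
typedef struct {
    int cutoff;
} validate_point;

validate_point take_point(const failure_log *log) {
    validate_point p = { log->count };
    return p;
}

/* Returns 1 if the rank's failure was recorded before the point, else 0. */
int failed_before(const failure_log *log, validate_point p, int rank) {
    for (int i = 0; i < p.cutoff; i++) {
        if (log->ranks[i] == rank) {
            return 1;
        }
    }
    return 0;
}
```

In this sketch the user-side object is a single integer, but the library still holds the full log, so the sketch only moves the storage question rather than answering it.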

One larger question about memory: Are we worried that MPI's internal storage to keep track of failed processes will grow out of proportion? In the limit, this is certainly a problem (e.g. half the ranks are dead) but in practice I'm not sure if we care. If MPI is running out of space to store the ranks of dead processes, are we comfortable with the idea of MPI actively killing ranks so that it can compress its internal representation?

> > Would it be better to explicitly have separate states for failures that
> > have been recognized locally and collectively?
> I don't think so. I think this starts to become confusing to the user, and
> muddles the semantics a bit. If global/collective recognition does not
> imply local recognition, then do we require that the user both locally and
> globally recognize a failure before they can create a new communicator?
> What if the communicator has a global recognition of failures, but not
> locally? In that case collectives will succeed, but only some point-to-
> point operations. This seems to be adding more work for the application
> without a clear use case on when it would be required.
You can separate the two concepts while still ensuring that collective recognition implies local recognition. It would just be a hierarchy of recognition levels (none < local < collective on individual communicators < collective on all communicators), where a given rank's position is always clearly defined.
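
As a rough illustration of that ordering, the levels could be modelled as an ordered enum, assuming the hierarchy is a strict total order. The names below are hypothetical and are not taken from any proposal text.

```c
#include <assert.h>

/* Hypothetical recognition levels, declared in increasing order so that
 * a higher level always implies every lower one. */
typedef enum {
    RECOG_NONE = 0,
    RECOG_LOCAL,
    RECOG_COLLECTIVE_COMM,  /* collective on an individual communicator */
    RECOG_COLLECTIVE_ALL    /* collective on all communicators */
} recog_level;

/* Because the levels are totally ordered, "recognized at least at level
 * need" is a single comparison. */
int recognized_at_least(recog_level have, recog_level need) {
    return have >= need;
}

/* Recognition only moves upward: a lower observation never downgrades
 * a level that has already been reached. */
recog_level raise_level(recog_level current, recog_level observed) {
    assert(observed <= RECOG_COLLECTIVE_ALL);
    return observed > current ? observed : current;
}
```

With this encoding, a failure at RECOG_COLLECTIVE_COMM automatically satisfies any check for RECOG_LOCAL, which is the implication asked for in the quoted text.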

> > In the broadcast example it would be easier if both examples used the
> > same bcast algorithm. You don't really explain how the algorithm works, so
> > it'll be easier to understand if you don't switch them.
> I can see that. I think explaining the way that different implementations
> may cause things to go wonky might be useful. I cleaned up the language a
> bit to describe the two algorithms and why we discuss them here. I would be
> fine with dropping one of the bcast illustrations if you all think it is
> still too confusing.
I don't think it's a big deal. Our developers are fairly familiar with communication protocols so I doubt that it will really confuse them.

Greg Bronevetsky
Lawrence Livermore National Lab
(925) 424-5756
bronevetsky at llnl.gov
