[Mpi-forum] large count support not as easy as people seem to have thought
Adam T. Moody
moody20 at llnl.gov
Tue May 6 20:37:29 CDT 2014
Thanks so much for taking on this challenge!
I think things are even worse than you've noted. In fact, it may be
I had considered this a bit once before, and when I ran a
gedankenexperiment on my own MPI_Type_contiguous_x, I got caught up in
all of the complexities of *portably* dealing with the various integer types.
Just in this one function, you need to deal with int, MPI_Aint, and
MPI_Count. In your particular implementation, you have to be sure that
you don't overflow "int" in the call to MPI_Type_vector and you have to
be sure you don't overflow MPI_Aint in the call to MPI_Type_create_struct.
It gets ugly because MPI_Count can be much larger than the other two
types. MPI_Count has to be big enough to hold the largest value of int,
MPI_Aint, and MPI_Offset. MPI_Aint has to be large enough to represent
every address on a system, and int is of course often 32 bits but it's
only guaranteed to be at least 16 bits, as George pointed out.
So now imagine MPI running on a very large system composed of many, many
small-memory cores. That's a very plausible future system design. On
such a system, it's possible that you may have 32-bit ints, 32-bit
MPI_Aints (small memory), but 128-bit MPI_Offsets. On such a system,
a large MPI_Count value divided by INT_MAX could still be more than what
fits in an int, so then you overflow the first parameter to MPI_Type_vector.
Because MPI_Count can be much bigger than both int and MPI_Aint, I
believe this problem becomes very ugly (and perhaps impossible) to solve
in a portable way. Even implementing something as simple as
MPI_Type_contiguous_x is very cumbersome.
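
To make that overflow concrete, here is a minimal sketch in plain C, with no MPI calls. The names small_int_t, big_count_t, SMALL_INT_MAX and split_count are hypothetical stand-ins chosen for illustration: a 32-bit type plays the role of "int" and a 64-bit type plays the role of MPI_Count. It only shows the arithmetic of splitting a large count into full chunks plus a remainder, and is not taken from any real implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins: small_int_t plays "int", big_count_t plays MPI_Count. */
typedef int32_t small_int_t;
typedef int64_t big_count_t;
#define SMALL_INT_MAX INT32_MAX

/*
 * Split count into nchunks blocks of SMALL_INT_MAX elements plus a remainder.
 * Returns 0 on success, or -1 when the number of chunks itself does not fit
 * in small_int_t (the case where the first vector parameter would overflow).
 */
int split_count(big_count_t count, small_int_t *nchunks, small_int_t *remainder)
{
    assert(count >= 0);
    big_count_t c = count / SMALL_INT_MAX;
    big_count_t r = count % SMALL_INT_MAX;   /* always < SMALL_INT_MAX */
    if (c > SMALL_INT_MAX) {
        return -1;                           /* chunk count overflows "int" */
    }
    *nchunks = (small_int_t)c;
    *remainder = (small_int_t)r;
    return 0;
}
```

Even with only a 64-bit count, the largest value already yields more chunks than a 32-bit int can hold, so the gap only grows as the count type gets wider.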
I believe (but have not proved) that there are some large count values
which are *impossible* to represent with any combination of existing MPI
calls due to these limitations. Ultimately, you need to piece together
arbitrary types with something like MPI_Type_create_struct, but this
takes an MPI_Aint to express displacements from the front, so once you
have types that are much larger than an MPI_Aint can describe, my guess is that there
are probably some you can't define at all.
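
The displacement limit can be sketched the same way. This is again plain C with hypothetical names (small_aint_t, wide_count_t, byte_displacement), assuming a 32-bit address-sized integer standing in for MPI_Aint on a small-memory system. It checks whether the byte offset of the trailing block can be expressed at all.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-ins: a 32-bit MPI_Aint and a wider count type. */
typedef int32_t small_aint_t;
typedef int64_t wide_count_t;

/*
 * Compute the byte displacement of a block that follows `elems` elements,
 * each `extent` bytes wide. Returns 0 and stores the result in *disp when it
 * fits in small_aint_t, or -1 when the displacement cannot be represented.
 */
int byte_displacement(wide_count_t elems, wide_count_t extent, small_aint_t *disp)
{
    assert(elems >= 0 && extent > 0);
    if (elems > INT32_MAX / extent) {
        return -1;                   /* elems * extent would exceed the Aint range */
    }
    *disp = (small_aint_t)(elems * extent);
    return 0;
}
```

With 8-byte elements, the check already fails once the leading part of the type reaches INT32_MAX elements, which is the situation where a struct-style composition has no way to say where the remainder starts.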
Jeff Hammond wrote:
>As I noted on this list last Friday, I am working on a higher-level
>library to support large counts for MPI communication functions
>In the course of actually trying to implement this the way the Forum
>contends it can be done - i.e. using derived-datatypes - I have found
>some issues that undermine the Forum's contention that it is so easy
>for users to do it that it doesn't need to be in the standard.
>To illustrate some of the issues that I have found, let us consider
>the large-count implementation of nonblocking reduce...
># Example Use Case
>It is entirely reasonable to think that some quantum chemist will want
>to reduce a contiguous buffer of more than 2^31 doubles corresponding
>to the Fock matrix if they have a multithreaded code, since 16 GB is
>not an unreasonable amount of memory per node.
>Unlike the blocking case, where it is reasonable to chop up the data
>and perform multiple operations (e.g.
>assuming that untested code is correct), one must return a single
>request to the application if one is to implement MPIX_I(all)reduce_x
>with the same semantics as MPI_Iallreduce, as I aspire to do in
>Issue #1: chopping doesn't work for nonblocking.
>To do the large-count reduction in one nonblocking MPI call, a derived
>datatype is required. However, unlike in RMA, reductions cannot use
>built-in ops for user-defined datatypes, even if they are trivially
>composed of a large-count of built-in datatypes. See
>https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/34 for elaborate
>commentary on why this semantic mismatch is lame.
>Issue #2: cannot use built-in reduce ops.
>Once we rule out using built-in ops with our large-count datatypes, we
>must reimplement all of the reduction operations required. I find
>this to be nontrivial. I have not yet figured out how to get at the
>underlying datatype info in a simple manner. It appears that
>MPI_Type_get_envelope exists for this purpose, but it's a huge pain to
>have to call this function when all I need to know is the number of
>built-in datatypes so that I can apply my cleverness and use
>MPI_Reduce_local inside of my user-defined operation.
>Issue #3: implementing the user-defined reduce op isn't easy (in my opinion).
>Many MPI implementations optimize reductions. On Blue Gene/Q, MPI has
>explicitly vectorized intrinsic/assembly code. Unless
>MPI_Reduce_local hits that code path, I am losing a huge amount of
>performance in reductions when I go from 2^31 to 2^31+1 elements. I
>would not be surprised at all if user-defined ops+datatypes exercises
>suboptimal code paths in many MPI implementations, which means that
>the performance of nonblocking reductions is unnecessarily crippled.
>Issue #4: inability to use optimizations in the MPI implementation.
>I believe this problem is best addressed in one of two ways:
>1) Approve the semantic changes requested in tickets 34 and 338 so
>that one can use built-in ops with homogeneous user-defined datatypes.
> This is my preference for multiple reasons.
>2) Add large-count reductions to the standard. This means 8 new
>functions: blocking and nonblocking (all)reduce and
>reduce_scatter(_block). We don't need large-count functions for any
>other collectives because the datatype solution works just fine there,
>as I've already demonstrated in BigMPI
># Social Commentary
>From now on, when the Forum punts on things and says it's no problem
>for users to roll their own using the existing functionality in MPI,
>we should strive to be a bit more diligent and actually prototype that
>implementation in a manner that proves how easy it is for users. It
>turns out, writing code for some things is harder than just talking
>about them in a conference room...
>captures some of this feedback in Trac.
>I created https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/423 for
>the reasons described therein.