[Mpi-forum] large count support not as easy as people seem to have thought

Tue May 6 20:37:29 CDT 2014

Hi Jeff,
Thanks so much for taking on this challenge!

I think things are even worse than you've noted.  In fact, it may be 
impossible.

I had considered this a bit once before, and when I ran a 
gedankenexperiment on my own MPI_Type_contiguous_x, I got caught up in 
all of the complexities of *portably* dealing with the various integer types.

Just in this one function, you need to deal with int, MPI_Aint, and 
MPI_Count.  In your particular implementation, you have to be sure that 
you don't overflow "int" in the call to MPI_Type_vector and you have to 
be sure you don't overflow MPI_Aint in the call to MPI_Type_create_struct.

https://github.com/jeffhammond/BigMPI/blob/master/src/type_contiguous_x.c#L25
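
To make the MPI_Aint concern concrete, here is a minimal sketch of the kind of check involved. It is not the BigMPI code: it uses int64_t as a stand-in for every MPI integer type, and it assumes the remainder block is placed at a byte displacement of chunks * int_max * extent. The names mul_fits and remainder_displacement_fits, and the int_max and aint_max parameters, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True iff a * b <= limit, for non-negative inputs.
 * Uses division so the product itself is never computed when it
 * could overflow. */
bool mul_fits(int64_t a, int64_t b, int64_t limit)
{
    assert(a >= 0 && b >= 0 && limit >= 0);
    if (a == 0 || b == 0)
        return true;
    return a <= limit / b;
}

/* Assumption: the remainder block starts at byte offset
 * chunks * int_max * extent, and that offset must fit in a type
 * whose maximum value is aint_max (a stand-in for MPI_Aint). */
bool remainder_displacement_fits(int64_t chunks, int64_t int_max,
                                 int64_t extent, int64_t aint_max)
{
    if (!mul_fits(chunks, int_max, aint_max))
        return false;
    /* Safe to multiply now: chunks * int_max <= aint_max. */
    return mul_fits(chunks * int_max, extent, aint_max);
}
```

With a 64-bit aint_max, one full block of INT32_MAX doubles (extent 8) passes the check; with a 32-bit aint_max the same displacement does not.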

It gets ugly because MPI_Count can be much larger than the other two 
types.  MPI_Count has to be big enough to hold the largest value of int, 
MPI_Aint, and MPI_Offset.  MPI_Aint has to be large enough to represent 
every address on a system, and int is of course often 32 bits but it's 
only guaranteed to be at least 16 bits, as George pointed out.

So now imagine MPI running on a very large system composed of many, many 
small-memory cores.  That's a very plausible future system design.  On 
such a system, it's possible that you may have 32-bit ints, 32-bit 
MPI_Aints (small memory), but 128-bit MPI_Offsets.  On such a system, 
the largest MPI_Count value divided by INT_MAX would still be more than 
what fits in an int, so then you overflow the first parameter to 
MPI_Type_vector.

https://github.com/jeffhammond/BigMPI/blob/master/src/type_contiguous_x.c#L36
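
The arithmetic can be sketched at a smaller scale. The following is only a model, not the BigMPI implementation: int64_t stands in for MPI_Count, the int_max parameter stands in for INT_MAX, and split_t and split_count are hypothetical names. It splits a count into full blocks of int_max elements plus a remainder, and reports whether the number of blocks would itself fit in an int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t chunks;     /* number of full blocks of int_max elements */
    int64_t remainder;  /* elements left over after the full blocks */
    bool chunks_fit;    /* true iff chunks <= int_max */
} split_t;

/* Split count into count / int_max full blocks plus a remainder.
 * int_max models the largest value of the int parameter that would
 * receive the block count. */
split_t split_count(int64_t count, int64_t int_max)
{
    assert(count >= 0 && int_max > 0);
    split_t s;
    s.chunks = count / int_max;
    s.remainder = count % int_max;
    s.chunks_fit = (s.chunks <= int_max);
    return s;
}
```

Even with a 64-bit count and a 32-bit int, split_count(INT64_MAX, INT32_MAX) yields 4294967298 full blocks, which does not fit in a 32-bit int. The gap only widens if the count type is 128 bits.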

Because MPI_Count can be much bigger than both int and MPI_Aint, I 
believe this problem becomes very ugly (and perhaps impossible) to solve 
in a portable way.  Even implementing something as simple as 
MPI_Type_contiguous_x is very cumbersome.

I believe (but have not proved) that there are some large count values 
which are *impossible* to represent with any combination of existing MPI 
calls due to these limitations.  Ultimately, you need to piece together 
arbitrary types with something like MPI_Type_create_struct, but this 
takes an MPI_Aint to express displacements from the front, so once you 
have types that are much bigger than an MPI_Aint can represent, my guess 
is that there are probably some you can't define at all.
-Adam

Jeff Hammond wrote:

>As I noted on this list last Friday, I am working on a higher-level
>library to support large counts for MPI communication functions
>(https://github.com/jeffhammond/BigMPI).
>
>In the course of actually trying to implement this the way the Forum
>contends it can be done - i.e. using derived-datatypes - I have found
>some issues that undermine the Forum's contention that it is so easy
>for users to do it that it doesn't need to be in the standard.
>
>To illustrate some of the issues that I have found, let us consider
>the large-count implementation of nonblocking reduce...
>
># Example Use Case
>
>It is entirely reasonable to think that some quantum chemist will want
>to reduce a contiguous buffer of more than 2^31 doubles corresponding
>to the Fock matrix if they have a multithreaded code, since 16 GB is
>not an unreasonable amount of memory per node.
>
># Issues
>
>Unlike the blocking case, where it is reasonable to chop up the data
>and perform multiple operations (e.g.
>https://github.com/jeffhammond/BigMPI/blob/master/src/reductions_x.c,
>assuming that untested code is correct), one must return a single
>request to the application if one is to implement MPIX_I(all)reduce_x
>with the same semantics as MPI_Iallreduce, as I aspire to do in
>BigMPI.
>
>Issue #1: chopping doesn't work for nonblocking.
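
For reference, the chopping approach for the blocking case can be sketched as a plain loop. This is a hedged illustration, not the code in reductions_x.c: for_each_chunk, chunk_fn, and add_length are hypothetical names, int64_t stands in for MPI_Count, and in a real blocking reduction the callback would issue one MPI_Reduce per chunk on the buffer advanced by offset elements. Each chunk is a separate operation, which is why no single request exists to hand back in the nonblocking case.

```c
#include <assert.h>
#include <stdint.h>

/* Callback invoked once per chunk: offset is in elements, n is the
 * chunk length and always fits in an int. */
typedef void (*chunk_fn)(int64_t offset, int n, void *ctx);

/* Example callback: adds each chunk's length to a running total.
 * A real callback would perform one blocking operation here. */
void add_length(int64_t offset, int n, void *ctx)
{
    (void)offset;
    *(int64_t *)ctx += n;
}

/* Walk count elements in pieces of at most chunk_max elements.
 * Returns the number of times fn was called. */
int64_t for_each_chunk(int64_t count, int chunk_max, chunk_fn fn, void *ctx)
{
    assert(count >= 0 && chunk_max > 0);
    int64_t calls = 0;
    int64_t offset = 0;
    while (offset < count) {
        int64_t left = count - offset;
        int n = (left > chunk_max) ? chunk_max : (int)left;
        fn(offset, n, ctx);
        offset += n;
        calls++;
    }
    return calls;
}
```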
>
>To do the large-count reduction in one nonblocking MPI call, a derived
>datatype is required.  However, unlike in RMA, reductions cannot use
>built-in ops for user-defined datatypes, even if they are trivially
>composed of a large-count of built-in datatypes.  See
>https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/338 and
>https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/34 for elaborate
>commentary on why this semantic mismatch is lame.
>
>Issue #2: cannot use built-in reduce ops.
>
>Once we rule out using built-in ops with our large-count datatypes, we
>must reimplement all of the reduction operations required.  I find
>this to be nontrivial.  I have not yet figured out how to get at the
>underlying datatype info in a simple manner.  It appears that
>MPI_Type_get_envelope exists for this purpose, but it's a huge pain to
>have to call this function when all I need to know is the number of
>built-in datatypes so that I can apply my clever trick and use
>MPI_Reduce_local inside of my user-defined operation.
>
>Issue #3: implementing the user-defined reduce op isn't easy (in my opinion).
>
>Many MPI implementations optimize reductions.  On Blue Gene/Q, MPI has
>explicitly vectorized intrinsic/assembly code.  Unless
>MPI_Reduce_local hits that code path, I am losing a huge amount of
>performance in reductions when I go from 2^31 to 2^31+1 elements.  I
>would not be surprised at all if user-defined ops+datatypes exercises
>suboptimal code paths in many MPI implementations, which means that
>the performance of nonblocking reductions is unnecessarily crippled.
>
>Issue #4: inability to use optimizations in the MPI implementation.
>
># Conclusion
>
>I believe this problem is best addressed in one of two ways:
>
>1) Approve the semantic changes requested in tickets 34 and 338 so
>that one can use built-in ops with homogeneous user-defined datatypes.
> This is my preference for multiple reasons.
>
>2) Add large-count reductions to the standard.  This means 8 new
>functions: blocking and nonblocking (all)reduce and
>reduce_scatter(_block).  We don't need large-count functions for any
>other collectives because the datatype solution works just fine there,
>as I've already demonstrated in BigMPI
>(https://github.com/jeffhammond/BigMPI/blob/master/src/collectives_x.c).
>
># Social Commentary
>
>From now on, when the Forum punts on things and says it's no problem
>for users to roll their own using the existing functionality in MPI,
>we should strive to be a bit more diligent and actually prototype that
>implementation in a manner that proves how easy it is for users.  It
>turns out, writing code for some things is harder than just talking
>about them in a conference room...
>
># Related
>
>https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/338#comment:9
>captures some of this feedback in Trac.
>
>I created https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/423 for
>the reasons described therein.
>