[Mpi-forum] large count support not as easy as people seem to have thought

Moody, Adam T. moody20 at llnl.gov
Tue May 6 22:08:15 CDT 2014

Hi Jeff,
Ok, thanks for the clarification.  Constraining the uses makes sense.

I had been considering the general case.  And I was also worried about dealing with 128-bit file systems.  I can imagine there may be some cases for really large datatypes in MPI I/O on 128-bit file systems, especially dealing with file views.

BTW, I did convince myself, with an inductive proof, that you can at least handle contiguous types of an arbitrary length, type(k).

Base case:
  type(1) == this is a predefined type

Inductive step:
Assume all types 1 through k can be created, now consider k+1:
if k+1 is even
  type(k+1) == type_contiguous(2, type((k+1)/2))
if k+1 is odd
  lens = {1,1}
  disps = {0, extent(type(1))}
  types = {type(1), type(k)}
  type(k+1) == type_struct(2, lens, disps, types)
  since extent(type(1)) is the extent of a predefined type, it fits in an MPI_Aint
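
The induction above can be traced with a small model. This is only a sketch: it makes no MPI calls, and the names model_type, model_contiguous2, model_struct_prepend1 and build_type are hypothetical. It tracks how many predefined elements each constructed type covers and how many constructor calls were needed, taking the even case as two copies of the half-length type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an MPI datatype: we only track how many
 * predefined elements it covers and how many constructor calls built it. */
typedef struct {
    uint64_t elements;   /* number of predefined elements covered */
    unsigned calls;      /* constructor calls used so far */
} model_type;

/* Models type_contiguous(2, t): two copies of t back to back. */
static model_type model_contiguous2(model_type t)
{
    model_type r = { 2 * t.elements, t.calls + 1 };
    return r;
}

/* Models type_struct with lens {1,1}: one predefined element at
 * displacement 0, followed by t at displacement extent(type(1)). */
static model_type model_struct_prepend1(model_type t)
{
    model_type r = { 1 + t.elements, t.calls + 1 };
    return r;
}

/* Builds type(n) following the induction: halve when n is even,
 * peel off one predefined element when n is odd. */
static model_type build_type(uint64_t n)
{
    assert(n >= 1);
    if (n == 1) {
        model_type base = { 1, 0 };   /* base case: a predefined type */
        return base;
    }
    if (n % 2 == 0)
        return model_contiguous2(build_type(n / 2));
    return model_struct_prepend1(build_type(n - 1));
}
```

In this model every count passed to a constructor is 1 or 2 and the only displacement is the extent of one predefined element, so the number of calls grows roughly like 2*log2(n) rather than with n itself.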


From: mpi-forum [mpi-forum-bounces at lists.mpi-forum.org] on behalf of Jeff Hammond [jeff.science at gmail.com]
Sent: Tuesday, May 06, 2014 7:03 PM
To: Main MPI Forum mailing list
Subject: Re: [Mpi-forum] large count support not as easy as people seem to have thought

Hi Adam,

First, you've hit on an important shortcoming of BigMPI, which is that
I do not document its limitations in enough detail, specifically
w.r.t. int/MPI_Aint/MPI_Count.  However, the README, which is on the
home page, clearly states that BigMPI will not support counts larger
than fit into MPI_Aint, since I believe this is impossible.

I honestly do not see how 32b MPI_Aint is a valid issue with BigMPI.
If you have a 32b address space, how are you going to allocate and
index the buffer that spills the count?  I am only supporting
communication operations, not I/O ones, so there is no need to
consider the global address space size.  I suppose, in theory, you
could have a 4GiB address space and be unable to do an in-place
reduction on a 2^31+1 element array of MPI_CHAR, but does anyone care
about this corner case?

So, to be explicit, BigMPI is only going to support systems where
sizeof(int)=4 and sizeof(MPI_Aint)=8, and where the user passes count
that is less than 2^63.  I believe this is the _only_ use case worth
caring about.  As I wrote in the README, if you've got a machine with
more than 2^63 bytes of memory in it - even globally - please let me
know so I can run NWChem on it :-)
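
To make the int and MPI_Aint limits in this discussion concrete, here is a minimal sketch of the arithmetic only; it is not BigMPI's actual code and makes no MPI calls. It assumes the decomposition discussed in this thread, as I read it (whole chunks of INT_MAX elements via MPI_Type_vector, with the remainder attached via MPI_Type_create_struct), together with sizeof(int)=4 and a signed 64-bit MPI_Aint. The names count_plan and plan_large_count are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical plan for describing `count` elements as
 * chunks * INT_MAX + remainder. */
typedef struct {
    uint64_t chunks;      /* whole blocks of INT_MAX elements */
    uint64_t remainder;   /* leftover elements, always < INT_MAX */
    bool representable;   /* true if every argument fits its C type */
} count_plan;

/* `extent` is the size in bytes of one element (assumed >= 1). */
static count_plan plan_large_count(uint64_t count, uint64_t extent)
{
    assert(extent >= 1);
    count_plan p;
    p.chunks = count / (uint64_t)INT_MAX;
    p.remainder = count % (uint64_t)INT_MAX;

    /* The chunk count is passed as an int
     * (the first argument of MPI_Type_vector). */
    bool chunks_fit_int = p.chunks <= (uint64_t)INT_MAX;

    /* The remainder sits at byte displacement chunks * INT_MAX * extent,
     * which must fit in MPI_Aint (assumed signed 64-bit here). */
    uint64_t whole = p.chunks * (uint64_t)INT_MAX;  /* <= count, no overflow */
    bool disp_fits_aint = whole <= (uint64_t)INT64_MAX / extent;

    p.representable = chunks_fit_int && disp_fits_aint;
    return p;
}
```

Under these assumptions a single level of INT_MAX-sized chunks stops just below 2^62 elements, and the byte displacement limit is reached sooner for wider element types.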

Finally, I do not believe anyone is going to build a supercomputer
that has a 32b address space in the future.  Blue Gene/P is going to
be the last of its kind.  There just isn't any rationale for having 32b
addresses if there are more than 4GiB of memory per node.

It is already on my TODO list to add an autotools build system, which
is going to allow me to detect all of the relevant type sizes and make
sure the user doesn't try to do anything silly.
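
A rough sketch of what such a check might look like at run time, independent of any build system. The function names are hypothetical and this is not BigMPI code; the sizes are passed in as parameters so the sketch compiles without mpi.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True only for the configuration named above:
 * sizeof(int) == 4 and sizeof(MPI_Aint) == 8. */
static bool sizes_supported(size_t int_size, size_t aint_size)
{
    return int_size == 4 && aint_size == 8;
}

/* True when the user's count is below 2^63. */
static bool count_supported(uint64_t count)
{
    return count < (UINT64_C(1) << 63);
}

/* Stops early (in builds with assertions enabled) when the platform
 * is outside the supported set. */
static void require_supported_platform(size_t int_size, size_t aint_size)
{
    assert(sizes_supported(int_size, aint_size));
}
```

In a real program the call would presumably be require_supported_platform(sizeof(int), sizeof(MPI_Aint)), made once at startup.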



On Tue, May 6, 2014 at 8:37 PM, Adam T. Moody <moody20 at llnl.gov> wrote:
> Hi Jeff,
> Thanks so much for taking on this challenge!
> I think things are even worse than you've noted.  In fact, it may be
> impossible.
> I had considered this a bit once before, and when I ran a gedankenexperiment
> on my own MPI_Type_contiguous_x, I got caught up in all of the complexities
> of portably* dealing with the various integer types.
> Just in this one function, you need to deal with int, MPI_Aint, and
> MPI_Count.  In your particular implementation, you have to be sure that you
> don't overflow "int" in the call to MPI_Type_vector and you have to be sure
> you don't overflow MPI_Aint in the call to MPI_Type_create_struct.
> https://github.com/jeffhammond/BigMPI/blob/master/src/type_contiguous_x.c#L25
> It gets ugly because MPI_Count can be much larger than the other two types.
> MPI_Count has to be big enough to hold the largest value of int, MPI_Aint,
> and MPI_Offset.  MPI_Aint has to be large enough to represent every address
> on a system, and int is of course often 32 bits but it's only guaranteed to
> be at least 16 bits as George pointed out.
> So now imagine MPI running on a very large system composed of many, many
> small-memory cores.  That's a very plausible future system design.  On such
> a system, it's possible that you may have 32-bit ints, 32-bit MPI_Aints
> (small memory), but 128-bit MPI_Offsets.  On such a system, MPI_Count /
> (INT_MAX) would still be more than what fits in an int, so then you overflow
> the first parameter to MPI_Type_vector.
> https://github.com/jeffhammond/BigMPI/blob/master/src/type_contiguous_x.c#L36
> Because MPI_Count can be much bigger than both int and MPI_Aint, I believe
> this problem becomes very ugly (and perhaps impossible) to solve in a
> portable way.  Even implementing something as simple as
> MPI_Type_contiguous_x is very cumbersome.
> I believe (but have not proved) that there are some large count values which
> are *impossible* to represent with any combination of existing MPI calls due
> to these limitations.  Ultimately, you need to piece together arbitrary
> types with something like MPI_Type_create_struct, but this takes an MPI_Aint
> to express displacements from the front, so once you have types that are
> well bigger than an MPI_Aint my guess is that there are probably some you
> can't define at all.
> -Adam
> Jeff Hammond wrote:
>> As I noted on this list last Friday, I am working on a higher-level
>> library to support large counts for MPI communication functions
>> (https://github.com/jeffhammond/BigMPI).
>> In the course of actually trying to implement this the way the Forum
>> contends it can be done - i.e. using derived-datatypes - I have found
>> some issues that undermine the Forum's contention that it is so easy
>> for users to do it that it doesn't need to be in the standard.
>> To illustrate some of the issues that I have found, let us consider
>> the large-count implementation of nonblocking reduce...
>> # Example Use Case
>> It is entirely reasonable to think that some quantum chemist will want
>> to reduce a contiguous buffer of more than 2^31 doubles corresponding
>> to the Fock matrix if they have a multithreaded code, since 16 GB is
>> not an unreasonable amount of memory per node.
>> # Issues
>> Unlike the blocking case, where it is reasonable to chop up the data
>> and perform multiple operations (e.g.
>> https://github.com/jeffhammond/BigMPI/blob/master/src/reductions_x.c,
>> assuming that untested code is correct), one must return a single
>> request to the application if one is to implement MPIX_I(all)reduce_x
>> with the same semantics as MPI_Iallreduce, as I aspire to do in
>> BigMPI.
>> Issue #1: chopping doesn't work for nonblocking.
>> To do the large-count reduction in one nonblocking MPI call, a derived
>> datatype is required.  However, unlike in RMA, reductions cannot use
>> built-in ops for user-defined datatypes, even if they are trivially
>> composed of a large-count of built-in datatypes.  See
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/338 and
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/34 for elaborate
>> commentary on why this semantic mismatch is lame.
>> Issue #2: cannot use built-in reduce ops.
>> Once we rule out using built-in ops with our large-count datatypes, we
>> must reimplement all of the reduction operations required.  I find
>> this to be nontrivial.  I have not yet figured out how to get at the
>> underlying datatype info in a simple manner.  It appears that
>> MPI_Type_get_envelope exists for this purpose, but it's a huge pain to
>> have to call this function when all I need to know is the number of
>> built-in datatypes so that I can be clever and use
>> MPI_Reduce_local inside of my user-defined operation.
>> Issue #3: implementing the user-defined reduce op isn't easy (in my
>> opinion).
>> Many MPI implementations optimize reductions.  On Blue Gene/Q, MPI has
>> explicitly vectorized intrinsic/assembly code.  Unless
>> MPI_Reduce_local hits that code path, I am losing a huge amount of
>> performance in reductions when I go from 2^31 to 2^31+1 elements.  I
>> would not be surprised at all if user-defined ops+datatypes exercises
>> suboptimal code paths in many MPI implementations, which means that
>> the performance of nonblocking reductions is unnecessarily crippled.
>> Issue #4: inability to use optimizations in the MPI implementation.
>> # Conclusion
>> I believe this problem is best addressed in one of two ways:
>> 1) Approve the semantic changes requested in tickets 34 and 338 so
>> that one can use built-in ops with homogeneous user-defined datatypes.
>> This is my preference for multiple reasons.
>> 2) Add large-count reductions to the standard.  This means 8 new
>> functions: blocking and nonblocking (all)reduce and
>> reduce_scatter(_block).  We don't need large-count functions for any
>> other collectives because the datatype solution works just fine there,
>> as I've already demonstrated in BigMPI
>> (https://github.com/jeffhammond/BigMPI/blob/master/src/collectives_x.c).
>> # Social Commentary
>> From now on, when the Forum punts on things and says it's no problem
>> for users to roll their own using the existing functionality in MPI,
>> we should strive to be a bit more diligent and actually prototype that
>> implementation in a manner that proves how easy it is for users.  It
>> turns out, writing code for some things is harder than just talking
>> about them in a conference room...
>> # Related
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/338#comment:9
>> captures some of this feedback in Trac.
>> I created https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/423 for
>> the reasons described therein.
> _______________________________________________
> mpi-forum mailing list
> mpi-forum at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum

Jeff Hammond
jeff.science at gmail.com