[Mpi-forum] large count support not as easy as people seem to have thought
Rob Latham
robl at mcs.anl.gov
Tue May 6 16:34:41 CDT 2014
On 05/06/2014 04:14 PM, Jeff Hammond wrote:
> On Tue, May 6, 2014 at 3:54 PM, Hjelm, Nathan T <hjelmn at lanl.gov> wrote:
>> +1 for large counts. I find it just a bit ridiculous that this was punted on by the Forum, but I wasn't around for the discussion. Is this an issue forced on C/C++ by Fortran?
>
> Unfortunately, I don't think we can blame this one on Fortran. Users
> want to pass large counts from the C interface. That's the motivation
> for BigMPI, at least.
I imagine one day (once again?) the C int type will be 64 bits on some
platform, right?
Or will C ints be 32 bits forever?
==rob
>
> The Fortran users who want -i8 and friends to work with the MPI
> Fortran interface are a separate issue, one that doesn't need new
> functions in the standard, just a bunch of interface gymnastics that
> most implementers won't enjoy.
>
> Best,
>
> Jeff
>
>> As I noted on this list last Friday, I am working on a higher-level
>> library to support large counts for MPI communication functions
>> (https://github.com/jeffhammond/BigMPI).
>>
>> In the course of actually trying to implement this the way the Forum
>> contends it can be done - i.e., using derived datatypes - I have found
>> some issues that undermine the Forum's contention that it is so easy
>> for users to do that it doesn't need to be in the standard.
>>
>> To illustrate some of the issues that I have found, let us consider
>> the large-count implementation of nonblocking reduce...
>>
>> # Example Use Case
>>
>> It is entirely reasonable to think that some quantum chemist will want
>> to reduce a contiguous buffer of more than 2^31 doubles corresponding
>> to the Fock matrix if they have a multithreaded code, since 16 GB is
>> not an unreasonable amount of memory per node.
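>>
>> To make the failure mode concrete, this is roughly what such a user
>> would like to write today (fock, n, and req are illustrative names):
>>
>>     /* 2^31 doubles = 16 GiB of Fock matrix per node */
>>     size_t n = (size_t)1 << 31;
>>     /* MPI_Iallreduce takes an int count, so the full buffer cannot
>>        be expressed in a single call: */
>>     /* MPI_Iallreduce(MPI_IN_PLACE, fock, (int)n, MPI_DOUBLE, MPI_SUM,
>>                       MPI_COMM_WORLD, &req); */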
>>
>> # Issues
>>
>> Unlike the blocking case, where it is reasonable to chop up the data
>> and perform multiple operations (e.g.
>> https://github.com/jeffhammond/BigMPI/blob/master/src/reductions_x.c,
>> assuming that untested code is correct), one must return a single
>> request to the application if one is to implement MPIX_I(all)reduce_x
>> with the same semantics as MPI_Iallreduce, as I aspire to do in
>> BigMPI.
>>
>> Issue #1: chopping doesn't work for nonblocking.
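>>
>> For reference, the blocking workaround amounts to something like the
>> following untested sketch (in the spirit of reductions_x.c, but not
>> copied from it).  The nonblocking analogue would generate one request
>> per chunk, which is exactly what we cannot hand back to the user:
>>
>>     #include <mpi.h>
>>     #include <limits.h>
>>
>>     /* Sketch: reduce `count` elements by chopping into INT_MAX-sized
>>        pieces.  MPI_IN_PLACE and error handling are ignored. */
>>     int Reduce_x_sketch(const void *sendbuf, void *recvbuf,
>>                         MPI_Count count, MPI_Datatype dt, MPI_Op op,
>>                         int root, MPI_Comm comm)
>>     {
>>         MPI_Aint lb, extent;
>>         MPI_Type_get_extent(dt, &lb, &extent);
>>         MPI_Count done = 0;
>>         while (done < count) {
>>             int c = (count - done > INT_MAX) ? INT_MAX
>>                                              : (int)(count - done);
>>             MPI_Reduce((const char *)sendbuf + done * extent,
>>                        (char *)recvbuf + done * extent,
>>                        c, dt, op, root, comm);
>>             done += c;
>>         }
>>         return MPI_SUCCESS;
>>     }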
>>
>> To do the large-count reduction in one nonblocking MPI call, a derived
>> datatype is required. However, unlike in RMA, reductions cannot use
>> built-in ops for user-defined datatypes, even if they are trivially
>> composed of a large-count of built-in datatypes. See
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/338 and
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/34 for elaborate
>> commentary on why this semantic mismatch is lame.
>>
>> Issue #2: cannot use built-in reduce ops.
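>>
>> To make the mismatch concrete, here is an untested sketch of the kind
>> of type-construction helper a large-count reduction needs (names are
>> illustrative, not BigMPI's actual API):
>>
>>     #include <mpi.h>
>>     #include <limits.h>
>>
>>     /* Sketch: describe `count` contiguous elements of `oldtype` with
>>        one datatype: a vector of INT_MAX-sized chunks plus a
>>        contiguous remainder, glued together with a struct type. */
>>     int Type_contiguous_x_sketch(MPI_Count count, MPI_Datatype oldtype,
>>                                  MPI_Datatype *newtype)
>>     {
>>         MPI_Count c = count / INT_MAX;   /* full chunks       */
>>         MPI_Count r = count % INT_MAX;   /* leftover elements */
>>
>>         MPI_Datatype chunks, remainder;
>>         MPI_Type_vector((int)c, INT_MAX, INT_MAX, oldtype, &chunks);
>>         MPI_Type_contiguous((int)r, oldtype, &remainder);
>>
>>         MPI_Aint lb, extent;
>>         MPI_Type_get_extent(oldtype, &lb, &extent);
>>
>>         int          blocklens[2] = {1, 1};
>>         MPI_Aint     displs[2]    = {0, (MPI_Aint)c * INT_MAX * extent};
>>         MPI_Datatype types[2]     = {chunks, remainder};
>>         MPI_Type_create_struct(2, blocklens, displs, types, newtype);
>>
>>         MPI_Type_free(&chunks);
>>         MPI_Type_free(&remainder);
>>         return MPI_Type_commit(newtype);
>>     }
>>
>> The catch is that the obvious next step,
>>
>>     MPI_Ireduce(sendbuf, recvbuf, 1, bigtype, MPI_SUM, root, comm, &req);
>>
>> is invalid, because the built-in MPI_SUM is not defined on the
>> user-defined bigtype.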
>>
>> Once we rule out using built-in ops with our large-count datatypes, we
>> must reimplement all of the reduction operations required. I find
>> this to be nontrivial. I have not yet figured out how to get at the
>> underlying datatype info in a simple manner. It appears that
>> MPI_Type_get_envelope exists for this purpose, but it's a huge pain to
>> have to call this function when all I need to know is the number of
>> built-in datatypes so that I can apply my cleverness and use
>> MPI_Reduce_local inside of my user-defined operation.
>>
>> Issue #3: implementing the user-defined reduce op isn't easy (in my opinion).
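>>
>> To give a flavor of what the user-defined op ends up looking like,
>> here is an untested sketch that dodges MPI_Type_get_envelope entirely
>> by assuming the datatype describes contiguous, homogeneous
>> MPI_DOUBLEs; anything more general gets much uglier:
>>
>>     #include <mpi.h>
>>     #include <limits.h>
>>
>>     /* Sketch: user-defined sum for a large, contiguous, all-double
>>        datatype.  Recovers the element count from the type size
>>        (assuming MPI_DOUBLE matches C double) and defers the
>>        arithmetic to MPI_Reduce_local in INT_MAX-sized pieces. */
>>     static void bigsum_double(void *in, void *inout, int *len,
>>                               MPI_Datatype *dt)
>>     {
>>         /* assumes *len == 1, i.e. one instance of the big datatype */
>>         MPI_Count typesize;
>>         MPI_Type_size_x(*dt, &typesize);
>>         MPI_Count n = typesize / (MPI_Count)sizeof(double);
>>
>>         double *a = (double *)in, *b = (double *)inout;
>>         MPI_Count done = 0;
>>         while (done < n) {
>>             int c = (n - done > INT_MAX) ? INT_MAX : (int)(n - done);
>>             MPI_Reduce_local(a + done, b + done, c, MPI_DOUBLE, MPI_SUM);
>>             done += c;
>>         }
>>     }
>>
>>     /* registered with: MPI_Op_create(bigsum_double, 1, &bigsum_op); */
>>
>> And that covers exactly one op on exactly one built-in type; the same
>> exercise has to be repeated for every (op, type) pair one cares about.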
>>
>> Many MPI implementations optimize reductions. On Blue Gene/Q, MPI has
>> explicitly vectorized intrinsic/assembly code. Unless
>> MPI_Reduce_local hits that code path, I am losing a huge amount of
>> performance in reductions when I go from 2^31 to 2^31+1 elements. I
>> would not be surprised at all if user-defined ops+datatypes exercise
>> suboptimal code paths in many MPI implementations, which means that
>> the performance of nonblocking reductions is unnecessarily crippled.
>>
>> Issue #4: inability to use optimizations in the MPI implementation.
>>
>> # Conclusion
>>
>> I believe this problem is best addressed in one of two ways:
>>
>> 1) Approve the semantic changes requested in tickets 34 and 338 so
>> that one can use built-in ops with homogeneous user-defined datatypes.
>> This is my preference for multiple reasons.
>>
>> 2) Add large-count reductions to the standard. This means 8 new
>> functions: blocking and nonblocking (all)reduce and
>> reduce_scatter(_block). We don't need large-count functions for any
>> other collectives because the datatype solution works just fine there,
>> as I've already demonstrated in BigMPI
>> (https://github.com/jeffhammond/BigMPI/blob/master/src/collectives_x.c).
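>>
>> For comparison, if tickets 34 and 338 were approved (option 1), the
>> end state for users would be roughly the following, reusing the
>> type-construction sketch from above (bigcount, fock, and req are
>> illustrative names):
>>
>>     MPI_Datatype bigtype;
>>     MPI_Request req;
>>     Type_contiguous_x_sketch(bigcount, MPI_DOUBLE, &bigtype);
>>     MPI_Iallreduce(MPI_IN_PLACE, fock, 1, bigtype, MPI_SUM,
>>                    MPI_COMM_WORLD, &req);
>>     /* overlap other work here */
>>     MPI_Wait(&req, MPI_STATUS_IGNORE);
>>     MPI_Type_free(&bigtype);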
>>
>> # Social Commentary
>>
>> From now on, when the Forum punts on things and says it's no problem
>> for users to roll their own using the existing functionality in MPI,
>> we should strive to be a bit more diligent and actually prototype that
>> implementation in a manner that proves how easy it is for users. It
>> turns out that writing code for some things is harder than just talking
>> about them in a conference room...
>>
>> # Related
>>
>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/338#comment:9
>> captures some of this feedback in Trac.
>>
>> I created https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/423 for
>> the reasons described therein.
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> _______________________________________________
>> mpi-forum mailing list
>> mpi-forum at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum
>
>
>
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA