[Mpi-forum] large count support not as easy as people seem to have thought

Tue May 6 16:14:22 CDT 2014

On Tue, May 6, 2014 at 3:54 PM, Hjelm, Nathan T <hjelmn at lanl.gov> wrote:
> +1 for large counts. I find it just a bit ridiculous that this was punted on by the Forum, but I wasn't around for the discussion. Is this an issue forced on C/C++ by Fortran?

Unfortunately, I don't think we can blame this one on Fortran.  Users
want to pass large counts from the C interface.  That's the motivation
for BigMPI, at least.

The Fortran users that want -i8 and friends to work with the MPI
Fortran interface are another issue that doesn't need new functions in
the standard, just a bunch of interface gymnastics that most
implementers won't enjoy.

Best,

Jeff

> As I noted on this list last Friday, I am working on a higher-level
> library to support large counts for MPI communication functions
> (https://github.com/jeffhammond/BigMPI).
>
> In the course of actually trying to implement this the way the Forum
> contends it can be done - i.e. using derived-datatypes - I have found
> some issues that undermine the Forum's contention that it is so easy
> for users to do it that it doesn't need to be in the standard.
>
> To illustrate some of the issues that I have found, let us consider
> the large-count implementation of nonblocking reduce...
>
> # Example Use Case
>
> It is entirely reasonable to think that some quantum chemist will want
> to reduce a contiguous buffer of more than 2^31 doubles corresponding
> to the Fock matrix if they have a multithreaded code, since 16 GB is
> not an unreasonable amount of memory per node.
>
> # Issues
>
> Unlike the blocking case, where it is reasonable to chop up the data
> and perform multiple operations (e.g.
> https://github.com/jeffhammond/BigMPI/blob/master/src/reductions_x.c,
> assuming that untested code is correct), one must return a single
> request to the application if one is to implement MPIX_I(all)reduce_x
> with the same semantics as MPI_Iallreduce, as I aspire to do in
> BigMPI.
>
> Issue #1: chopping doesn't work for nonblocking.
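
A minimal sketch of the chopping arithmetic for the blocking case, not
BigMPI's actual code. The names chop, chunk_fn, record and struct tally
are hypothetical; the callback stands in for one blocking MPI call (for
example MPI_Allreduce) on a count that fits in an int. Each piece is a
separate operation, which is why the same loop in the nonblocking case
would produce several requests rather than one.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical callback standing in for one blocking MPI call on `len`
 * elements starting at element index `offset`. */
typedef void (*chunk_fn)(uint64_t offset, int len, void *ctx);

/* Split `count` elements into pieces of at most `max_chunk` elements
 * (INT_MAX for a real MPI count argument) and call `fn` once per piece.
 * Returns the number of pieces, i.e. the number of separate operations. */
static uint64_t chop(uint64_t count, int max_chunk, chunk_fn fn, void *ctx)
{
    uint64_t offset = 0, calls = 0;
    assert(max_chunk > 0);
    while (offset < count) {
        uint64_t left = count - offset;
        int len = (left > (uint64_t)max_chunk) ? max_chunk : (int)left;
        fn(offset, len, ctx);
        offset += (uint64_t)len;
        calls++;
    }
    return calls;
}

/* Example callback: it touches no data and only tallies the work. */
struct tally { uint64_t elems; uint64_t calls; };

static void record(uint64_t offset, int len, void *ctx)
{
    struct tally *t = (struct tally *)ctx;
    (void)offset;
    t->elems += (uint64_t)len;
    t->calls++;
}
```

With count = 2*INT_MAX + 5 and max_chunk = INT_MAX, chop makes three
calls covering INT_MAX, INT_MAX and 5 elements.
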
>
> To do the large-count reduction in one nonblocking MPI call, a derived
> datatype is required.  However, unlike in RMA, reductions cannot use
> built-in ops for user-defined datatypes, even if they are trivially
> composed of a large-count of built-in datatypes.  See
> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/338 and
> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/34 for elaborate
> commentary on why this semantic mismatch is lame.
>
> Issue #2: cannot use built-in reduce ops.
>
> Once we rule out using built-in ops with our large-count datatypes, we
> must reimplement all of the reduction operations required.  I find
> this to be nontrivial.  I have not yet figured out how to get at the
> underlying datatype info in a simple manner.  It appears that
> MPI_Type_get_envelope exists for this purpose, but it's a huge pain to
> have to call this function when all I need to know is the number of
> built-in datatypes so that I can be clever and use
> MPI_Reduce_local inside of my user-defined operation.
>
> Issue #3: implementing the user-defined reduce op isn't easy (in my opinion).
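
A rough sketch of the shape such a user-defined op takes, written
without MPI so that it stands alone. The big_type_t struct and the
big_sum function are hypothetical: big_type_t stands in for an
MPI_Datatype handle whose element count is already known, which is
exactly the information that is awkward to recover in real MPI. The
argument order mirrors MPI_User_function (invec, inoutvec, len,
datatype), and the plain loop stands in for whatever the op would
really call, such as MPI_Reduce_local.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a derived datatype known to consist of
 * `nelems` contiguous doubles. With real MPI this number would have to
 * be recovered from the MPI_Datatype handle. */
typedef struct { size_t nelems; } big_type_t;

/* Same argument shape as an MPI user function. Note that `*len` counts
 * derived-type elements, not doubles, so the number of doubles to
 * combine is (*len) * dtype->nelems. */
static void big_sum(void *invec, void *inoutvec, int *len, big_type_t *dtype)
{
    const double *in = (const double *)invec;
    double *inout = (double *)inoutvec;
    size_t total, i;

    assert(*len >= 0);
    total = (size_t)(*len) * dtype->nelems;
    for (i = 0; i < total; i++)
        inout[i] += in[i];   /* element-wise sum into inoutvec */
}
```

Called with len = 2 and nelems = 3, big_sum combines six doubles even
though the caller only sees a count of two.
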
>
> Many MPI implementations optimize reductions.  On Blue Gene/Q, MPI has
> explicitly vectorized intrinsic/assembly code.  Unless
> MPI_Reduce_local hits that code path, I am losing a huge amount of
> performance in reductions when I go from 2^31-1 to 2^31 elements.  I
> would not be surprised at all if user-defined ops+datatypes exercises
> suboptimal code paths in many MPI implementations, which means that
> the performance of nonblocking reductions is unnecessarily crippled.
>
> Issue #4: inability to use optimizations in the MPI implementation.
>
> # Conclusion
>
> I believe this problem is best addressed in one of two ways:
>
> 1) Approve the semantic changes requested in tickets 34 and 338 so
> that one can use built-in ops with homogeneous user-defined datatypes.
> This is my preference for multiple reasons.
>
> 2) Add large-count reductions to the standard.  This means 8 new
> functions: blocking and nonblocking (all)reduce and
> reduce_scatter(_block).  We don't need large-count functions for any
> other collectives because the datatype solution works just fine there,
> as I've already demonstrated in BigMPI
> (https://github.com/jeffhammond/BigMPI/blob/master/src/collectives_x.c).
>
> # Social Commentary
>
> From now on, when the Forum punts on things and says it's no problem
> for users to roll their own using the existing functionality in MPI,
> we should strive to be a bit more diligent and actually prototype that
> implementation in a manner that proves how easy it is for users.  It
> turns out, writing code for some things is harder than just talking
> about them in a conference room...
>
> # Related
>
> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/338#comment:9
> captures some of this feedback in Trac.
>
> I created https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/423 for
> the reasons described therein.
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> _______________________________________________
> mpi-forum mailing list
> mpi-forum at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi-forum

-- 
Jeff Hammond
jeff.science at gmail.com