[Mpi-forum] large count support not as easy as people seem to have thought
jeff.science at gmail.com
Tue May 6 12:19:37 CDT 2014
As I noted on this list last Friday, I am working on a higher-level
library to support large counts for MPI communication functions.
In the course of actually trying to implement this the way the Forum
contends it can be done - i.e. using derived-datatypes - I have found
some issues that undermine the Forum's contention that it is so easy
for users to do it that it doesn't need to be in the standard.
To illustrate some of the issues that I have found, let us consider
the large-count implementation of nonblocking reduce...
# Example Use Case
It is entirely reasonable to think that some quantum chemist will want
to reduce a contiguous buffer of more than 2^31 doubles corresponding
to the Fock matrix if they have a multithreaded code, since 16 GB is
not an unreasonable amount of memory per node.
Unlike the blocking case, where it is reasonable to chop up the data
and perform multiple operations (e.g.
assuming that untested code is correct), one must return a single
request to the application if one is to implement MPIX_I(all)reduce_x
with the same semantics as MPI_Iallreduce, as I aspire to do in
Issue #1: chopping doesn't work for nonblocking.
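
To make the chopping arithmetic concrete, here is a minimal sketch in plain C with no MPI calls. The names `chunk_plan`, `plan_chunks` and `calls_needed` are hypothetical, and the chunk size of INT_MAX is an assumption based on MPI counts being a C `int` (taken to be 32 bits here).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper type: how a large count splits into int-sized pieces. */
typedef struct {
    int64_t full_chunks; /* chunks holding exactly INT_MAX elements */
    int     remainder;   /* elements in the final partial chunk (may be 0) */
} chunk_plan;

/* Split a 64-bit element count into INT_MAX-sized chunks plus a remainder. */
chunk_plan plan_chunks(int64_t count)
{
    assert(count >= 0);
    chunk_plan p;
    p.full_chunks = count / INT_MAX;
    p.remainder   = (int)(count % INT_MAX);
    return p;
}

/* Number of separate int-count calls needed to cover the whole buffer. */
int64_t calls_needed(int64_t count)
{
    chunk_plan p = plan_chunks(count);
    return p.full_chunks + (p.remainder > 0 ? 1 : 0);
}
```

For 2^31+1 elements this gives one full chunk plus a remainder of 2, so two calls. In the blocking case that is merely two calls in a row; in the nonblocking case it would mean two requests, where the application expects exactly one.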
To do the large-count reduction in one nonblocking MPI call, a derived
datatype is required. However, unlike in RMA, reductions cannot use
built-in ops for user-defined datatypes, even if they are trivially
composed of a large-count of built-in datatypes. See
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/34 for elaborate
commentary on why this semantic mismatch is lame.
Issue #2: cannot use built-in reduce ops.
Once we rule out using built-in ops with our large-count datatypes, we
must reimplement all of the reduction operations required. I find
this to be nontrivial. I have not yet figured out how to get at the
underlying datatype info in a simple manner. It appears that
MPI_Type_get_envelope exists for this purpose, but it's a huge pain to
have to call this function when all I need to know is the number of
built-in datatypes so that I can be clever and use
MPI_Reduce_local inside of my user-defined operation.
Issue #3: implementing the user-defined reduce op isn't easy (in my opinion).
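
The control flow such a user-defined op needs can be sketched in plain C. This is a model under stated assumptions, not working MPI code: `sum_local` is a stand-in for MPI_Reduce_local with MPI_DOUBLE and MPI_SUM, `large_sum` and its parameters are hypothetical, and `elems_per_type` is assumed to have been recovered already from the datatype (the painful step, presumably via MPI_Type_get_envelope and MPI_Type_get_contents).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for MPI_Reduce_local(in, inout, count, MPI_DOUBLE, MPI_SUM):
 * element-wise sum that, like the real call, takes an int count. */
void sum_local(const double *in, double *inout, int count)
{
    for (int i = 0; i < count; i++)
        inout[i] += in[i];
}

/* Hypothetical body of a user-defined op for a large-count datatype.
 * len            : number of derived-type elements (what MPI passes as *len)
 * elems_per_type : built-in elements inside one derived-type element
 * max_chunk      : largest count one sum_local call may take
 *                  (INT_MAX in real use; a parameter here so it can be tested) */
void large_sum(const double *in, double *inout, int len,
               int64_t elems_per_type, int max_chunk)
{
    assert(len >= 0 && elems_per_type >= 0 && max_chunk > 0);
    int64_t total = (int64_t)len * elems_per_type;
    int64_t done = 0;
    while (done < total) {
        int64_t left = total - done;
        int n = (left > max_chunk) ? max_chunk : (int)left;
        sum_local(in + done, inout + done, n); /* one int-sized piece */
        done += n;
    }
}
```

Even in this simplified form, the op has to know how many built-in elements hide behind each derived-type element and then walk the buffer in int-sized pieces, and that is before any datatype introspection is written.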
Many MPI implementations optimize reductions. On Blue Gene/Q, MPI has
explicitly vectorized intrinsic/assembly code. Unless
MPI_Reduce_local hits that code path, I am losing a huge amount of
performance in reductions when I go from 2^31 to 2^31+1 elements. I
would not be surprised at all if user-defined ops+datatypes exercises
suboptimal code paths in many MPI implementations, which means that
the performance of nonblocking reductions is unnecessarily crippled.
Issue #4: inability to use optimizations in the MPI implementation.
I believe this problem is best addressed in one of two ways:
1) Approve the semantic changes requested in tickets 34 and 338 so
that one can use built-in ops with homogeneous user-defined datatypes.
This is my preference for multiple reasons.
2) Add large-count reductions to the standard. This means 8 new
functions: blocking and nonblocking (all)reduce and
reduce_scatter(_block). We don't need large-count functions for any
other collectives because the datatype solution works just fine there,
as I've already demonstrated in BigMPI.
# Social Commentary
From now on, when the Forum punts on things and says it's no problem
for users to roll their own using the existing functionality in MPI,
we should strive to be a bit more diligent and actually prototype that
implementation in a manner that proves how easy it is for users. It
turns out, writing code for some things is harder than just talking
about them in a conference room...
captures some of this feedback in Trac.
I created https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/423 for
the reasons described therein.
jeff.science at gmail.com