[Mpi-22] [MPI Forum] #96: MPI_OP_CREATE and associativity

Torsten Hoefler htor at [hidden]
Mon Jan 12 13:05:47 CST 2009


On Mon, Jan 12, 2009 at 05:11:24PM +0100, Jesper Larsson Traeff wrote:
> On Mon, Jan 12, 2009 at 10:36:58AM -0500, Torsten Hoefler wrote:
> > 
> > To optimize this, and retain the order of summation, one could employ
> > similar strategies as for gather to reduce the messaging complexity
> > (e.g. from O(N) to O(log_N)). 
> Since you cannot do anything on intermediate nodes in your gather tree,
> the message complexity is O(Nm), m: size of vector... 
yes - the details actually depend on the model :). As a basis for
discussion, let's use the well-known LogGP model, extended so that the
reduction costs x microseconds per byte. I was talking about a direct
vs. a tree algorithm (which of course only makes sense for small data!):

I assume that all nodes start their participation in the call at the
same time and that the root is the last one to finish. I don't model CPU
overhead because it adds no further information here.

Direct:       t = L + (P-2)*g + (s-1)*G + P*s*x
BinomialTree: t = log_2(P)*L + (log_2(P)-1)*g + sum_{i=1}^{log_2(P)} i*(s-1)*G + P*s*x

The direct receive is simple and needs no explanation. The longest path
in the binomial tree has length log_2(P), and rank 0 receives from
log_2(P) nodes. The messages grow along the critical path (hence the
sum).
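
Just to make the comparison concrete, here is a toy evaluation of the two
expressions (a sketch only; the LogGP parameter values below are made up
for illustration, not measurements):

#include <math.h>
#include <stdio.h>

/* Illustrative LogGP parameters (us, us/byte); made up for this sketch. */
#define L_LAT  5.0    /* latency L */
#define g_GAP  2.0    /* gap g between messages */
#define G_BYTE 0.01   /* gap G per byte */
#define X_RED  0.005  /* reduction cost x per byte */

/* Direct: t = L + (P-2)g + (s-1)G + P*s*x */
static double t_direct(int P, int s)
{
    return L_LAT + (P - 2) * g_GAP + (s - 1) * G_BYTE + (double)P * s * X_RED;
}

/* BinomialTree: t = log_2(P)L + (log_2(P)-1)g
 *                   + sum_{i=1}^{log_2(P)} i*(s-1)G + P*s*x */
static double t_bintree(int P, int s)
{
    int    lg  = (int)log2((double)P);   /* assume P is a power of two */
    double sum = 0.0;
    for (int i = 1; i <= lg; i++)
        sum += i * (s - 1) * G_BYTE;
    return lg * L_LAT + (lg - 1) * g_GAP + sum + (double)P * s * X_RED;
}

int main(void)
{
    for (int P = 8; P <= 1024; P *= 4)
        printf("P=%4d, s=64B: direct=%7.1f us  tree=%7.1f us\n",
               P, t_direct(P, 64), t_bintree(P, 64));
    return 0;
}

The crossover point where the tree wins moves with P, which is exactly
what I mean below by the definition of "small" changing with scale.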

> You only gain something for small problems by reducing the number of
> receives at the root. 
yes, however, the definition of "small" changes with scale (see the
models above). Also, the transmission of mid-sized messages can be
handled in the same model with double or fractional trees (see Karp's
article "Optimal Broadcast and Summation in the LogP Model" from '93).
And yes, I know that this only affects data motion; the reduction still
has to be done in order on rank 0. However, it is often the case
(especially for larger messages and modern architectures such as on-chip
GPUs or Larrabee) that the bandwidth of the processor is much higher
than the bandwidth of the network.

Large messages could use algorithms that control congestion (as in large
Gathers), i.e., only a subset of the processes send at the same time.
The data could also be reduced in order here.
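
(For illustration, a minimal sketch of what "reduce in order at the root"
could look like - reduce_in_order is a hypothetical helper, MPI_DOUBLE and
summation are hard-coded, and there is no congestion control or pipelining:)

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical in-order reduce to rank 0: always applies the operation in
 * rank order, (((rank0 + rank1) + rank2) + ...), independent of the order
 * in which contributions actually arrive. */
static void reduce_in_order(double *sendbuf, double *recvbuf,
                            int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != 0) {
        MPI_Send(sendbuf, count, MPI_DOUBLE, 0, 0, comm);
        return;
    }

    double *tmp = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        recvbuf[i] = sendbuf[i];               /* rank 0's own contribution */

    for (int src = 1; src < size; src++) {     /* fixed rank order, not arrival order */
        MPI_Recv(tmp, count, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < count; i++)
            recvbuf[i] += tmp[i];
    }
    free(tmp);
}

A congestion-controlled variant would only change how the sends are
scheduled; the summation order at the root stays the same.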

All of those optimizations are non-trivial and very network-dependent.

> In the case where you use some dynamic algorithm where things may arrive in
> some unspecified order, you'd again have to buffer the contributions from
> the processes before you can do the reduction.
As I said, this depends. And you have the same problems with the current
MPI reduction rules, i.e., you either have to buffer or to synchronize;
out-of-order reduction doesn't exist under the current rules either. So
in such an implementation, you'll either buffer (small messages) or
synchronize (large messages), just as you do for the current reductions.

> > yes, they can still be treated as associative (for all internal types),
> > but the user should have a chance to define non-associative reductions.
> > 
> In the apps you refer to, I guess they often want an MPI_SUM on an MPI_FLOAT
> done in some particular order with a particular bracketing. Thus, what
> you need is some ability to control each individual MPI_Reduce call?
yes, but I guess controlling each individual call is not necessary.
However, controlling the order of operations might be necessary to ensure
correctness. But I don't know of a particular app, so the whole argument
is weak. I just want to discuss the issue and either understand why we
don't have this feature or discuss adding it. My point is that there are
(limited) optimization possibilities and that we limit ourselves
enormously by enforcing the same reduction order for all float/double
operations and all user-defined operations in MPI-2.1. If associativity
were an explicit property of a user-defined operation, then we could
apply dynamic optimizations based on arrival time, which we can't do
right now (side note: we can already do this for associative ops, such as
all ops on non-fp types).
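
(For reference, the current interface: MPI_Op_create only takes a
commutativity flag, and associativity of the user function is always
assumed - which is exactly the limitation I mean. The op below is just an
illustrative element-wise double sum:)

#include <mpi.h>
#include <stdio.h>

/* Illustrative user-defined op: element-wise sum on doubles. */
static void my_fp_sum(void *in, void *inout, int *len, MPI_Datatype *dtype)
{
    double *a = (double *)in, *b = (double *)inout;
    (void)dtype;
    for (int i = 0; i < *len; i++)
        b[i] += a[i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op op;
    /* The second argument declares commutativity only; there is no way to
     * say "this operation is NOT associative, keep the canonical order". */
    MPI_Op_create(my_fp_sum, 1, &op);

    double x = (double)rank + 0.1, sum = 0.0;
    MPI_Reduce(&x, &sum, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", sum);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}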

Thanks & Best,
  Torsten


-- 
 bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
Torsten Hoefler       | Postdoctoral Researcher
Open Systems Lab      | Indiana University    
 150 S. Woodlawn Ave.  | Bloomington, IN, 47405, USA
Lindley Hall Room 135 | +01 (812) 855-3608





