[mpi3-coll] Neighborhood collectives round 2: reductions

Jed Brown jedbrown at mcs.anl.gov
Mon Dec 17 13:35:38 CST 2012

On Mon, Dec 17, 2012 at 10:07 AM, Moore, Shirley V <svmoore at utep.edu> wrote:

> Jed and Torsten,
> Sorry not to have responded until now but I have had a heavy travel
> schedule lately.
> I have previously discussed a quantum mechanics application code with
> Torsten that might benefit from an efficient implementation of neighborhood
> collectives. For this code, we have a distributed array in which processes
> store projector data for the ions with which they overlap. Processes that
> overlap the same ions store redundant copies of the data, but this is a vast
> improvement over earlier implementations in which the entire array was
> replicated on all the processes. After computing their contribution to the
> new projectors, the processes exchange and sum the data with other
> processes that overlap the same ions. Currently this is handled with
> point-to-point communication. Although better than the earlier
> implementations that used MPI_Allreduce, the current implementation still
> has substantial communication overhead. The neighborhoods are irregular and
> are not entirely static, although they change fairly slowly. Whether the
> vector that is exchanged is the same or not depends on whether we do one
> communication phase per ion or do all the communication in one call.
> Currently we do it in one call, i.e., a process sends the data for all its
> ions to all the processes with which it has overlap, and processes just
> throw away the data they don't need.

This sounds like a similar structure to the mesh-based PDE applications I
was explaining. If you determine which ions your neighbors have, you could
reduce only those that will be used, but then you need the "sparse
reduction" that I was suggesting. I don't foresee the approach of sending
everything and doing key matching on the receiver being expressible as an
MPI reduction, though you could still do a Gather with neighborhood
collectives and reduce by key locally. (This would be impractical for PDE
applications; I don't know about your case.)

> But since they don't have a data location for data they don't store, this
> is not a straight reduction but rather they go through the list and pick
> out what they need. The problem with doing one communication phase per ion
> is that the number of ions with which a process has overlap can be fairly
> large -- e.g., on the order of 50.
> Hope this helps. Please let me know if you have any questions or comments.
> Regards,
> Shirley Moore
> ________________________________________
> From: five9a2 at gmail.com [five9a2 at gmail.com] On Behalf Of Jed Brown [
> jedbrown at mcs.anl.gov]
> Sent: Sunday, December 16, 2012 10:53 PM
> To: Torsten Hoefler
> Cc: Moore, Shirley V; MPI-3 Collective Subgroup Discussions
> Subject: Re: Neighborhood collectives round 2: reductions
> On Sat, Dec 15, 2012 at 8:08 AM, Torsten Hoefler <htor at illinois.edu> wrote:
>    Those use cases (http://lists.mpi-forum.org/mpi3-coll/2011/11/0239.php)
>    were all dependent on being able to reduce to overlapping targets.
> Depends on your definition of target.  If you mean processes by
> "targets", then the current interface proposal provides this; if you
> mean memory locations at one process by "targets", then this will not be
> possible within current MPI semantics.
> I mean that the memory overlaps on the processor accumulating the result
> of the reduction. Think of a bunch of subdomains of a regular grid with one
> or two cells of overlap. An example of a "reduction" is to add up the
> contribution from all copies of each given cell. Cells near the middle of a
> "face" are only shared by two processes, but corner cells are shared by
> several processes.
> >    As for defining "identity", the operation I would like is to reduce by
>    combining with a local buffer (usually in-place destination buffer). That
>    is, if I have the local buffer
> >    mine = [1.1, 2.1, 3.1, 4.1, 5.1, 6.1]
> This can be expressed as a self-edge (we can discuss in-place
> arguments, but then you would need to guarantee that the local buffer is
> larger than the largest neighbor buffer).
> Useful application semantics would require the same.
> >    and vector types for my two neighbors (defined by me)
> >    incoming_type[0] = [0, 3, 4]
> >    incoming_type[1] = [1, 4]
> >    with incoming data (actually getting sent)
> >    incoming_data[0] = [10.2, 20.2, 30.2]
> >    incoming_data[1] = [100.3, 200.3]
> >    the result would be
>    [op(1.1, 10.2), op(2.1, 100.3), 3.1, op(4.1, 20.2),
>     op(5.1, 30.2, 200.3), 6.1]
>    This would be a natural expression of the operation I call "SFReduce"
>    in http://59A2.org/files/StarForest.pdf
> I see, this may be doable with the vector interface (if we remove the
> restriction of equal vector sizes -- this would remove some optimization
> opportunities). Can you confirm that the current proposed
> neighbor_reducev() interface can cover this case?
> If you remove the restriction of equal vector sizes, are you going to add
> an MPI_Datatype describing where to put the result? (I'd expect that to be
> a neighbor_reducew.) Note that in general, there would be some points
> shared by neighbors {1,2} and other points shared by neighbors {1,3} (and
> {2,3}, ...) thus we can't just sort such that the reduction is always
> applied to the "first N" elements.
> One remaining question is whether you can always guarantee "packed" data,
> i.e., that the "empty" elements are always at the tail. Well, I guess
> you could always add identity elements in the middle to create gaps.
> I could pack, but I thought the point of the W interfaces was to enable
> the user to avoid packing (with possible performance advantages relative to
> fully-portable user code).
> Also, would the static communication topology work for your use cases
> (i.e., neighborhoods don't change for a while)?
> Yes, comm topology is typically static, and we wouldn't use the
> neighborhood routines if it was changing frequently.
