[mpi3-coll] Neighborhood collectives round 2: reductions

Moore, Shirley V svmoore at utep.edu
Mon Dec 17 11:07:03 CST 2012

Jed and Torsten,

Sorry not to have responded until now, but I have had a heavy travel schedule lately.

I have previously discussed a quantum mechanics application code with Torsten that might benefit from an efficient implementation of neighborhood collectives. In this code, we have a distributed array in which processes store projector data for the ions with which they overlap. Processes that overlap the same ions store redundant copies of the data, but this is a vast improvement over earlier implementations in which the entire array was replicated on all the processes. After computing their contributions to the new projectors, the processes exchange and sum the data with the other processes that overlap the same ions. Currently this is handled with point-to-point communication. Although better than the earlier implementations that used MPI_Allreduce, the current implementation still has substantial communication overhead. The neighborhoods are irregular and not entirely static, although they change fairly slowly.

Whether the vector that is exchanged is the same for all neighbors depends on whether we do one communication phase per ion or all the communication in one call. Currently we do it in one call - i.e., a process sends the data for all its ions to all the processes with which it has overlap, and processes simply throw away the data they don't need. But since they have no data location for ions they don't store, this is not a straight reduction; rather, each process goes through the list and picks out what it needs. The problem with doing one communication phase per ion is that the number of ions with which a process overlaps can be fairly large -- e.g., on the order of 50.
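As a non-normative sketch of the exchange pattern described above (plain Python, with illustrative names and toy data that are not from the actual application): each process receives a neighbor's full ion list and keeps only the entries for ions it stores, summing contributions for shared ions.

```python
def merge_neighbor_data(local, incoming_lists):
    """local: dict mapping ion_id -> projector contribution (a float here).
    incoming_lists: one list of (ion_id, value) pairs per neighbor."""
    for pairs in incoming_lists:
        for ion_id, value in pairs:
            if ion_id in local:            # keep only ions this process stores
                local[ion_id] += value     # sum contributions for shared ions
            # entries for ions this process doesn't store are discarded
    return local

# This process stores ions 3 and 7; neighbors also send ions it lacks (e.g. 9).
mine = {3: 1.0, 7: 2.0}
from_neighbors = [[(3, 0.5), (9, 4.0)], [(7, 0.25), (3, 0.125)]]
merge_neighbor_data(mine, from_neighbors)
# mine is now {3: 1.625, 7: 2.25}; ion 9 was thrown away
```

The "pick out what you need" filtering step is what makes this not a straight reduction in the MPI sense.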

Hope this helps. Please let me know any questions or comments.

Shirley Moore
From: five9a2 at gmail.com [five9a2 at gmail.com] On Behalf Of Jed Brown [jedbrown at mcs.anl.gov]
Sent: Sunday, December 16, 2012 10:53 PM
To: Torsten Hoefler
Cc: Moore, Shirley V; MPI-3 Collective Subgroup Discussions
Subject: Re: Neighborhood collectives round 2: reductions

On Sat, Dec 15, 2012 at 8:08 AM, Torsten Hoefler <htor at illinois.edu> wrote:
>    Those use cases (http://lists.mpi-forum.org/mpi3-coll/2011/11/0239.php)
>    were all dependent on being able to reduce to overlapping targets.
Depends on your definition of target.  If you mean processes by
"targets", then the current interface proposal provides this; if you
mean memory locations at one process by "targets", then this will not be
possible within current MPI semantics.

I mean that the memory overlaps on the processor accumulating the result of the reduction. Think of a bunch of subdomains of a regular grid with one or two cells of overlap. An example of a "reduction" is to add up the contribution from all copies of each given cell. Cells near the middle of a "face" are only shared by two processes, but corner cells are shared by several processes.
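A small sketch of the sharing counts described above (illustrative only, not from the thread): a 6x6 grid split into four 3x3 blocks, each extended by one cell of overlap, so we can count how many subdomains share a given cell.

```python
# Count how many overlapping subdomains contain each cell of a 6x6 grid
# decomposed into four 3x3 blocks extended by one cell of overlap.
from itertools import product

def subdomain(bi, bj, overlap=1, block=3, n=6):
    r0 = max(0, bi * block - overlap); r1 = min(n, (bi + 1) * block + overlap)
    c0 = max(0, bj * block - overlap); c1 = min(n, (bj + 1) * block + overlap)
    return {(i, j) for i in range(r0, r1) for j in range(c0, c1)}

subs = [subdomain(bi, bj) for bi, bj in product(range(2), range(2))]

def sharers(cell):
    return sum(cell in s for s in subs)

print(sharers((0, 0)))  # interior cell: held by 1 subdomain
print(sharers((0, 3)))  # cell near the middle of a "face": shared by 2
print(sharers((3, 3)))  # corner cell: shared by all 4
```

A reduction over this overlap adds up the contributions from every subdomain sharing a cell, so the combining operation must accept a varying number of inputs per location.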

>    As for defining "identity", the operation I would like is to reduce by
>    combining with a local buffer (usually in-place destination buffer). That
>    is, if I have the local buffer
>    mine = [1.1, 2.1, 3.1, 4.1, 5.1, 6.1]
This can be expressed as a self-edge (we can discuss in-place arguments,
but then you would need to guarantee that the local buffer is larger
than the largest neighbor buffer).

Useful application semantics would require the same.

>    and vector types for my two neighbors (defined by me)
>    incoming_type[0] = [0, 3, 4]
>    incoming_type[1] = [1, 4]
>    with incoming data (actually getting sent)
>    incoming_data[0] = [10.2, 20.2, 30.2]
>    incoming_data[1] = [100.3, 200.3]
>    the result would be
>    [op(1.1, 10.2), op(2.1, 100.3), 3.1, op(4.1, 20.2), op(5.1, 30.2, 200.3),
>    6.1]
>    This would be a natural expression of the operation I call "SFReduce" in
>    http://59A2.org/files/StarForest.pdf
I see, this may be doable with the vector interface (if we remove the
restriction of equal vector sizes -- this would remove some optimization
opportunities). Can you confirm that the current proposed
neighbor_reducev() interface can cover this case?

If you remove the restriction of equal vector sizes, are you going to add an MPI_Datatype describing where to put the result? (I'd expect that to be a neighbor_reducew.) Note that in general, there would be some points shared by neighbors {1,2} and other points shared by neighbors {1,3} (and {2,3}, ...) thus we can't just sort such that the reduction is always applied to the "first N" elements.
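The semantics of the quoted example can be sketched in plain Python (this is only an executable restatement of the example, not a proposed MPI interface): each neighbor's packed data is combined into the local buffer at the positions named by that neighbor's index list, and untouched entries keep their local value.

```python
def sf_reduce(mine, incoming_types, incoming_data, op):
    """Combine each neighbor's packed data into a copy of `mine` at the
    positions named by that neighbor's index list."""
    result = list(mine)
    for idxs, data in zip(incoming_types, incoming_data):
        for i, v in zip(idxs, data):
            result[i] = op(result[i], v)
    return result

mine = [1.1, 2.1, 3.1, 4.1, 5.1, 6.1]
incoming_type = [[0, 3, 4], [1, 4]]
incoming_data = [[10.2, 20.2, 30.2], [100.3, 200.3]]

# With op = addition, position 4 accumulates both neighbors' contributions,
# while positions 2 and 5 keep their local values.
out = sf_reduce(mine, incoming_type, incoming_data, lambda a, b: a + b)
```

Note that positions touched by neither neighbor pass through unchanged, which is the "identity" behavior under discussion.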

One remaining question is whether you can always guarantee "packed" data,
i.e., that the "empty" elements are always at the tail. Well, I guess
you could always add identity elements in the middle to create gaps.

I could pack, but I thought the point of the W interfaces was to enable the user to avoid packing (with possible performance advantages relative to fully-portable user code).

Also, would the static communication topology work for your use cases
(i.e., neighborhoods don't change for a while)?

Yes, the communication topology is typically static, and we wouldn't use the neighborhood routines if it were changing frequently.

More information about the mpiwg-coll mailing list