[Mpi3-rma] RMA proposal 1 update
Underwood, Keith D
keith.d.underwood at intel.com
Fri May 21 11:22:00 CDT 2010
There is a proposal on the table to add flush(rank) and flushall() as local calls for passive target to allow incremental remote completion without calling unlock(). There is an additional proposal to also include allflushall(), which would be a collective remote completion semantic added to passive target (nominally, we might not want to call it passive target anymore if we did that, but that is a different discussion). The question I had posed was: can you really get a measurable performance advantage in realistic usage scenarios by having allflushall() instead of just doing flushall() + barrier()? I was asking if someone could provide an implementation sketch - preferably along with a real usage scenario where that implementation sketch would yield a performance advantage. A back-of-the-envelope assessment of that would be really nice to have.
My contention was that the number of targets with outstanding requests from a given process between one flush/flushall and the next one would frequently not be large enough to justify the cost of a collective. Alternatively, if messages in that window are "large" (for some definition of "large" that is likely less than 4KB and is certainly no larger than the rendezvous threshold), I would contend that generating a software ack for each one would be essentially zero overhead and would allow source-side tracking of remote completion such that a flushall could be a local operation.
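To make the source-side tracking idea concrete, here is a minimal sketch in Python. It is only a model of the bookkeeping, not an implementation proposal: the AckTracker class and its method names are hypothetical, and it assumes the target returns exactly one software ack per request.

```python
class AckTracker:
    """Origin-side bookkeeping for remote completion (hypothetical helper).

    Assumption: the target sends exactly one software ack per request.
    """

    def __init__(self):
        # target rank -> requests issued but not yet acknowledged
        self.outstanding = {}

    def on_request_issued(self, target):
        self.outstanding[target] = self.outstanding.get(target, 0) + 1

    def on_ack_received(self, target):
        pending = self.outstanding.get(target, 0)
        if pending == 0:
            raise ValueError("ack from rank %d with nothing outstanding" % target)
        if pending == 1:
            # Last outstanding request to this target has completed remotely.
            del self.outstanding[target]
        else:
            self.outstanding[target] = pending - 1

    def flush_complete(self, target):
        """True when flush(target) could return without any communication."""
        return target not in self.outstanding

    def flushall_complete(self):
        """True when flushall() could return without any communication."""
        return not self.outstanding


tracker = AckTracker()
tracker.on_request_issued(3)
tracker.on_request_issued(3)
tracker.on_request_issued(7)
tracker.on_ack_received(3)
tracker.on_ack_received(7)
# Rank 7 is complete; rank 3 still has one request in flight.
```

With this kind of bookkeeping, flushall() only has to wait for the counters to drain, so its cost depends on how many targets actually have requests outstanding rather than on the size of the job.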
There is a completely separate discussion that needs to occur on whether it is better to add collective completion to passive target or one-sided completion to active target (both of which are likely to meet some resistance because of the impurity they introduce into the model/naming) or whether the two need to be mixed at all.
> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> Sent: Friday, May 21, 2010 7:16 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> It is now obvious that I did not do a sufficient job of explaining
> before. It's not even clear to me anymore exactly what we're talking
> about - is this ARMCI? MPI2? MPI3? Active-target one-sided? Passive
> target? My previous explanation was for BG/P MPI2 active-target
> one-sided using MPI_Win_fence synchronization. BG/P MPI2
> implementations for MPI_Win_post/start/complete/wait and
> passive-target MPI_Win_lock/unlock use different methods for
> determining target completion, although they may use some/all of the
> same structures.
> What was the original question?
> Douglas Miller BlueGene Messaging Development
> IBM Corp., Rochester, MN USA Bldg 030-2 A410
> dougmill at us.ibm.com Douglas Miller/Rochester/IBM
> From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> To: "MPI 3.0 Remote Memory Access working group"
> <mpi3-rma at lists.mpi-forum.org>
> Date: 05/20/2010 04:56
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> Please respond to: "MPI 3.0 Remote Memory Access working group"
> <mpi3-rma at lists.mpi-forum.org>
> My point was, the way Jeff is doing synchronization in NWChem is via a
> fenceall(); barrier(); on the equivalent of MPI_COMM_WORLD. If I knew
> he was going to be primarily doing this (i.e., that he wanted to know
> that all nodes were synched), I would do something like maintain
> counts of sent and received messages on each node. I could then do
> something like an allreduce of those 2 ints over the tree to determine
> if everyone is synched. There are probably some technical details that
> would have to be worked out to ensure this works but it seems good
> from 10000 feet.
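A toy model of that counting scheme, in Python, might look like the following. It is a sketch under stated assumptions, not anyone's actual implementation: allreduce_sum stands in for the tree allreduce, the in_flight list and the deliver helper are hypothetical stand-ins for the network, and the "technical details" mentioned above (for example, taking a consistent snapshot of the counters) are ignored.

```python
class Node:
    def __init__(self):
        self.sent = 0      # one-sided messages this node has issued
        self.received = 0  # one-sided messages this node has processed


def allreduce_sum(pairs):
    """Stand-in for a tree allreduce of two ints per node (assumption)."""
    return sum(s for s, _ in pairs), sum(r for _, r in pairs)


def all_synched(nodes):
    """Everyone is synched when global sent equals global received."""
    sent, received = allreduce_sum([(n.sent, n.received) for n in nodes])
    return sent == received


def deliver(in_flight, nodes, count):
    """Hypothetical network model: deliver up to `count` pending messages."""
    for _ in range(min(count, len(in_flight))):
        target = in_flight.pop(0)
        nodes[target].received += 1


def allfenceall(nodes, in_flight, deliveries_per_round=2):
    """Repeat the allreduce until the counts match; return rounds used."""
    rounds = 0
    while True:
        rounds += 1
        if all_synched(nodes):
            return rounds
        deliver(in_flight, nodes, deliveries_per_round)


nodes = [Node() for _ in range(4)]
in_flight = []
for origin, target in [(0, 1), (0, 2), (3, 0)]:
    nodes[origin].sent += 1
    in_flight.append(target)

rounds_used = allfenceall(nodes, in_flight)
```

The point of the model is that each round costs one allreduce of two integers regardless of how many peers a node talked to, and that more than one round may be needed while messages are still in flight.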
> Right now we do numprocs 0-byte get operations to make sure the torus
> is flushed on each node. A torus operation is ~3us on a 512-way. It
> grows slowly with number of midplanes. I'm sure a 72 rack longest
> Manhattan distance noncongested pingpong is <10us, but I don't have
> the data in front of me.
> Based on Doug’s email, I had assumed you would know who you have sent
> messages to… If you knew that in a given fence interval the node had
> sent distinct messages to 1K other cores, you would only have 1K gets
> to issue. Suck? Yes. Worse than the tree messages? Maybe, maybe not.
> There is definitely a cross-over between 1 and np outstanding messages
> between fences where on the 1 side of things the tree messages are
> worse and on the np side of things the tree messages are better. There
> is another spectrum based on request size where getting a response for
> each request becomes an inconsequential overhead. I would have to know
> the cost of processing a message, the size of a response, and the cost
> of that response to create a proper graph of that.
> A tree int/sum is roughly 5us on a 512-way and grows similarly. I would
> postulate that a 72 rack MPI allreduce int/sum is on the order of 10us.
> So you generate np*np messages vs 1 tree message. Contention and all
> the overhead of that many messages will be significantly worse than
> even several tree messages.
> Oh, wait, so, you would sum all sent and sum all received and then check if
> they were equal? And then (presumably) iterate until the answer was
> Hrm. That is more interesting. Can you easily separate one-sided and
> two-sided messages in your counting while maintaining the performance
> of one-sided messages?
> Doug’s earlier answer implied you were going to allreduce a vector of
> counts (one per rank) and that would have been ugly. I am assuming you
> would do at least 2 tree messages in what I believe you are describing,
> so there is still a crossover between n*np messages and m tree messages
> (where n is the number of outstanding requests between fencealls and
> 2 <= m <= 10), and the locality of communications impacts that
> crossover…
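For a rough feel of where that crossover sits, here is a back-of-the-envelope model in Python. It borrows the 512-way figures quoted in this thread (~3us per torus operation, ~5us per tree int/sum) and adds assumptions that are mine, not measurements: the n flushing gets issued by a node are fully serialized, and contention and locality are ignored. The function names are made up for illustration.

```python
def pt2pt_flush_cost_us(n_targets, get_latency_us=3.0):
    """Per-node cost of n serialized 0-byte gets (assumes no overlap)."""
    return n_targets * get_latency_us


def tree_flush_cost_us(m_rounds, tree_latency_us=5.0):
    """Cost of m tree allreduce rounds (assumes rounds are back to back)."""
    return m_rounds * tree_latency_us


def crossover_targets(m_rounds, get_latency_us=3.0, tree_latency_us=5.0):
    """Smallest n at which m tree rounds beat n point-to-point gets."""
    tree_cost = tree_flush_cost_us(m_rounds, tree_latency_us)
    n = 1
    while pt2pt_flush_cost_us(n, get_latency_us) <= tree_cost:
        n += 1
    return n


crossover_low = crossover_targets(2)    # m at the low end of 2 <= m <= 10
crossover_high = crossover_targets(10)  # m at the high end
```

Under these assumptions the crossover lands at a handful of targets for m = 2 and somewhat under twenty for m = 10. If gets can be pipelined, the point-to-point side gets cheaper and the crossover moves toward larger n, which is why measured numbers matter here.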
> BTW, can you actually generate messages fast enough to cause contention
> with tiny messages?
> Anytime I know that an operation is collective, I can almost guarantee
> I can do it better than even a good pt2pt algorithm if I am utilizing
> our collective network. I think on machines that have remote
> completion notification an allfenceall() is just a barrier(), and
> since fenceall(); barrier(); is going to be replaced by allfenceall(),
> it doesn't seem to me like it is any extra overhead if allfenceall()
> is just a barrier() for
> My concerns are twofold: 1) we are talking about adding collective
> completion to passive target when active target was the one designed to
> have collective completion. That is semantically and API-wise a bit
> 2) I think the allfenceall() as a collective will optimize to the case
> where you have outstanding requests to everybody and I believe that
> will be slower than the typical case of having outstanding requests to
> some people. I think that users would typically call allfenceall()
> rather than fenceall() + barrier() and then they would see a
> performance paradox: fenceall() + barrier() could be substantially
> faster when you have a “small” number of peers you are communicating
> with in this iteration. I am not at all worried about the overhead of
> allfenceall() for networks with remote completion.
> From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> To: "MPI 3.0 Remote Memory Access working group"
> <mpi3-rma at lists.mpi-forum.org>
> Date: 05/20/2010 09:19 AM
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org