[Mpi3-rma] RMA proposal 1 update
Rajeev Thakur
thakur at mcs.anl.gov
Fri May 21 14:37:31 CDT 2010
Lock-unlock is not collective. If we add an allflushall, how would the
processes that don't call lock-unlock call it? Just directly?
Rajeev
> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org
> [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of
> Underwood, Keith D
> Sent: Friday, May 21, 2010 11:22 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> There is a proposal on the table to add flush(rank) and
> flushall() as local calls for passive target to allow
> incremental remote completion without calling unlock().
> There is an additional proposal to also include
> allflushall(), which would be a collective remote completion
> semantic added to passive target (nominally, we might not
> want to call it passive target anymore if we did that, but
> that is a different discussion). The question I had posed
> was: can you really get a measurable performance advantage
> in realistic usage scenarios by having allflushall() instead
> of just doing flushall() + barrier()? I was asking if
> someone could provide an implementation sketch - preferably
> along with a real usage scenario where that implementation
> sketch would yield a performance advantage. A
> back-of-the-envelope assessment of that would be really
> nice to have.
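>
> A minimal sketch of the two alternatives under discussion, using the
> proposed names from this thread (flushall/allflushall are not
> existing MPI calls; win and comm are placeholders):
>
>     /* Alternative A: local remote completion + explicit collective sync */
>     flushall(win);        /* remotely complete all my outstanding RMA ops */
>     MPI_Barrier(comm);    /* wait until every rank has flushed            */
>
>     /* Alternative B: one collective call that completes and synchronizes */
>     allflushall(win);     /* proposed collective remote completion        */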
>
> My contention was that the number of targets with outstanding
> requests from a given rank between one flush/flushall and the next
> one would frequently not be large enough to justify the cost
> of a collective. Alternatively, if messages in that window
> are "large" (for some definition of "large" that is likely
> less than 4KB and is certainly no larger than the rendezvous
> threshold), I would contend that generating a software ack
> for each one would be essentially zero overhead and would
> allow source side tracking of remote completion such that a
> flushall could be a local operation.
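>
> A rough sketch of that source-side tracking (do_put, issue_put,
> on_ack, and progress are all hypothetical names, not from any
> proposal):
>
>     static long outstanding = 0;     /* ops issued but not yet remotely acked */
>
>     void do_put(int target, const void *buf, size_t len)
>     {
>         outstanding++;               /* count it before injection so an      */
>         issue_put(target, buf, len); /* early ack cannot race the increment  */
>     }
>
>     void on_ack(void)                /* called from the progress engine      */
>     {                                /* when a software ack arrives          */
>         outstanding--;
>     }
>
>     void flushall_local(void)        /* flushall() becomes a purely local wait */
>     {
>         while (outstanding > 0)
>             progress();              /* poll the network; acks drain the count */
>     }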
>
> There is a completely separate discussion that needs to occur
> on whether it is better to add collective completion to
> passive target or one-sided completion to active target (both
> of which are likely to meet some resistance because of the
> impurity they introduce into the model/naming) or whether the
> two need to be mixed at all.
>
> Keith
>
> > -----Original Message-----
> > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> > Sent: Friday, May 21, 2010 7:16 AM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > It is now obvious that I did not do a sufficient job of explaining
> > things before. It's not even clear to me anymore exactly what
> > we're talking about - is this ARMCI? MPI2? MPI3? Active-target
> > one-sided? Passive target? My previous explanation was for BG/P
> > MPI2 active-target one-sided using MPI_Win_fence synchronization.
> > BG/P MPI2 implementations for MPI_Win_post/start/complete/wait and
> > passive-target MPI_Win_lock/unlock use different methods for
> > determining target completion, although internally they may use
> > some/all of the same structures.
> >
> > What was the original question?
> >
> > _______________________________________________
> > Douglas Miller BlueGene Messaging Development
> > IBM Corp., Rochester, MN USA Bldg 030-2 A410
> > dougmill at us.ibm.com Douglas Miller/Rochester/IBM
> >
> >
> >
> > "Underwood, Keith
> > D"
> > <keith.d.underwoo
> > To
> > d at intel.com> "MPI 3.0 Remote Memory Access
> > Sent by: working group"
> > mpi3-rma-bounces@
> <mpi3-rma at lists.mpi-forum.org>
> > lists.mpi-forum.o
> > cc
> > rg
> >
> > Subject
> > Re: [Mpi3-rma] RMA proposal 1
> > 05/20/2010 04:56 update
> > PM
> >
> >
> > Please respond to
> > "MPI 3.0 Remote
> > Memory Access
> > working group"
> > <mpi3-rma at lists.m
> > pi-forum.org>
> >
> >
> >
> >
> >
> >
> > My point was, the way Jeff is doing synchronization in NWChem is
> > via a fenceall(); barrier(); on the equivalent of MPI_COMM_WORLD.
> > If I knew he was going to be primarily doing this (i.e., that he
> > wanted to know that all nodes were synched), I would do something
> > like maintain counts of sent and received messages on each node.
> > I could then do something like an allreduce of those 2 ints over
> > the tree to determine if everyone is synched. There are probably
> > some technical details that would have to be worked out to ensure
> > this works, but it seems good from 10,000 feet.
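> >
> > A rough sketch of that counting scheme in MPI terms (msgs_sent,
> > msgs_received, and progress() are a hypothetical pair of per-rank
> > counters and a hypothetical progress call, not existing API):
> >
> >     extern long msgs_sent, msgs_received;  /* hypothetical per-rank counters */
> >     long counts[2], totals[2];
> >     do {
> >         progress();                   /* drain incoming one-sided traffic   */
> >         counts[0] = msgs_sent;        /* one-sided messages this rank sent  */
> >         counts[1] = msgs_received;    /* one-sided messages this rank got   */
> >         MPI_Allreduce(counts, totals, 2, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
> >     } while (totals[0] != totals[1]); /* synched once the two sums match    */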
> >
> > Right now we do numprocs 0-byte get operations to make sure the
> > torus is flushed on each node. A torus operation is ~3us on a
> > 512-way. It grows slowly with the number of midplanes. I'm sure a
> > 72-rack longest-Manhattan-distance noncongested pingpong is <10us,
> > but I don't have the data in front of me.
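> >
> > For concreteness, a sketch of that current approach (do_get,
> > wait_all_gets, and numprocs are stand-ins for the internal calls,
> > assuming a 0-byte get drains all prior traffic to its target):
> >
> >     for (int r = 0; r < numprocs; r++)
> >         do_get(r, NULL, 0);   /* 0-byte get to rank r, ~3us round trip   */
> >     wait_all_gets();          /* torus now flushed with respect to me    */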
> >
> > Based on Doug's email, I had assumed you would know who you have
> > sent messages to. If you knew that in a given fence interval the
> > node had only sent distinct messages to 1K other cores, you would
> > only have 1K gets to issue. Suck? Yes. Worse than the tree
> > messages? Maybe, maybe not. There is definitely a cross-over
> > between 1 and np outstanding messages between fences, where on the
> > 1 side of things the tree messages are worse and on the np side of
> > things the tree messages are better. There is another spectrum
> > based on request size where getting a response for every request
> > becomes an inconsequential overhead. I would have to know the cost
> > of processing a message, the size of a response, and the cost of
> > generating that response to create a proper graph of that.
> >
> > A tree int/sum is roughly 5us on a 512-way and grows similarly. I
> > would postulate that a 72-rack MPI allreduce int/sum is on the
> > order of 10us.
> >
> > So you generate np*np messages vs. 1 tree message. Contention and
> > all the overhead of that many messages will be significantly worse
> > than even several tree messages.
> >
> > Oh, wait, so, you would sum all sent and sum all received and then
> > check if they were equal? And then (presumably) iterate until the
> > answer was yes? Hrm. That is more interesting. Can you easily
> > separate one-sided and two-sided messages in your counting while
> > maintaining the performance of one-sided messages?
> > Doug's earlier answer implied you were going to allreduce a vector
> > of counts (one per rank), and that would have been ugly. I am
> > assuming you would do at least 2 tree messages in what I believe
> > you are describing, so there is still a crossover between n*np
> > messages and m tree messages (where n is the number of outstanding
> > requests between fencealls and 2 <= m <= 10), and the locality of
> > communications impacts that crossover.
> > BTW, can you actually generate messages fast enough to cause
> > contention with tiny messages?
> > Anytime I know that an operation is collective, I can almost
> > guarantee I can do it better than even a good pt2pt algorithm if I
> > am utilizing our collective network. I think on machines that have
> > remote completion notification an allfenceall() is just a
> > barrier(), and since fenceall(); barrier(); is going to be
> > replaced by allfenceall(), it doesn't seem to me like it is any
> > extra overhead if allfenceall() is just a barrier() for you.
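> >
> > A minimal sketch of that case, assuming the network already tracks
> > remote completion per message (wait_for_remote_completion is a
> > hypothetical local wait on that hardware count; allfenceall is the
> > proposed call, not an existing MPI function):
> >
> >     void allfenceall(MPI_Comm comm)
> >     {
> >         wait_for_remote_completion();  /* local: the NIC already acks      */
> >         MPI_Barrier(comm);             /* the only collective work left    */
> >     }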
> >
> >
> > My concerns are twofold: 1) we are talking about adding collective
> > completion to passive target when active target was the one
> > designed to have collective completion. That is semantically and
> > API-wise a bit ugly. 2) I think that allfenceall() as a collective
> > will optimize for the case where you have outstanding requests to
> > everybody, and I believe that will be slower than the typical case
> > of having outstanding requests to some people. I think that users
> > would typically call allfenceall() rather than fenceall() +
> > barrier(), and then they would see a performance paradox: the
> > fenceall() + barrier() could be substantially faster when you have
> > a small number of peers you are communicating with in this
> > iteration. I am not at all worried about the overhead of
> > allfenceall() for networks with remote completion.
> >
> > Keith
> >
> >
> >
> > From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> >
> > To: "MPI 3.0 Remote Memory Access working group"
> > <mpi3-rma at lists.mpi-forum.org>
> >
> > Date: 05/20/2010 09:19 AM
> >
> > Subject Re: [Mpi3-rma] RMA proposal 1 update
> > :
> >
> > Sent mpi3-rma-bounces at lists.mpi-forum.org
> > by:
> >
> >
> >
> >
> >
> >
> >
> >
>
>
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>