[Mpi3-rma] RMA proposal 1 update
Rajeev Thakur
thakur at mcs.anl.gov
Fri May 21 14:37:31 CDT 2010
Lock-unlock is not collective. If we add an allflushall, how would the
processes that don't call lock-unlock call it? Just directly?
Rajeev
> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org
> [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of
> Underwood, Keith D
> Sent: Friday, May 21, 2010 11:22 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> There is a proposal on the table to add flush(rank) and
> flushall() as local calls for passive target to allow
> incremental remote completion without calling unlock().
> There is an additional proposal to also include
> allflushall(), which would be a collective remote completion
> semantic added to passive target (nominally, we might not
> want to call it passive target anymore if we did that, but
> that is a different discussion). The question I had posed
> was: can you really get a measurable performance advantage
> in realistic usage scenarios by having allflushall() instead
> of just doing flushall() + barrier()? I was asking if
> someone could provide an implementation sketch - preferably
> along with a real usage scenario where that implementation
> sketch would yield a performance advantage. A
> back-of-the-envelope assessment of that would be really
> nice to have.
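>
> A minimal sketch of the two alternatives under discussion, using the
> proposed names from this thread (flushall/allflushall are not
> existing MPI calls; win and comm are placeholders):
>
>     /* Alternative A: local remote completion + explicit collective sync */
>     flushall(win);        /* remotely complete all my outstanding RMA ops */
>     MPI_Barrier(comm);    /* wait until every rank has flushed            */
>
>     /* Alternative B: one collective call that completes and synchronizes */
>     allflushall(win);     /* proposed collective remote completion        */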
>
> My contention was that the number of targets with outstanding
> requests from a given rank between one flush/flushall and the next
> one would frequently not be large enough to justify the cost
> of a collective. Alternatively, if messages in that window
> are "large" (for some definition of "large" that is likely
> less than 4KB and is certainly no larger than the rendezvous
> threshold), I would contend that generating a software ack
> for each one would be essentially zero overhead and would
> allow source side tracking of remote completion such that a
> flushall could be a local operation.
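>
> A rough sketch of that source-side tracking (do_put, issue_put,
> on_ack, and progress are all hypothetical names, not from any
> proposal):
>
>     static long outstanding = 0;     /* ops issued but not yet remotely acked */
>
>     void do_put(int target, const void *buf, size_t len)
>     {
>         outstanding++;               /* count it before injection so an      */
>         issue_put(target, buf, len); /* early ack cannot race the increment  */
>     }
>
>     void on_ack(void)                /* called from the progress engine      */
>     {                                /* when a software ack arrives          */
>         outstanding--;
>     }
>
>     void flushall_local(void)        /* flushall() becomes a purely local wait */
>     {
>         while (outstanding > 0)
>             progress();              /* poll the network; acks drain the count */
>     }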
>
> There is a completely separate discussion that needs to occur
> on whether it is better to add collective completion to
> passive target or one-sided completion to active target (both
> of which are likely to meet some resistance because of the
> impurity they introduce into the model/naming) or whether the
> two need to be mixed at all.
>
> Keith
>
> > -----Original Message-----
> > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> > Sent: Friday, May 21, 2010 7:16 AM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > It is now obvious that I did not do a sufficient job of explaining
> > things before. It's not even clear to me anymore exactly what
> > we're talking about - is this ARMCI? MPI2? MPI3? Active-target
> > one-sided? Passive target? My previous explanation was for BG/P
> > MPI2 active-target one-sided using MPI_Win_fence synchronization.
> > BG/P MPI2 implementations for MPI_Win_post/start/complete/wait and
> > passive-target MPI_Win_lock/unlock use different methods for
> > determining target completion, although internally they may use
> > some/all of the same structures.
> >
> > What was the original question?
> >
> > _______________________________________________
> > Douglas Miller BlueGene Messaging Development
> > IBM Corp., Rochester, MN USA Bldg 030-2 A410
> > dougmill at us.ibm.com Douglas Miller/Rochester/IBM
> >
> >
> >
> > "Underwood, Keith
> > D"
> > <keith.d.underwoo
> > To
> > d at intel.com> "MPI 3.0 Remote Memory Access
> > Sent by: working group"
> > mpi3-rma-bounces@
> <mpi3-rma at lists.mpi-forum.org>
> > lists.mpi-forum.o
> > cc
> > rg
> >
> > Subject
> > Re: [Mpi3-rma] RMA proposal 1
> > 05/20/2010 04:56 update
> > PM
> >
> >
> > Please respond to
> > "MPI 3.0 Remote
> > Memory Access
> > working group"
> > <mpi3-rma at lists.m
> > pi-forum.org>
> >
> >
> >
> >
> >
> >
> > My point was, the way Jeff is doing synchronization in NWChem is
> > via a fenceall(); barrier(); on the equivalent of MPI_COMM_WORLD.
> > If I knew he was going to be primarily doing this (i.e., that he
> > wanted to know that all nodes were synched), I would do something
> > like maintain counts of sent and received messages on each node.
> > I could then do something like an allreduce of those 2 ints over
> > the tree to determine if everyone is synched. There are probably
> > some technical details that would have to be worked out to ensure
> > this works, but it seems good from 10,000 feet.
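> >
> > A rough sketch of that counting scheme in MPI terms (msgs_sent,
> > msgs_received, and progress() are a hypothetical pair of per-rank
> > counters and a hypothetical progress call, not existing API):
> >
> >     extern long msgs_sent, msgs_received;  /* hypothetical per-rank counters */
> >     long counts[2], totals[2];
> >     do {
> >         progress();                   /* drain incoming one-sided traffic   */
> >         counts[0] = msgs_sent;        /* one-sided messages this rank sent  */
> >         counts[1] = msgs_received;    /* one-sided messages this rank got   */
> >         MPI_Allreduce(counts, totals, 2, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
> >     } while (totals[0] != totals[1]); /* synched once the two sums match    */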
> >
> > Right now we do numprocs 0-byte get operations to make sure the
> > torus is flushed on each node. A torus operation is ~3us on a
> > 512-way. It grows slowly with the number of midplanes. I'm sure a
> > 72-rack longest-Manhattan-distance noncongested pingpong is <10us,
> > but I don't have the data in front of me.
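> >
> > For concreteness, a sketch of that current approach (do_get,
> > wait_all_gets, and numprocs are stand-ins for the internal calls,
> > assuming a 0-byte get drains all prior traffic to its target):
> >
> >     for (int r = 0; r < numprocs; r++)
> >         do_get(r, NULL, 0);   /* 0-byte get to rank r, ~3us round trip   */
> >     wait_all_gets();          /* torus now flushed with respect to me    */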
> >
> > Based on Doug's email, I had assumed you would know who you have
> > sent messages to. If you knew that in a given fence interval the
> > node had only sent distinct messages to 1K other cores, you would
> > only have 1K gets to issue. Suck? Yes. Worse than the tree
> > messages? Maybe, maybe not. There is definitely a cross-over
> > between 1 and np outstanding messages between fences, where on the
> > 1 side of things the tree messages are worse and on the np side of
> > things the tree messages are better. There is another spectrum
> > based on request size where getting a response for every request
> > becomes an inconsequential overhead. I would have to know the cost
> > of processing a message, the size of a response, and the cost of
> > generating that response to create a proper graph of that.
> >
> > A tree int/sum is roughly 5us on a 512-way and grows similarly. I
> > would postulate that a 72-rack MPI allreduce int/sum is on the
> > order of 10us.
> >
> > So you generate np*np messages vs. 1 tree message. Contention and
> > all the overhead of that many messages will be significantly worse
> > than even several tree messages.
> >
> > Oh, wait, so, you would sum all sent and sum all received and then
> > check if they were equal? And then (presumably) iterate until the
> > answer was yes? Hrm. That is more interesting. Can you easily
> > separate one-sided and two-sided messages in your counting while
> > maintaining the performance of one-sided messages?
> > Doug's earlier answer implied you were going to allreduce a vector
> > of counts (one per rank), and that would have been ugly. I am
> > assuming you would do at least 2 tree messages in what I believe
> > you are describing, so there is still a crossover between n*np
> > messages and m tree messages (where n is the number of outstanding
> > requests between fencealls and 2 <= m <= 10), and the locality of
> > communications impacts that crossover.
> > BTW, can you actually generate messages fast enough to cause
> > contention with tiny messages?
> > Anytime I know that an operation is collective, I can almost
> > guarantee I can do it better than even a good pt2pt algorithm if I
> > am utilizing our collective network. I think on machines that have
> > remote completion notification an allfenceall() is just a
> > barrier(), and since fenceall(); barrier(); is going to be
> > replaced by allfenceall(), it doesn't seem to me like it is any
> > extra overhead if allfenceall() is just a barrier() for you.
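> >
> > A minimal sketch of that case, assuming the network already tracks
> > remote completion per message (wait_for_remote_completion is a
> > hypothetical local wait on that hardware count; allfenceall is the
> > proposed call, not an existing MPI function):
> >
> >     void allfenceall(MPI_Comm comm)
> >     {
> >         wait_for_remote_completion();  /* local: the NIC already acks      */
> >         MPI_Barrier(comm);             /* the only collective work left    */
> >     }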
> >
> >
> > My concerns are twofold: 1) we are talking about adding collective
> > completion to passive target when active target was the one
> > designed to have collective completion. That is semantically and
> > API-wise a bit ugly. 2) I think that allfenceall() as a collective
> > will optimize for the case where you have outstanding requests to
> > everybody, and I believe that will be slower than the typical case
> > of having outstanding requests to some people. I think that users
> > would typically call allfenceall() rather than fenceall() +
> > barrier(), and then they would see a performance paradox: the
> > fenceall() + barrier() could be substantially faster when you have
> > a small number of peers you are communicating with in this
> > iteration. I am not at all worried about the overhead of
> > allfenceall() for networks with remote completion.
> >
> > Keith
> >
> >
> >
> > From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> >
> > To: "MPI 3.0 Remote Memory Access working group"
> > <mpi3-rma at lists.mpi-forum.org>
> >
> > Date: 05/20/2010 09:19 AM
> >
> > Subject Re: [Mpi3-rma] RMA proposal 1 update
> > :
> >
> > Sent mpi3-rma-bounces at lists.mpi-forum.org
> > by:
> >
> >
> >
> >
> >
> >
> >
> >
>
>
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>