[Mpi3-rma] RMA proposal 1 update

Barrett, Brian W bwbarre at sandia.gov
Fri May 21 15:05:28 CDT 2010


I don't know about illegal as much as "doesn't make sense".  Keith brought this up, but I think it got lost in other discussions...  What are the semantics of a collective operation during a point-to-point epoch?  We're constrained by the existing API (as we've decided not to write a new API).

I can see a couple of scenarios for allflushall and passive target, none of which I like.  The first is that allflushall only flushes the peers to which the caller currently holds the window open.  This means tracking such state, which I think is a bad idea because of the state required.  The second option is that it's erroneous to call allflushall unless there's a passive access epoch to every peer in the window.  This seems to encourage a behavior I don't like (namely, having to keep such an epoch open at all times).
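
For concreteness, the second option forces something like this on the
user (a sketch only: MPIX_Allflushall is the proposed call, not an
existing one, and it assumes the implementation permits concurrent
shared-lock epochs to multiple targets):

    /* hold a passive access epoch to every rank in the window
       for as long as allflushall might be called */
    for (int r = 0; r < np; r++)
        MPI_Win_lock(MPI_LOCK_SHARED, r, 0, win);

    /* ... MPI_Put/MPI_Get traffic ... */
    MPIX_Allflushall(win);      /* hypothetical collective flush */
    /* ... more traffic ... */

    for (int r = 0; r < np; r++)
        MPI_Win_unlock(r, win);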

I could possibly see the benefit of an allflushall in active target, where group semantics are better defined, but that's an entirely different discussion.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories
________________________________________
From: mpi3-rma-bounces at lists.mpi-forum.org [mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Underwood, Keith D [keith.d.underwood at intel.com]
Sent: Friday, May 21, 2010 1:44 PM
To: MPI 3.0 Remote Memory Access working group
Subject: Re: [Mpi3-rma] RMA proposal 1 update

That's an interesting point.  I would think that would be illegal?

Keith

> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> bounces at lists.mpi-forum.org] On Behalf Of Rajeev Thakur
> Sent: Friday, May 21, 2010 1:38 PM
> To: 'MPI 3.0 Remote Memory Access working group'
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> Lock-unlock is not collective. If we add an allflushall, how would the
> processes that don't call lock-unlock call it? Just directly?
>
> Rajeev
>
>
> > -----Original Message-----
> > From: mpi3-rma-bounces at lists.mpi-forum.org
> > [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of
> > Underwood, Keith D
> > Sent: Friday, May 21, 2010 11:22 AM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > There is a proposal on the table to add flush(rank) and
> > flushall() as local calls for passive target to allow
> > incremental remote completion without calling unlock().
> > There is an additional proposal to also include
> > allflushall(), which would be a collective remote completion
> > semantic added to passive target (nominally, we might not
> > want to call it passive target anymore if we did that, but
> > that is a different discussion).  The question I had posed
> > was:  can you really get a measurable performance advantage
> > in realistic usage scenarios by having allflushall() instead
> > of just doing flushall() + barrier()?  I was asking if
> > someone could provide an implementation sketch - preferably
> > along with a real usage scenario where that implementation
> > sketch would yield a performance advantage.  A
> > back-of-the-envelope assessment of that would be really nice
> > to have.
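> >
> > (For concreteness, the two alternatives look like this, with
> > MPIX_ marking the proposed, not-yet-standard calls:
> >
> >     /* local flush for remote completion, then a sync */
> >     MPIX_Win_flushall(win);
> >     MPI_Barrier(comm);
> >
> >     /* versus one collective that fuses the two */
> >     MPIX_Allflushall(win);
> >
> > and the question is whether the fused form wins by enough to
> > justify a new collective.)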
> >
> > My contention was that the number of targets with outstanding
> > requests from a given process between one flush/flushall and
> > the next would frequently not be large enough to justify the
> > cost of a collective.  Alternatively, if messages in that
> > window are "large" (for some definition of "large" that is
> > likely less than 4KB and is certainly no larger than the
> > rendezvous threshold), I would contend that generating a
> > software ack for each one would be essentially zero overhead
> > and would allow source-side tracking of remote completion such
> > that a flushall could be a local operation.
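> >
> > (A sketch of that source-side tracking, assuming hypothetical
> > per-window counters: issued_ops is bumped when an operation is
> > issued, acked_ops is bumped by the software-ack handler:
> >
> >     /* flushall becomes purely local: spin until every
> >        issued operation has been acked by its target */
> >     while (win->acked_ops < win->issued_ops)
> >         progress();  /* hypothetical: drain network, count acks */
> >
> > No new remote traffic is generated at flush time; the acks ride
> > behind the "large" payloads.)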
> >
> > There is a completely separate discussion that needs to occur
> > on whether it is better to add collective completion to
> > passive target or one-sided completion to active target (both
> > of which are likely to meet some resistance because of the
> > impurity they introduce into the model/naming), or whether the
> > two need to be mixed at all.
> >
> > Keith
> >
> > > -----Original Message-----
> > > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > > bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> > > Sent: Friday, May 21, 2010 7:16 AM
> > > To: MPI 3.0 Remote Memory Access working group
> > > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> > >
> > > It is now obvious that I did not do a sufficient job of
> > > explaining things before. It's not even clear to me anymore
> > > exactly what we're talking about: is this ARMCI? MPI2? MPI3?
> > > Active-target one-sided? Passive target? My previous
> > > explanation was for BG/P MPI2 active-target one-sided using
> > > MPI_Win_fence synchronization. The BG/P MPI2 implementations
> > > of MPI_Win_post/start/complete/wait and passive-target
> > > MPI_Win_lock/unlock use different methods for determining
> > > target completion, although internally they may use some or
> > > all of the same structures.
> > >
> > > What was the original question?
> > >
> > > _______________________________________________
> > > Douglas Miller                  BlueGene Messaging Development
> > > IBM Corp., Rochester, MN USA                     Bldg 030-2 A410
> > > dougmill at us.ibm.com               Douglas Miller/Rochester/IBM
> > >
> > >
> > >
> > > From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> > > Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> > > To: "MPI 3.0 Remote Memory Access working group"
> > >     <mpi3-rma at lists.mpi-forum.org>
> > > Date: 05/20/2010 04:56 PM
> > > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> > >
> > > My point was, the way Jeff is doing synchronization in NWChem
> > > is via a fenceall(); barrier(); on the equivalent of
> > > MPI_COMM_WORLD.  If I knew he was going to be primarily doing
> > > this (i.e., that he wanted to know that all nodes were
> > > synched), I would do something like maintain counts of sent
> > > and received messages on each node. I could then do something
> > > like an allreduce of those 2 ints over the tree to determine
> > > if everyone is synched.  There are probably some technical
> > > details that would have to be worked out to ensure this
> > > works, but it seems good from 10,000 feet.
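> > >
> > > (A minimal sketch of that idea, assuming internal msgs_sent /
> > > msgs_received counters and a progress() call exist; it has to
> > > iterate because messages can be in flight when the counters
> > > are sampled:
> > >
> > >     long local[2], sums[2];
> > >     do {
> > >         progress();               /* drain incoming traffic */
> > >         local[0] = msgs_sent;
> > >         local[1] = msgs_received;
> > >         MPI_Allreduce(local, sums, 2, MPI_LONG, MPI_SUM,
> > >                       MPI_COMM_WORLD);
> > >     } while (sums[0] != sums[1]);   /* quiet when equal */
> > > )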
> > >
> > > Right now we do numprocs 0-byte get operations to make sure
> > > the torus is flushed on each node. A torus operation is ~3us
> > > on a 512-way. It grows slowly with the number of midplanes.
> > > I'm sure a 72-rack longest-Manhattan-distance noncongested
> > > pingpong is <10us, but I don't have the data in front of me.
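> > >
> > > (Roughly this pattern; it relies on the BG/P property that a
> > > 0-byte get to a rank completes behind all earlier traffic to
> > > that rank:
> > >
> > >     for (int r = 0; r < numprocs; r++)
> > >         MPI_Get(NULL, 0, MPI_BYTE, r, 0, 0, MPI_BYTE, win);
> > >     /* when these gets complete at the next synchronization,
> > >        the torus has drained */
> > > )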
> > >
> > > Based on Doug's email, I had assumed you would know who you
> > > have sent messages to.  If you knew that in a given fence
> > > interval the node had only sent distinct messages to 1K other
> > > cores, you would only have 1K gets to issue.  Suck?  Yes.
> > > Worse than the tree messages?  Maybe, maybe not.  There is
> > > definitely a cross-over between 1 and np outstanding messages
> > > between fences: on the 1 side of things the tree messages are
> > > worse, and on the np side of things the tree messages are
> > > better.  There is another spectrum based on request size,
> > > where getting a response for every request becomes an
> > > inconsequential overhead.  I would have to know the cost of
> > > processing a message, the size of a response, and the cost of
> > > generating that response to create a proper graph of that.
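> > >
> > > (Back of the envelope: per-target completion costs about
> > > n * t_get and the tree scheme costs m * t_tree, so the tree
> > > wins once n > m * t_tree / t_get.  With the ~3us get and ~5us
> > > tree numbers in this thread and m = 2, that crossover is only
> > > three or four outstanding targets, though pipelining the gets
> > > and the cost of generating responses both push it higher.)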
> > >
> > > A tree int/sum is roughly 5us on a 512-way and grows
> > > similarly. I would postulate that a 72-rack MPI allreduce
> > > int/sum is on the order of 10us.
> > >
> > > So you generate np*np messages vs 1 tree message. Contention
> > > and all the overhead of that many messages will be
> > > significantly worse than even several tree messages.
> > >
> > > Oh, wait, so, you would sum all sent and sum all received and
> > > then check if they were equal?  And then (presumably) iterate
> > > until the answer was yes?  Hrm.  That is more interesting.
> > > Can you easily separate one-sided and two-sided messages in
> > > your counting while maintaining the performance of one-sided
> > > messages?  Doug's earlier answer implied you were going to
> > > allreduce a vector of counts (one per rank), and that would
> > > have been ugly.  I am assuming you would do at least 2 tree
> > > messages in what I believe you are describing, so there is
> > > still a crossover between n*np messages and m tree messages
> > > (where n is the number of outstanding requests between
> > > fencealls and 2 <= m <= 10), and the locality of
> > > communications impacts that crossover.  BTW, can you actually
> > > generate messages fast enough to cause contention with tiny
> > > messages?
> > > Anytime I know that an operation is collective, I can almost
> > > guarantee I can do it better than even a good pt2pt algorithm
> > > if I am utilizing our collective network. I think on machines
> > > that have remote completion notification an allfenceall() is
> > > just a barrier(), and since fenceall(); barrier(); is going
> > > to be replaced by allfenceall(), it doesn't seem to me like
> > > it is any extra overhead if allfenceall() is just a barrier()
> > > for you.
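> > >
> > > (On such a network the hypothetical call collapses to a
> > > barrier; win_comm() stands in for however the implementation
> > > recovers the communicator the window was created over:
> > >
> > >     int MPIX_Allfenceall(MPI_Win win) {
> > >         /* remote completion is already known locally */
> > >         return MPI_Barrier(win_comm(win));
> > >     }
> > > )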
> > >
> > >
> > > My concerns are twofold: 1) we are talking about adding
> > > collective completion to passive target when active target
> > > was the one designed to have collective completion.  That is
> > > semantically and API-wise a bit ugly.  2) I think
> > > allfenceall() as a collective will optimize to the case where
> > > you have outstanding requests to everybody, and I believe
> > > that will be slower than the typical case of having
> > > outstanding requests to some people.  I think that users
> > > would typically call allfenceall() rather than fenceall() +
> > > barrier(), and then they would see a performance paradox: the
> > > fenceall() + barrier() could be substantially faster when you
> > > have a "small" number of peers you are communicating with in
> > > this iteration.  I am not at all worried about the overhead
> > > of allfenceall() for networks with remote completion.
> > > Keith
> > >
> > >
> > >
> > >  From:    "Underwood, Keith D" <keith.d.underwood at intel.com>
> > >  To:      "MPI 3.0 Remote Memory Access working group"
> > >           <mpi3-rma at lists.mpi-forum.org>
> > >  Date:    05/20/2010 09:19 AM
> > >  Subject: Re: [Mpi3-rma] RMA proposal 1 update
> > >  Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> >
> >

_______________________________________________
mpi3-rma mailing list
mpi3-rma at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma



