[Mpi3-rma] RMA proposal 1 update

Underwood, Keith D keith.d.underwood at intel.com
Fri May 21 15:27:17 CDT 2010


I think (although I am not an ARMCI expert) that passive target with the lockall/unlockall and flush/flushall extensions we have proposed gives you (approximately) ARMCI.  The question, then, is where the allflushall needed to support GA belongs.  ARMCI is actually an easier target than GA, because (like many models) ARMCI only has one-sided completion.  But I thought you wanted better than ARMCI?
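
A rough sketch of that ARMCI-like usage under the proposal, writing the
proposed lockall/flush calls with MPI_Win_-style names for concreteness
(these are proposed extensions, not existing MPI-2 calls; requires
<mpi.h>):

    /* Roughly ARMCI_Put + ARMCI_Fence(target) under the proposed
       passive-target extensions. */
    void put_and_fence(MPI_Win win, double *buf, int n, int target,
                       MPI_Aint disp)
    {
        MPI_Win_lock_all(0, win);     /* access epoch to every target   */
        MPI_Put(buf, n, MPI_DOUBLE, target, disp, n, MPI_DOUBLE, win);
        MPI_Win_flush(target, win);   /* remote completion at 'target'  */
        MPI_Win_unlock_all(win);      /* end the epoch                  */
    }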

The struggle here is that GA has both one-sided and collective completion.  Collective completion can obviously be emulated by one-sided completion + a barrier, but you have indicated that that is a performance issue.  Unfortunately, there is no obvious place inside the existing MPI RMA interface where a mixture of collective and one-sided completion fits.  The need for both one-sided and collective completion was clearly not architected into MPI one-sided and may very well break the architecture.  It certainly breaks the naming (there is nothing "passive" about a target that calls allfenceall, and there is nothing "active" about a target when an initiator does one-sided completion).
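
For reference, the emulation in question is just the following pattern,
again using the MPI_Win_-style spelling of the proposed flushall call
and assuming a lockall-style epoch is already open on the window; the
performance question is whether a dedicated allflushall collective can
beat it:

    /* GA-style collective completion via one-sided completion + barrier. */
    void collective_fence(MPI_Win win, MPI_Comm comm)
    {
        MPI_Win_flush_all(win);   /* my outstanding ops complete remotely */
        MPI_Barrier(comm);        /* everyone knows everyone has flushed  */
    }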

Does anybody know of another model (other than GA) that calls for a mixture of collective and one-sided completion?  CoArray Fortran uses collective completion, UPC expects one-sided completion, SHMEM only exposes one-sided completion, ARMCI only exposes one-sided completion...  If we could look at a second model that needed a mixture, it might help us formulate a better solution.

Keith

> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> bounces at lists.mpi-forum.org] On Behalf Of Jeff Hammond
> Sent: Friday, May 21, 2010 2:11 PM
> To: MPI 3.0 Remote Memory Access working group
> Cc: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> Please explain how active target gives me ARMCI.
>
> Jeff
>
> Sent from my iPhone
>
> On May 21, 2010, at 3:05 PM, "Barrett, Brian W" <bwbarre at sandia.gov>
> wrote:
>
> > I don't know about illegal as much as "doesn't make sense".  Keith
> > brought this up, but I think it got lost in other discussions...
> > What are the semantics of a collective operation during a point-to-
> > point epoch?  We're constrained by the existing API (as we've
> > decided not to write a new API).
> >
> > I can see a couple of scenarios for allflushall and passive target,
> > none of which I like.  The first is that allflushall only flushes
> > with peers it currently holds a window open to.  This means tracking
> > such state, which I think is a bad idea due to the state required.  The
> > second option is that it's erroneous to call allflushall unless
> > there's a passive access epoch to every peer in the window.  This
> > seems to encourage a behavior I don't like (namely, having to have
> > such an epoch open at all times).
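
A minimal sketch of that second option, using MPI_Win_lock_all /
MPI_Win_unlock_all for the proposed lockall/unlockall calls and a
hypothetical MPI_Win_allflush_all for the proposed collective (none of
these are existing MPI calls):

    int MPI_Win_allflush_all(MPI_Win win);  /* hypothetical prototype */

    void phase(MPI_Win win)
    {
        MPI_Win_lock_all(0, win);       /* epoch to every peer, held open */
        /* ... one-sided puts/gets/accumulates ... */
        MPI_Win_allflush_all(win);      /* proposed collective completion */
        /* ... more communication ... */
        MPI_Win_unlock_all(win);
    }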
> >
> > I could possibly see the benefit of an allflushall in an active
> > target, where group semantics are a bit better defined, but
> > that's an entirely different discussion.
> >
> > Brian
> >
> > --
> >  Brian W. Barrett
> >  Scalable System Software Group
> >  Sandia National Laboratories
> > ________________________________________
> > From: mpi3-rma-bounces at lists.mpi-forum.org [mpi3-rma-
> > bounces at lists.mpi-forum.org] On Behalf Of Underwood, Keith D
> > [keith.d.underwood at intel.com]
> > Sent: Friday, May 21, 2010 1:44 PM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > That's an interesting point.  I would think that would be illegal?
> >
> > Keith
> >
> >> -----Original Message-----
> >> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> >> bounces at lists.mpi-forum.org] On Behalf Of Rajeev Thakur
> >> Sent: Friday, May 21, 2010 1:38 PM
> >> To: 'MPI 3.0 Remote Memory Access working group'
> >> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >>
> >> Lock-unlock is not collective. If we add an allflushall, how would
> >> the
> >> processes that don't call lock-unlock call it? Just directly?
> >>
> >> Rajeev
> >>
> >>
> >>> -----Original Message-----
> >>> From: mpi3-rma-bounces at lists.mpi-forum.org
> >>> [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of
> >>> Underwood, Keith D
> >>> Sent: Friday, May 21, 2010 11:22 AM
> >>> To: MPI 3.0 Remote Memory Access working group
> >>> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >>>
> >>> There is a proposal on the table to add flush(rank) and
> >>> flushall() as local calls for passive target to allow
> >>> incremental remote completion without calling unlock().
> >>> There is an additional proposal to also include
> >>> allflushall(), which would be a collective remote completion
> >>> semantic added to passive target (nominally, we might not
> >>> want to call it passive target anymore if we did that, but
> >>> that is a different discussion).  The question I had posed
> >>> was:  can you really get a measurable performance advantage
> >>> in realistic usage scenarios by having allflushall() instead
> >>> of just doing flushall() + barrier()?  I was asking if
> >>> someone could provide an implementation sketch - preferably
> >>> along with a real usage scenario where that implementation
> >>> sketch would yield a performance advantage.  A
> >>> back-of-the-envelope assessment of that would be really
> >>> nice to have.
> >>>
> >>> My contention was that the number of targets with outstanding
> >>> requests from a given process between one flush/flushall and the next
> >>> one would frequently not be large enough to justify the cost
> >>> of a collective.  Alternatively, if messages in that window
> >>> are "large" (for some definition of "large" that is likely
> >>> less than 4KB and is certainly no larger than the rendezvous
> >>> threshold), I would contend that generating a software ack
> >>> for each one would be essentially zero overhead and would
> >>> allow source side tracking of remote completion such that a
> >>> flushall could be a local operation.
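
A minimal sketch of that source-side tracking idea, assuming
hypothetical per-window counters and a hypothetical progress call
(neither is part of any existing API):

    extern long ops_issued;            /* one-sided operations issued locally */
    extern long acks_received;         /* software acks returned by targets   */
    extern void poll_network(void);    /* hypothetical progress function      */

    /* If every request is acked, flushall is a purely local wait. */
    void flushall_local(void)
    {
        while (acks_received < ops_issued)
            poll_network();
    }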
> >>>
> >>> There is a completely separate discussion that needs to occur
> >>> on whether it is better to add collective completion to
> >>> passive target or one-sided completion to active target (both
> >>> of which are likely to meet some resistance because of the
> >>> impurity they introduce into the model/naming) or whether the
> >>> two need to be mixed at all.
> >>>
> >>> Keith
> >>>
> >>>> -----Original Message-----
> >>>> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> >>>> bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> >>>> Sent: Friday, May 21, 2010 7:16 AM
> >>>> To: MPI 3.0 Remote Memory Access working group
> >>>> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >>>>
> >>>> It is now obvious that I did not do a sufficient job of explaining
> >>>> things before. It's not even clear to me anymore exactly what we're
> >>>> talking about - is this ARMCI? MPI2? MPI3? Active-target one-sided?
> >>>> Passive target? My previous explanation was for BG/P MPI2
> >>>> active-target one-sided using MPI_Win_fence synchronization. BG/P
> >>>> MPI2 implementations for MPI_Win_post/start/complete/wait and
> >>>> passive-target MPI_Win_lock/unlock use different methods for
> >>>> determining target completion, although internally they may use
> >>>> some/all of the same structures.
> >>>>
> >>>> What was the original question?
> >>>>
> >>>> _______________________________________________
> >>>> Douglas Miller                  BlueGene Messaging Development
> >>>> IBM Corp., Rochester, MN USA                     Bldg 030-2 A410
> >>>> dougmill at us.ibm.com               Douglas Miller/Rochester/IBM
> >>>>
> >>>>
> >>>>
> >>>> From:    "Underwood, Keith D" <keith.d.underwood at intel.com>
> >>>> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> >>>> To:      "MPI 3.0 Remote Memory Access working group"
> >>>>          <mpi3-rma at lists.mpi-forum.org>
> >>>> Date:    05/20/2010 04:56 PM
> >>>> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >>>>
> >>>> My point was, the way Jeff is doing synchronization in NWChem is
> >>>> via a fenceall(); barrier(); on the equivalent of MPI_COMM_WORLD.
> >>>> If I knew he was going to be primarily doing this (i.e., that he
> >>>> wanted to know that all nodes were synched), I would do something
> >>>> like maintain counts of sent and received messages on each node.
> >>>> I could then do something like an allreduce of those 2 ints over
> >>>> the tree to determine if everyone is synched.  There are probably
> >>>> some technical details that would have to be worked out to ensure
> >>>> this works, but it seems good from 10000 feet.
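
A minimal sketch of the counting scheme described in the quoted text
above, assuming hypothetical per-node counters maintained by the
runtime (messages_sent, messages_received):

    extern long messages_sent;       /* one-sided messages injected locally */
    extern long messages_received;   /* one-sided messages arrived locally  */

    /* The network is quiescent once the global number of messages
       received equals the global number sent; iterate until they match. */
    void wait_for_quiescence(MPI_Comm comm)
    {
        long local[2], global[2];
        do {
            local[0] = messages_sent;
            local[1] = messages_received;
            MPI_Allreduce(local, global, 2, MPI_LONG, MPI_SUM, comm);
        } while (global[0] != global[1]);
    }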
> >>>>
> >>>> Right now we do numprocs 0-byte get operations to make sure the
> >>>> torus is flushed on each node.  A torus operation is ~3us on a
> >>>> 512-way.  It grows slowly with number of midplanes.  I'm sure a
> >>>> 72 rack longest Manhattan distance noncongested pingpong is <10us,
> >>>> but I don't have the data in front of me.
> >>>>
> >>>> Based on Doug's email, I had assumed you would know who you have
> >>>> sent messages to.  If you knew that in a given fence interval the
> >>>> node had only sent distinct messages to 1K other cores, you would
> >>>> only have 1K gets to issue.  Suck?  Yes.  Worse than the tree
> >>>> messages?  Maybe, maybe not.  There is definitely a cross-over
> >>>> between 1 and np outstanding messages between fences where on the
> >>>> 1 side of things the tree messages are worse and on the np side of
> >>>> things the tree messages are better.  There is another spectrum
> >>>> based on request size where getting a response for every request
> >>>> becomes an inconsequential overhead.  I would have to know the
> >>>> cost of processing a message, the size of a response, and the cost
> >>>> of generating that response to create a proper graph of that.
> >>>>
> >>>> A tree int/sum is roughly 5us on a 512-way and grows similarly.
> >>>> I would postulate that a 72 rack MPI allreduce int/sum is on the
> >>>> order of 10us.
> >>>>
> >>>> So you generate np*np messages vs 1 tree message.  Contention and
> >>>> all the overhead of that many messages will be significantly worse
> >>>> than even several tree messages.
> >>>>
> >>>> Oh, wait, so, you would sum all sent and sum all received and then
> >>>> check if they were equal?  And then (presumably) iterate until the
> >>>> answer was yes?  Hrm.  That is more interesting.  Can you easily
> >>>> separate one-sided and two-sided messages in your counting while
> >>>> maintaining the performance of one-sided messages?
> >>>> Doug's earlier answer implied you were going to allreduce a vector
> >>>> of counts (one per rank) and that would have been ugly.  I am
> >>>> assuming you would do at least 2 tree messages in what I believe
> >>>> you are describing, so there is still a crossover between n*np
> >>>> messages and m tree messages (where n is the number of outstanding
> >>>> requests between fencealls and 2 <= m <= 10), and the locality of
> >>>> communications impacts that crossover.
> >>>>
> >>>> BTW, can you actually generate messages fast enough to cause
> >>>> contention with tiny messages?
> >>>> Anytime I know that an operation is collective, I can almost
> >>>> guarantee I can do it better than even a good pt2pt algorithm if I
> >>>> am utilizing our collective network.  I think on machines that
> >>>> have remote completion notification an allfenceall() is just a
> >>>> barrier(), and since fenceall(); barrier(); is going to be
> >>>> replaced by allfenceall(), it doesn't seem to me like it is any
> >>>> extra overhead if allfenceall() is just a barrier() for you.
> >>>>
> >>>>
> >>>> My concerns are twofold:  1) we are talking about adding
> >>>> collective completion to passive target when active target was the
> >>>> one designed to have collective completion.  That is semantically
> >>>> and API-wise a bit ugly.  2) I think that allfenceall() as a
> >>>> collective will optimize for the case where you have outstanding
> >>>> requests to everybody, and I believe that will be slower than the
> >>>> typical case of having outstanding requests to some people.  I
> >>>> think that users would typically call allfenceall() rather than
> >>>> fenceall() + barrier(), and then they would see a performance
> >>>> paradox: the fenceall() + barrier() could be substantially faster
> >>>> when you have a "small" number of peers you are communicating with
> >>>> in this iteration.  I am not at all worried about the overhead of
> >>>> allfenceall() for networks with remote completion.
> >>>>
> >>>> Keith
> >>>>
> >>>>
> >>>>
> >>>> From:    "Underwood, Keith D" <keith.d.underwood at intel.com>
> >>>> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> >>>> To:      "MPI 3.0 Remote Memory Access working group"
> >>>>          <mpi3-rma at lists.mpi-forum.org>
> >>>> Date:    05/20/2010 09:19 AM
> >>>> Subject: Re: [Mpi3-rma] RMA proposal 1 update



