[Mpi3-rma] RMA proposal 1 update

Underwood, Keith D keith.d.underwood at intel.com
Tue May 18 11:59:31 CDT 2010


Just because you may need access to the entire array doesn't mean that you will access the entire array in every interval between calls to flushall().  The choices other than flush()/flushall() are not pretty for completion, though.  

Actually, HPCC RandomAccess on a 1M rank system is a perfect example of where I doubt that allflushall() is going to be efficient this way.  In an interval, you may need to access any node, but you are going to access (at most) 1024 other nodes.  Now, if you call allflushall(), which is the obvious thing to do, it seems like it would be slower than calling flushall()+barrier().  After all, the source knows that it only needs to check the status at 1024 ranks (worst case) rather than doing a 4MB allreduce().  Any chance you could do a back-of-the-envelope calculation as to where the cross-over is?
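
Here is my own very rough cut with made-up numbers (so please correct them): a bandwidth-optimal allreduce of a 4MB vector moves on the order of 2 x 4MB = 8MB per rank, so at an assumed 1 GB/s per link that is roughly 8 ms before you even add the log(P) latency term.  Flushing 1024 targets at an assumed 1 us round trip is about 1 ms even fully serialized, and a barrier across 1M ranks is ~20 latency steps, call it tens of microseconds.  With those numbers, flushall()+barrier() wins by something like 5-10x, and the allreduce only catches up once each rank touches several thousand distinct targets per interval.  I would love to see the same calculation from someone who actually knows the BG network.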

BTW, if you could get hardware to expose remote completion information to you, this point would be moot ;-)

Keith

> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> Sent: Tuesday, May 18, 2010 10:04 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> 
> If the fence epoch involved 1 million ranks all doing origin and/or
> target operations, then the allreduce is likely to be efficient
> (unless a reduce_scatter_block sort of thing exists). If there are
> only 10 participants doing RMAs in a 1 million participant fence
> epoch, one would have to question why use fence for the
> synchronization - or why use a 1 million member communicator to
> create the window. That does not seem like an optimal communication
> pattern.
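> (If a reduce_scatter_block is usable here, I am thinking of something
> like MPI_Reduce_scatter_block(sent_counts, &expected_here, 1, MPI_INT,
> MPI_SUM, comm), which hands each rank back only its own total rather
> than the full million-entry vector.  Just a sketch; sent_counts and
> expected_here are placeholder names.)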
> 
> 
> _______________________________________________
> Douglas Miller                  BlueGene Messaging Development
> IBM Corp., Rochester, MN USA                     Bldg 030-2 A410
> dougmill at us.ibm.com               Douglas Miller/Rochester/IBM
> 
> 
> 
> From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
> Date: 05/18/2010 10:51 AM
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> Ah, so you allreduce(MPI_SUM, 1 million element integer vector == 4MB)?
> Then you know how many things you should have received and when you
> have received all of those you can enter a barrier?
> 
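> In other words, something like the sketch below (just my reading of it;
> the names are placeholders and I am glossing over how a target counts
> the operations that have actually landed)?
>
>   /* sketch only: nranks, comm, and myrank are assumed set up elsewhere */
>   int *sent     = calloc(nranks, sizeof(int));  /* ops issued, per target */
>   int *expected = calloc(nranks, sizeof(int));
>   /* ... issue Puts/Accumulates, doing sent[target]++ for each one ... */
>   MPI_Allreduce(sent, expected, nranks, MPI_INT, MPI_SUM, comm);
>   /* expected[myrank] = ops aimed at me this epoch; wait until that many
>      have arrived locally, then enter the barrier */
>   MPI_Barrier(comm);
>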
> Ok, now I am curious about a couple of other things:
> 
> 1) Can you separate RMA operations from other operations in your
> counts?  E.g. one-sided, non-blocking collectives, etc.
> 2) Is an allreduce of a big vector really more efficient in the general
> case?  I can see how it might (maybe) be better for HPCC RandomAccess
> types of things (i.e. if you did RandomAccess as Accumulate
> operations), but I am dubious about whether it helps for anything else.
> 
> I'm just wanting to make sure that the optimization potential here is
> real...
> 
> Thanks,
> Keith
> 
> > -----Original Message-----
> > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> > Sent: Tuesday, May 18, 2010 9:44 AM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > If each origin (in a fence epoch) keeps track of the count(s) of RMA
> > operations to each of its targets, then an allreduce of those arrays
> > will tell each target how many operations were done to itself and can
> > be used to determine completion.
> >
> > _______________________________________________
> > Douglas Miller                  BlueGene Messaging Development
> > IBM Corp., Rochester, MN USA                     Bldg 030-2 A410
> > dougmill at us.ibm.com               Douglas Miller/Rochester/IBM
> >
> >
> >
> > From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> > Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> > To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
> > Date: 05/18/2010 10:23 AM
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > Sorry, but you lost me at “we could just do an allreduce to look at
> > counts”.  Could you go into a bit more detail?  If you have received
> > counts from all ranks at all ranks (um, that doesn’t seem scalable),
> > then it would seem that an allfenceall() would require an Alltoall()
> > to figure out if everybody was safe.  I don’t see how an allreduce
> > would do the job.  But, I’ll admit that I don’t really know DCMF or
> > the BG network interface architecture or… So, I could just be missing
> > something here.
> >
> > Thanks,
> > Keith
> >
> > From: mpi3-rma-bounces at lists.mpi-forum.org [
> > mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Brian Smith
> > Sent: Tuesday, May 18, 2010 4:57 AM
> > To: MPI 3.0 Remote Memory Access working group
> > Cc: MPI 3.0 Remote Memory Access working group;
> > mpi3-rma-bounces at lists.mpi-forum.org
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> >
> > Sorry for the late response....
> > On BGP, DCMF Put/Get doesn't do any accounting and DCMF doesn't
> > actually have a fence operation. There is no hardware to determine
> > when a put/get has completed either. We need to send a get along the
> > same (deterministically routed) path to "flush" any messages out to
> > claim we are synchronized.
> >
> > When we implemented ARMCI, we introduced accounting in our "glue" on
> > top of DCMF because of the ARMCI_Fence() operation. There are similar
> > concerns in the MPI one-sided "glue".
> >
> > Going forward, we need to figure out how we'd implement the new MPI
> > RMA operations and determine if there would be accounting required.
> > If there would be (and I'm thinking there would), then an allfenceall
> > in MPI would be easy enough to do and would provide a significant
> > benefit on BG. We could just do an allreduce to look at counts. If
> > the standard procedure is fenceall()+barrier(), I could do that much
> > better as an allfenceall call.
> >
> > On platforms that have some sort of native accounting, this
> > allfenceall would only be the overhead of a barrier. So I think an
> > allfenceall has significant value to middleware beyond DCMF, and
> > therefore I would strongly encourage it in MPI, especially given the
> > use-cases we heard from Jeff H. at the forum meeting.
> >
> > This scenario is the same in our next super-secret product offering
> > everyone knows about but I don't know if *I* can mention.
> >
> >
> > Brian Smith (smithbr at us.ibm.com)
> > BlueGene MPI Development/
> > Communications Team Lead
> > IBM Rochester
> > Phone: 507 253 4717
> >
> >
> >
> >
> > From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> > Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> > To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
> > Date: 05/16/2010 09:33 PM
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > Before doing that, can someone sketch out the platform/API and the
> > implementation that makes that more efficient?  There is no gain for
> > Portals (3 or 4).  There is no gain for anything that supports Cray
> > SHMEM
> > reasonably well (shmem_quiet() is approximately the same semantics as
> > MPI_flush_all).  Hrm, you can probably say the same thing about
> > anything
> > that supports UPC well - a strict access is basically a
> > MPI_flush_all();
> > MPI_Put(); MPI_flush_all();... Also, I thought somebody said that IB
> > gave
> > you a notification of remote completion...
> >
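> > For concreteness, the SHMEM pattern I have in mind (a sketch; dst, src,
> > nbytes, and target_pe are placeholders):
> >
> >   shmem_putmem(dst, src, nbytes, target_pe);  /* one-sided put */
> >   shmem_quiet();  /* completes my outstanding puts -- roughly the
> >                      flush_all semantics I mean above */
> >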
> > The question then turns to the "other networks".  If you can't figure
> > out
> > remote completion, then the collective is going to be pretty heavy,
> > right?
> >
> > Keith
> >
> > > -----Original Message-----
> > > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > > bounces at lists.mpi-forum.org] On Behalf Of Jeff Hammond
> > > Sent: Sunday, May 16, 2010 7:27 PM
> > > To: MPI 3.0 Remote Memory Access working group
> > > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> > >
> > > Torsten,
> > >
> > > There seemed to be decent agreement on adding MPI_Win_all_flush_all
> > > (equivalent to MPI_Win_flush_all called from every rank in the
> > > communicator associated with the window) since this function can be
> > > implemented far more efficiently as a collective than the equivalent
> > > point-wise function calls.
> > >
> > > Is there a problem with adding this to your proposal?
> > >
> > > Jeff
> > >
> > > On Sun, May 16, 2010 at 12:48 AM, Torsten Hoefler
> > > <htor at illinois.edu> wrote:
> > > > Hello all,
> > > >
> > > > After the discussions at the last Forum I updated the group's
> > > > first proposal.
> > > >
> > > > The proposal (one-side-2.pdf) is attached to the wiki page
> > > > https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/RmaWikiPage
> > > >
> > > > The changes with regards to the last version are:
> > > >
> > > > 1) added MPI_NOOP to MPI_Get_accumulate and MPI_Accumulate_get
> > > >
> > > > 2) (re)added MPI_Win_flush and MPI_Win_flush_all to passive
> > > >    target mode
> > > >
> > > > Some remarks:
> > > >
> > > > 1) We didn't straw-vote on MPI_Accumulate_get, so this function
> > > >    might go. The removal would be very clean.
> > > >
> > > > 2) Should we allow MPI_NOOP in MPI_Accumulate? (This does not make
> > > >    sense and is incorrect in my current proposal.)
> > > >
> > > > 3) Should we allow MPI_REPLACE in MPI_Get_accumulate/MPI_Accumulate_get?
> > > >    (this would make sense and is allowed in the current proposal but
> > > >    we didn't talk about it in the group)
> > > >
> > > >
> > > > All the Best,
> > > >  Torsten
> > > >
> > > > --
> > > >  bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
> > > > Torsten Hoefler         | Research Associate
> > > > Blue Waters Directorate | University of Illinois
> > > > 1205 W Clark Street     | Urbana, IL, 61801
> > > > NCSA Building           | +01 (217) 244-7736
> > > > _______________________________________________
> > > > mpi3-rma mailing list
> > > > mpi3-rma at lists.mpi-forum.org
> > > > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> > > >
> > >
> > >
> > >
> > > --
> > > Jeff Hammond
> > > Argonne Leadership Computing Facility
> > > jhammond at mcs.anl.gov / (630) 252-5381
> > > http://www.linkedin.com/in/jeffhammond
> > >
> > > _______________________________________________
> > > mpi3-rma mailing list
> > > mpi3-rma at lists.mpi-forum.org
> > > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> >
> > _______________________________________________
> > mpi3-rma mailing list
> > mpi3-rma at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> 
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> 



