[Mpi3-rma] RMA proposal 1 update

Douglas Miller dougmill at us.ibm.com
Tue May 18 11:04:22 CDT 2010


If the fence epoch involves 1 million ranks all doing origin and/or target
operations, then the allreduce is likely to be efficient (unless something
like a reduce_scatter_block exists, which would be even better since each
target needs only its own element). If there are only 10 participants doing
RMA in a 1-million-participant fence epoch, one has to question why fence is
used for the synchronization at all - or why a 1-million-member communicator
was used to create the window. That does not seem like an optimal
communication pattern.
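
For concreteness, here is a rough sketch of the counting scheme under
discussion. The names are hypothetical (this is not actual DCMF or MPI
one-sided glue code): put_counts[t] is the number of RMA operations this
rank issued to target t during the epoch, and wait_for_local_arrivals() is
a stand-in for whatever accounting hook the messaging layer provides.

#include <mpi.h>

/* Hypothetical hook: blocks until the local arrival counter for the
   current epoch reaches "expected". */
extern void wait_for_local_arrivals(int expected);

void allfenceall_sketch(MPI_Comm comm, const int *put_counts)
{
    int expected = 0;

    /* Each rank learns how many operations in the epoch targeted it.
       MPI_Reduce_scatter_block hands every rank only its own element;
       an MPI_Allreduce of the full vector (the 4MB case below) would
       carry the same information at higher cost. */
    MPI_Reduce_scatter_block((void *)put_counts, &expected, 1, MPI_INT,
                             MPI_SUM, comm);

    /* Wait until that many operations have actually arrived locally. */
    wait_for_local_arrivals(expected);

    /* Whether a closing barrier is still required for full fence
       semantics is part of the discussion below; it is the conservative
       choice here. */
    MPI_Barrier(comm);
}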


_______________________________________________
Douglas Miller                  BlueGene Messaging Development
IBM Corp., Rochester, MN USA                     Bldg 030-2 A410
dougmill at us.ibm.com               Douglas Miller/Rochester/IBM


From: "Underwood, Keith D" <keith.d.underwood at intel.com>
Sent by: mpi3-rma-bounces at lists.mpi-forum.org
To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
Date: 05/18/2010 10:51 AM
Subject: Re: [Mpi3-rma] RMA proposal 1 update

Ah, so you allreduce(MPI_SUM, 1 million element integer vector == 4MB)?
Then you know how many things you should have received, and when you have
received all of them you can enter a barrier?

Ok, now I am curious about a couple of other things:

1) Can you separate RMA operations from other operations (e.g. one-sided,
non-blocking collectives, etc.) in your counts?
2) Is an allreduce of a big vector really more efficient in the general
case?  I can see how it might (maybe) be better for HPCC RandomAccess types
of things (i.e. if you did RandomAccess as Accumulate operations), but I am
dubious about whether it helps for anything else.

I just want to make sure that the optimization potential here is real...

Thanks,
Keith

> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> Sent: Tuesday, May 18, 2010 9:44 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> If each origin (in a fence epoch) keeps track of the count of RMA
> operations to each of its targets, then an allreduce of those arrays
> will tell each target how many operations were directed at it, which
> can be used to determine completion.
>
> _______________________________________________
> Douglas Miller                  BlueGene Messaging Development
> IBM Corp., Rochester, MN USA                     Bldg 030-2 A410
> dougmill at us.ibm.com               Douglas Miller/Rochester/IBM
>
>
>
> From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
> Date: 05/18/2010 10:23 AM
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> Sorry, but you lost me at “we could just do an allreduce to look at
> counts”.  Could you go into a bit more detail?  If you have received
> counts from all ranks at all ranks (um, that doesn’t seem scalable), then
> it would seem that an allfenceall() would require an Alltoall() to figure
> out whether everybody was safe.  I don’t see how an allreduce would do
> the job.  But I’ll admit that I don’t really know DCMF or the BG network
> interface architecture or… So, I could just be missing something here.
>
> Thanks,
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [
> mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Brian Smith
> Sent: Tuesday, May 18, 2010 4:57 AM
> To: MPI 3.0 Remote Memory Access working group
> Cc: MPI 3.0 Remote Memory Access working group;
> mpi3-rma-bounces at lists.mpi-forum.org
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
>
> Sorry for the late response....
> On BGP, DCMF Put/Get doesn't do any accounting and DCMF doesn't
> actually
> have a fence operation. There is no hardware to determine when a
> put/get
> has completed either. We need to send a get along the same
> (deterministically routed) path to "flush" any messages out to claim we
> are
> synchronized.
>
> When we implemented ARMCI, we introduced accounting in our "glue" on
> top of
> DCMF because of the ARMCI_Fence() operation. There are similar concerns
> in
> the MPI one-sided "glue".
>
> Going forward, we need to figure out how we'd implement the new MPI RMA
> operations and determine whether accounting would be required. If it
> would be (and I'm thinking it would), then an allfenceall in MPI would be
> easy enough to do and would provide a significant benefit on BG: we could
> just do an allreduce to look at the counts. If the standard procedure is
> fenceall()+barrier(), I could do that much better as an allfenceall call.
>
> On platforms that have some sort of native accounting, this allfenceall
> would only cost the overhead of a barrier. So I think an allfenceall has
> significant value to middleware beyond DCMF, and I would therefore
> strongly encourage adding it to MPI, especially given the use cases we
> heard from Jeff H. at the Forum meeting.
>
> This scenario is the same in our next super-secret product offering, the
> one everyone knows about but that I don't know if *I* can mention.
>
>
> Brian Smith (smithbr at us.ibm.com)
> BlueGene MPI Development/
> Communications Team Lead
> IBM Rochester
> Phone: 507 253 4717
>
>
>
>
> From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
> Date: 05/16/2010 09:33 PM
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
>
> Before doing that, can someone sketch out the platform/API and the
> implementation that makes that more efficient?  There is no gain for
> Portals (3 or 4).  There is no gain for anything that supports Cray SHMEM
> reasonably well (shmem_quiet() has approximately the same semantics as
> MPI_flush_all).  Hrm, you can probably say the same thing about anything
> that supports UPC well - a strict access is basically an MPI_flush_all();
> MPI_Put(); MPI_flush_all();...  Also, I thought somebody said that IB
> gave you a notification of remote completion...
>
> The question then turns to the "other networks".  If you can't figure out
> remote completion, then the collective is going to be pretty heavy, right?
>
> Keith
>
> > -----Original Message-----
> > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > bounces at lists.mpi-forum.org] On Behalf Of Jeff Hammond
> > Sent: Sunday, May 16, 2010 7:27 PM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > Torsten,
> >
> > There seemed to be decent agreement on adding MPI_Win_all_flush_all
> > (equivalent to MPI_Win_flush_all called from every rank in the
> > communicator associated with the window) since this function can be
> > implemented far more efficiently as a collective than the equivalent
> > point-wise function calls.
> >
> > Is there a problem with adding this to your proposal?
> >
> > Jeff
> >
> > On Sun, May 16, 2010 at 12:48 AM, Torsten Hoefler <htor at illinois.edu>
> > wrote:
> > > Hello all,
> > >
> > > After the discussions at the last Forum I updated the group's first
> > > proposal.
> > >
> > > The proposal (one-side-2.pdf) is attached to the wiki page
> > > https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/RmaWikiPage
> > >
> > > The changes with regards to the last version are:
> > >
> > > 1) added MPI_NOOP to MPI_Get_accumulate and MPI_Accumulate_get
> > >
> > > 2) (re)added MPI_Win_flush and MPI_Win_flush_all to passive target mode
> > >
> > > Some remarks:
> > >
> > > 1) We didn't straw-vote on MPI_Accumulate_get, so this function might
> > >    go. The removal would be very clean.
> > >
> > > 2) Should we allow MPI_NOOP in MPI_Accumulate? (This does not make
> > >    sense and is incorrect in my current proposal.)
> > >
> > > 3) Should we allow MPI_REPLACE in MPI_Get_accumulate/MPI_Accumulate_get?
> > >    (This would make sense and is allowed in the current proposal, but
> > >    we didn't talk about it in the group.)
> > >
> > >
> > > All the Best,
> > >  Torsten
> > >
> > > --
> > >  bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ ---
> --
> > > Torsten Hoefler         | Research Associate
> > > Blue Waters Directorate | University of Illinois
> > > 1205 W Clark Street     | Urbana, IL, 61801
> > > NCSA Building           | +01 (217) 244-7736
> > > _______________________________________________
> > > mpi3-rma mailing list
> > > mpi3-rma at lists.mpi-forum.org
> > > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> > >
> >
> >
> >
> > --
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > jhammond at mcs.anl.gov / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> >
> > _______________________________________________
> > mpi3-rma mailing list
> > mpi3-rma at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma

_______________________________________________
mpi3-rma mailing list
mpi3-rma at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma



