[Mpi3-rma] RMA proposal 1 update

Jesper Larsson Traff traff at par.univie.ac.at
Tue May 18 10:52:35 CDT 2010


just the trivial remark that a reduce_scatter_block operation also
does this counting and may be more efficient
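The semantics Jesper refers to can be simulated without MPI. In the sketch below (the matrix, sizes, and helper name are invented for illustration), each origin keeps `sent[t]` = number of RMA operations it issued to target t during the epoch; MPI_Reduce_scatter_block with MPI_SUM would sum these arrays elementwise and deliver to rank r only element r:

```c
#include <assert.h>

enum { P = 4 };  /* number of ranks in the (simulated) fence epoch */

/* What rank r would receive from
 * MPI_Reduce_scatter_block(sent, &mine, 1, MPI_INT, MPI_SUM, comm):
 * the column sum over all origins' sent[] arrays, i.e. the total
 * number of operations targeting r. */
int ops_targeting(const int sent[P][P], int r)
{
    int total = 0;
    for (int origin = 0; origin < P; origin++)
        total += sent[origin][r];
    return total;
}

/* Illustrative epoch: sent[o][t] = ops issued by origin o to target t. */
const int epoch[P][P] = {
    {0, 2, 1, 0},
    {3, 0, 0, 1},
    {0, 1, 0, 2},
    {1, 1, 1, 0},
};
```

With this epoch, rank 0 would learn that 4 operations target it (3 from rank 1, 1 from rank 3) and can wait for that many completions; unlike an allreduce, each rank receives only its own entry rather than the whole vector.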

Jesper

On Tue, May 18, 2010 at 10:44:19AM -0500, Douglas Miller wrote:
> If each origin (in a fence epoch) keeps track of the count(s) of RMA
> operations to each of its targets, then an allreduce of those arrays will
> tell each target how many operations were done to itself and can be used to
> determine completion.
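The allreduce variant Douglas describes can be sketched the same way (all names below are illustrative, not from any real API): summing the per-origin count arrays elementwise, as MPI_Allreduce with MPI_SUM would, leaves every rank holding the full vector of totals, of which it only needs its own entry:

```c
#include <assert.h>

enum { NPROCS = 3 };  /* simulated communicator size */

/* Elementwise sum of all origins' counts arrays -- what
 * MPI_Allreduce(counts, totals, NPROCS, MPI_INT, MPI_SUM, comm)
 * would leave at every rank.  Rank r then knows that totals[r]
 * operations target it and can poll until that many have completed. */
void sum_counts(const int counts[NPROCS][NPROCS], int totals[NPROCS])
{
    for (int t = 0; t < NPROCS; t++) {
        totals[t] = 0;
        for (int origin = 0; origin < NPROCS; origin++)
            totals[t] += counts[origin][t];
    }
}

/* counts[o][t] = RMA operations origin o issued to target t */
const int counts_example[NPROCS][NPROCS] = {
    {0, 1, 2},
    {3, 0, 1},
    {2, 2, 0},
};
```

Note that after the allreduce every rank holds the counts for all ranks, which is the scalability concern Keith raises downthread; a reduce-scatter delivers only the one entry each rank needs.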
> 
> _______________________________________________
> Douglas Miller                  BlueGene Messaging Development
> IBM Corp., Rochester, MN USA                     Bldg 030-2 A410
> dougmill at us.ibm.com               Douglas Miller/Rochester/IBM
> 
> 
> From:     "Underwood, Keith D" <keith.d.underwood at intel.com>
> Sent by:  mpi3-rma-bounces at lists.mpi-forum.org
> To:       "MPI 3.0 Remote Memory Access working group"
>           <mpi3-rma at lists.mpi-forum.org>
> Date:     05/18/2010 10:23 AM
> Subject:  Re: [Mpi3-rma] RMA proposal 1 update
> 
> Sorry, but you lost me at "we could just do an allreduce to look at
> counts".  Could you go into a bit more detail?  If you have received counts
> from all ranks at all ranks (um, that doesn't seem scalable), then it would
> seem that an allfenceall() would require an Alltoall() to figure out if
> everybody was safe.  I don't see how an allreduce would do the job.  But
> I'll admit that I don't really know DCMF or the BG network interface
> architecture or...  So, I could just be missing something here.
> 
> Thanks,
> Keith
> 
> From: mpi3-rma-bounces at lists.mpi-forum.org [
> mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Brian Smith
> Sent: Tuesday, May 18, 2010 4:57 AM
> To: MPI 3.0 Remote Memory Access working group
> Cc: MPI 3.0 Remote Memory Access working group;
> mpi3-rma-bounces at lists.mpi-forum.org
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> 
> 
> Sorry for the late response....
> On BGP, DCMF Put/Get doesn't do any accounting and DCMF doesn't actually
> have a fence operation. There is no hardware to determine when a put/get
> has completed either. We need to send a get along the same
> (deterministically routed) path to "flush" any messages out to claim we are
> synchronized.
> 
> When we implemented ARMCI, we introduced accounting in our "glue" on top of
> DCMF because of the ARMCI_Fence() operation. There are similar concerns in
> the MPI one-sided "glue".
> 
> Going forward, we need to figure out how we'd implement the new MPI RMA
> operations and determine if there would be accounting required. If there
> would be (and I'm thinking there would), then an allfenceall in MPI would
> be easy enough to do and would provide a significant benefit on BG. We
> could just do an allreduce to look at counts. If the standard procedure is
> fenceall()+barrier(), I could do that much better as an allfenceall call.
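A minimal sketch of the completion test such an allfenceall could use, assuming a counting implementation as Brian describes (the function names and the simulated delivery counter are invented for illustration):

```c
#include <assert.h>

/* Hypothetical allfenceall completion check: after the allreduce, a
 * rank knows how many operations target it (expected); the messaging
 * layer bumps a local counter (delivered) as operations land.  The
 * rank may leave the fence once the two match. */
int fence_done(int delivered, int expected)
{
    return delivered >= expected;
}

/* Simulate deliveries arriving one at a time until the fence opens;
 * returns the number of arrivals consumed. */
int wait_for_fence(int expected)
{
    int delivered = 0;
    while (!fence_done(delivered, expected))
        delivered++;  /* stand-in for one remote operation completing */
    return delivered;
}
```

The point of folding this into a single collective is that the count exchange and the barrier-like synchronization share one network operation, instead of a fenceall followed by a separate barrier.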
> 
> On platforms that have some sort of native accounting, this allfenceall
> would only cost the overhead of a barrier. So I think an allfenceall has
> significant value to middleware beyond DCMF, and I would therefore
> strongly encourage adding it to MPI, especially given the use cases we
> heard from Jeff H. at the forum meeting.
> 
> This scenario is the same in our next super-secret product offering
> everyone knows about but I don't know if *I* can mention.
> 
> 
> Brian Smith (smithbr at us.ibm.com)
> BlueGene MPI Development/
> Communications Team Lead
> IBM Rochester
> Phone: 507 253 4717
> 
> 
> 
>  From:    "Underwood, Keith D" <keith.d.underwood at intel.com>
>  To:      "MPI 3.0 Remote Memory Access working group"
>           <mpi3-rma at lists.mpi-forum.org>
>  Date:    05/16/2010 09:33 PM
>  Subject: Re: [Mpi3-rma] RMA proposal 1 update
>  Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> 
> Before doing that, can someone sketch out the platform/API and the
> implementation that makes that more efficient?  There is no gain for
> Portals (3 or 4).  There is no gain for anything that supports Cray SHMEM
> reasonably well (shmem_quiet() is approximately the same semantics as
> MPI_flush_all).  Hrm, you can probably say the same thing about anything
> that supports UPC well - a strict access is basically an MPI_flush_all();
> MPI_Put(); MPI_flush_all();... Also, I thought somebody said that IB gave
> you a notification of remote completion...
> 
> The question then turns to the "other networks".  If you can't figure out
> remote completion, then the collective is going to be pretty heavy, right?
> 
> Keith
> 
> > -----Original Message-----
> > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > bounces at lists.mpi-forum.org] On Behalf Of Jeff Hammond
> > Sent: Sunday, May 16, 2010 7:27 PM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > Torsten,
> >
> > There seemed to be decent agreement on adding MPI_Win_all_flush_all
> > (equivalent to MPI_Win_flush_all called from every rank in the
> > communicator associated with the window) since this function can be
> > implemented far more efficiently as a collective than the equivalent
> > point-wise function calls.
> >
> > Is there a problem with adding this to your proposal?
> >
> > Jeff
> >
> > On Sun, May 16, 2010 at 12:48 AM, Torsten Hoefler <htor at illinois.edu>
> > wrote:
> > > Hello all,
> > >
> > > After the discussions at the last Forum I updated the group's first
> > > proposal.
> > >
> > > The proposal (one-side-2.pdf) is attached to the wiki page
> > > https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/RmaWikiPage
> > >
> > > The changes with regards to the last version are:
> > >
> > > 1) added MPI_NOOP to MPI_Get_accumulate and MPI_Accumulate_get
> > >
> > > 2) (re)added MPI_Win_flush and MPI_Win_flush_all to passive target mode
> > >
> > > Some remarks:
> > >
> > > 1) We didn't straw-vote on MPI_Accumulate_get, so this function might
> > >   go. The removal would be very clean.
> > >
> > > 2) Should we allow MPI_NOOP in MPI_Accumulate? (This does not make
> > >    sense and is incorrect in my current proposal.)
> > >
> > > 3) Should we allow MPI_REPLACE in MPI_Get_accumulate/MPI_Accumulate_get?
> > >    (This would make sense and is allowed in the current proposal, but
> > >    we didn't talk about it in the group.)
> > >
> > >
> > > All the Best,
> > >  Torsten
> > >
> > > --
> > >  bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
> > > Torsten Hoefler         | Research Associate
> > > Blue Waters Directorate | University of Illinois
> > > 1205 W Clark Street     | Urbana, IL, 61801
> > > NCSA Building           | +01 (217) 244-7736
> > > _______________________________________________
> > > mpi3-rma mailing list
> > > mpi3-rma at lists.mpi-forum.org
> > > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> > >
> >
> >
> >
> > --
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > jhammond at mcs.anl.gov / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> >
> 


