[Mpi3-rma] RMA proposal 1 update
Douglas Miller
dougmill at us.ibm.com
Tue May 18 11:04:22 CDT 2010
If the fence epoch involved 1 million ranks all doing origin and/or target
operations, then the allreduce is likely to be efficient (unless a
reduce_scatter_block sort of thing exists). If there are only 10
participants doing RMAs in a 1 million participant fence epoch, one would
have to question why use fence for the synchronization - or why use a 1
million member communicator to create the window. That does not seem like
an optimal communication pattern.
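To put rough numbers on that trade-off, here is a small Python sketch. It is plain arithmetic, not MPI code. It assumes the counting scheme from the quoted message below, where every rank contributes a vector with one counter per rank in the window's communicator. The 4-byte counter size and the helper names are assumptions made for illustration.

```python
COUNTER_BYTES = 4  # assumed size of one integer counter


def allreduce_vector_bytes(comm_size: int) -> int:
    """Size of the count vector each rank contributes and receives.

    The length depends on the communicator size only, not on how many
    ranks actually issued RMA operations during the epoch.
    """
    return comm_size * COUNTER_BYTES


def reduce_scatter_block_bytes() -> int:
    """Size of the result if each rank received only its own summed counter."""
    return COUNTER_BYTES


def nonzero_fraction(active_targets: int, comm_size: int) -> float:
    """Upper bound on the fraction of vector entries that can be nonzero."""
    return active_targets / comm_size


full = allreduce_vector_bytes(1_000_000)      # 4,000,000 bytes, about 4 MB
own = reduce_scatter_block_bytes()            # 4 bytes
sparse = nonzero_fraction(10, 1_000_000)      # 10 active ranks out of 1 million
```

With 10 active ranks, nearly every entry of the 4 MB vector is zero, which is the case where fence over a 1 million member communicator looks like the wrong tool.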
_______________________________________________
Douglas Miller BlueGene Messaging Development
IBM Corp., Rochester, MN USA Bldg 030-2 A410
dougmill at us.ibm.com Douglas Miller/Rochester/IBM
From: "Underwood, Keith D" <keith.d.underwood at intel.com>
Sent by: mpi3-rma-bounces at lists.mpi-forum.org
To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
Date: 05/18/2010 10:51 AM
Subject: Re: [Mpi3-rma] RMA proposal 1 update
Please respond to: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
Ah, so you allreduce(MPI_SUM, 1 million element integer vector == 4MB)?
Then you know how many things you should have received and when you have
received all of those you can enter a barrier?
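A minimal single-process sketch of the scheme as restated above, in Python. No MPI library is used: the allreduce is simulated by summing the per-origin count vectors element-wise, which is what MPI_Allreduce with MPI_SUM would deliver to every rank. The function names and the sample counts are illustrative assumptions, not part of any proposal.

```python
from typing import List


def allreduce_sum(per_rank_vectors: List[List[int]]) -> List[int]:
    """Simulate MPI_Allreduce(MPI_SUM): element-wise sum over all ranks.

    In real MPI every rank would receive its own copy of this result.
    """
    nranks = len(per_rank_vectors)
    return [sum(vec[t] for vec in per_rank_vectors) for t in range(nranks)]


def expected_incoming(sent_counts: List[List[int]]) -> List[int]:
    """sent_counts[o][t] = number of RMA operations origin o issued to target t.

    Returns, for each target t, how many operations it should wait for.
    """
    return allreduce_sum(sent_counts)


# Hypothetical 4-rank epoch: row = origin, column = target.
sent_counts = [
    [0, 2, 0, 1],
    [1, 0, 0, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 0],
]
expected = expected_incoming(sent_counts)

# Each target compares its local arrival counter with its own entry.
# Once they match, every operation aimed at it has landed and it can
# enter the barrier.
received = [1, 5, 0, 1]  # assumed local arrival counters
ready = [received[t] == expected[t] for t in range(len(expected))]
```

Note that each target only ever reads its own entry of the summed vector, which is why a reduce-scatter style collective would carry the same information.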
Ok, now I am curious about a couple of other things:
1) Can you separate RMA operations from other operations in your counts?
E.g. one-sided, non-blocking collectives, etc.
2) Is an allreduce of a big vector really more efficient in the general
case? I can see how it might (maybe) be better for HPCC RandomAccess types
of things (i.e. if you did RandomAccess as Accumulate operations), but I am
dubious about whether it helps for anything else.
I'm just wanting to make sure that the optimization potential here is
real...
Thanks,
Keith
> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org
> [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Douglas Miller
> Sent: Tuesday, May 18, 2010 9:44 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
> If each origin (in a fence epoch) keeps track of the count(s) of RMA
> operations to each of its targets, then an allreduce of those arrays will
> tell each target how many operations were done to itself and can be used
> to determine completion.
>
> _______________________________________________
> Douglas Miller BlueGene Messaging Development
> IBM Corp., Rochester, MN USA Bldg 030-2 A410
> dougmill at us.ibm.com Douglas Miller/Rochester/IBM
>
>
>
> From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
> To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
> Date: 05/18/2010 10:23 AM
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> Please respond to: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
>
>
>
>
>
>
> Sorry, but you lost me at “we could just do an allreduce to look at
> counts”. Could you go into a bit more detail? If you have received counts
> from all ranks at all ranks (um, that doesn’t seem scalable), then it would
> seem that an allfenceall() would require an Alltoall() to figure out if
> everybody was safe. I don’t see how an allreduce would do the job. But
> I’ll admit that I don’t really know DCMF or the BG network interface
> architecture or… So, I could just be missing something here.
>
> Thanks,
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [
> mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Brian Smith
> Sent: Tuesday, May 18, 2010 4:57 AM
> To: MPI 3.0 Remote Memory Access working group
> Cc: MPI 3.0 Remote Memory Access working group;
> mpi3-rma-bounces at lists.mpi-forum.org
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
>
>
> Sorry for the late response....
> On BGP, DCMF Put/Get doesn't do any accounting and DCMF doesn't actually
> have a fence operation. There is no hardware to determine when a put/get
> has completed either. We need to send a get along the same
> (deterministically routed) path to "flush" any messages out to claim we
> are synchronized.
>
> When we implemented ARMCI, we introduced accounting in our "glue" on top
> of DCMF because of the ARMCI_Fence() operation. There are similar concerns
> in the MPI one-sided "glue".
>
> Going forward, we need to figure out how we'd implement the new MPI RMA
> operations and determine if there would be accounting required. If there
> would be (and I'm thinking there would), then an allfenceall in MPI would
> be easy enough to do and would provide a significant benefit on BG. We
> could just do an allreduce to look at counts. If the standard procedure is
> fenceall()+barrier(), I could do that much better as an allfenceall call.
>
> On platforms that have some sort of native accounting, this allfenceall
> would only be the overhead of a barrier. So I think an allfenceall has
> significant value to the middleware more than DCMF and therefore would
> strongly encourage it in MPI, especially given the use-cases we heard from
> Jeff H. at the forum meeting.
>
> This scenario is the same in our next super-secret product offering
> everyone knows about but I don't know if *I* can mention.
>
>
> Brian Smith (smithbr at us.ibm.com)
> BlueGene MPI Development/
> Communications Team Lead
> IBM Rochester
> Phone: 507 253 4717
>
>
>
>
> From: "Underwood, Keith D" <keith.d.underwood at intel.com>
> To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
> Date: 05/16/2010 09:33 PM
> Subject: Re: [Mpi3-rma] RMA proposal 1 update
> Sent by: mpi3-rma-bounces at lists.mpi-forum.org
>
>
>
>
>
>
>
>
> Before doing that, can someone sketch out the platform/API and the
> implementation that makes that more efficient? There is no gain for
> Portals (3 or 4). There is no gain for anything that supports Cray SHMEM
> reasonably well (shmem_quiet() is approximately the same semantics as
> MPI_flush_all). Hrm, you can probably say the same thing about anything
> that supports UPC well - a strict access is basically a MPI_flush_all();
> MPI_Put(); MPI_flush_all();... Also, I thought somebody said that IB gave
> you a notification of remote completion...
>
> The question then turns to the "other networks". If you can't figure out
> remote completion, then the collective is going to be pretty heavy, right?
>
> Keith
>
> > -----Original Message-----
> > From: mpi3-rma-bounces at lists.mpi-forum.org
> > [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Jeff Hammond
> > Sent: Sunday, May 16, 2010 7:27 PM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] RMA proposal 1 update
> >
> > Torsten,
> >
> > There seemed to be decent agreement on adding MPI_Win_all_flush_all
> > (equivalent to MPI_Win_flush_all called from every rank in the
> > communicator associated with the window) since this function can be
> > implemented far more efficiently as a collective than the equivalent
> > point-wise function calls.
> >
> > Is there a problem with adding this to your proposal?
> >
> > Jeff
> >
> > On Sun, May 16, 2010 at 12:48 AM, Torsten Hoefler <htor at illinois.edu>
> > wrote:
> > > Hello all,
> > >
> > > After the discussions at the last Forum I updated the group's first
> > > proposal.
> > >
> > > The proposal (one-side-2.pdf) is attached to the wiki page
> > > https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/RmaWikiPage
> > >
> > > The changes with regards to the last version are:
> > >
> > > 1) added MPI_NOOP to MPI_Get_accumulate and MPI_Accumulate_get
> > >
> > > 2) (re)added MPI_Win_flush and MPI_Win_flush_all to passive target mode
> > >
> > > Some remarks:
> > >
> > > 1) We didn't straw-vote on MPI_Accumulate_get, so this function might
> > >    go. The removal would be very clean.
> > >
> > > 2) Should we allow MPI_NOOP in MPI_Accumulate? (this does not make
> > >    sense and is incorrect in my current proposal)
> > >
> > > 3) Should we allow MPI_REPLACE in MPI_Get_accumulate/MPI_Accumulate_get?
> > >    (this would make sense and is allowed in the current proposal but we
> > >    didn't talk about it in the group)
> > >
> > >
> > > All the Best,
> > > Torsten
> > >
> > > --
> > > bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
> > > Torsten Hoefler | Research Associate
> > > Blue Waters Directorate | University of Illinois
> > > 1205 W Clark Street | Urbana, IL, 61801
> > > NCSA Building | +01 (217) 244-7736
> > > _______________________________________________
> > > mpi3-rma mailing list
> > > mpi3-rma at lists.mpi-forum.org
> > > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> > >
> >
> >
> >
> > --
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > jhammond at mcs.anl.gov / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> >