[Mpi3-rma] MPI3 RMA Design Goals
William Gropp
wgropp at illinois.edu
Sat Sep 5 05:35:24 CDT 2009
One other possibility is to provide a separate path for "comm_world"
access - something like the MPI_WIN_WORLD in ticket #139.
I've also added a note to the page to remind readers that requiring a
feature does not prohibit supporting a specialization.
Bill
On Sep 4, 2009, at 11:11 AM, Underwood, Keith D wrote:
> I agree that the issue is fundamental on some networks, but it is
> not going to be fundamental across all networks. I’m more worried
> about the “not broken” networks ;-) Actually, I’m more worried
> about the ability to build a not-broken network for the purpose.
> And, having a collective allocation fixes most of the issues, but it
> does leave some open questions. I would like us to update the
> design goal to acknowledge that collective allocation is allowed,
> but not required.
>
> Anyway, as far as what I’m worried about, I’ll use a very
> specialized example to make the point...
>
> So, I’ll start by assuming shmem, Portals 3.3, a lightweight kernel,
> and hardware optimized to try to do that. I apologize to those not
> versed in Portals ;-)
>
> Let’s say that shmem uses Portals to expose the entire virtual
> address space with an ME. The ME is persistent (never changes) and
> is the only item on a given portal table entry. The hardware takes
> the head of that list and caches it in an associative matching
> structure. Now you can process a message from the network every
> cycle… Oh, the hardware has to do virtual to physical address
> translation and it had better do protection based on whether a
> virtual page is physically backed or not, but, wait, I can constrain
> the kernel to always guarantee that ;-)
>
> At the transmitter, let’s say that you can push a (PE, portal table
> entry, offset) tuple as the target address directly to the hardware
> somehow… the T3E did something like this using E-registers, a
> centrifuge, and a machine partitioning approach. Ok, that takes a
> change to the Portals API, but we did that for Portals 4. Now you
> can push messages into the network very quickly too (PE is in the
> shmem call, portal table entry is a constant, offset is in virtual
> address space and is symmetric across nodes, so it is one or two
> arithmetic instructions).
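>
> To make the transmit side concrete, here is a minimal sketch under
> those assumptions (symmetric virtual addresses, one constant portal
> table entry); push_to_nic and SHMEM_PTL_INDEX are invented names, not
> part of the Portals API:
>
>    #include <stdint.h>
>    #include <stdio.h>
>
>    #define SHMEM_PTL_INDEX 4   /* the single, constant portal table entry */
>
>    /* Stand-in for the hypothetical doorbell write that hands the NIC a
>       (PE, portal index, offset) tuple; in hardware this would be a
>       couple of stores, not a call. */
>    static void push_to_nic(int pe, int ptl, uint64_t offset,
>                            const void *src, size_t len)
>    {
>        printf("put -> PE %d, ptl %d, offset 0x%llx, %zu bytes\n",
>               pe, ptl, (unsigned long long)offset, len);
>        (void)src;
>    }
>
>    static void sketch_put(void *target, const void *src, size_t len, int pe)
>    {
>        /* Symmetric address space: the remote offset is simply the local
>           virtual address of the target object. */
>        push_to_nic(pe, SHMEM_PTL_INDEX, (uint64_t)(uintptr_t)target, src, len);
>    }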
>
> And, of course, there are platforms like the Cray X1 that had even
> better hardware support for this type of operation.
>
> Ok, so, how does the proposal we are discussing for MPI differ?
> 1) There is a communicator context. How much work this is
> depends on how much you need from that context.
> 2) There is a “memory object”. How many of these are there?
> Is there anything that strongly encourages the user to keep that
> number relatively small? If not, how do I build good hardware
> support? I can’t have an unlimited number of unique system-wide
> identifiers.
> 3) We have the nagging issue of non-coherent architectures and
> heterogeneous architectures. Ok, that one is probably workable,
> since vendor-specific MPI implementations may drop both if their
> architecture is coherent and homogeneous. Of course, if it causes
> us to do something weird with the completion semantics (a la the
> ability to defer all actual data transfer to the end of a window in
> the current MPI-2 one-sided operations), that could be an issue.
>
> So, the telling issue will be: if you put this proposal on a Cray
> X1 (does ORNL still have one of those running?), how does it compare to
> the message rate of shmem on the same platform? Perhaps doing such
> a prototype would help us advance the discussion more than
> anything. We would have a baseline where we could say “what would
> it take to get competitive?”. Unfortunately, I don’t know of many
> other places that currently support shmem well enough to make a good
> comparison.
>
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org
> ] On Behalf Of Vinod tipparaju
> Sent: Friday, September 04, 2009 9:41 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> Hence the word "similar" to MDBind. MDBind can potentially do a lot
> more as far as preparing the network goes. Binding is for reducing
> latency, for no other reason. I agree that in your example of random
> target communication, it is not useful.
>
> You are right, and I have always agreed that when you do need such a
> table of mem-object data structures, or a table of pointers under the
> hood of a collective mem-object implementation, random accesses will
> incur cache misses and latency. I get that. My points are (please
> let me know if you disagree with any of the bullets):
> 1) The problem doesn't exist if collective memobjs are used and the
> memobjs can internally allocate symmetric memory (same pointers and
> same "keys" across the system).
> 2) This is a more fundamental problem associated with whether it is
> possible to allocate symmetric memory and corresponding symmetric
> key-ids on networks that require them.
> 3) This problem is the same for shmem as for these interfaces.
> 4) This problem of cache misses on a random-target communication call
> will occur if: a) the implementation of a collective object requires
> an array of keys or memory region identifiers for the sake of
> communication and creation of a single key is not possible, b) an
> array of pointers is required on the system instead of one symmetric
> pointer, or c) if, say, OFED hands out different keys on each node
> and the user cannot pass in the number that he/she wants to be the key
> value -- there is nothing you, I, or any interface (the "users")
> can do to fix this.
> To me it seems like we are discussing a case that is a fundamental
> problem we don’t have control over. I cannot see how you would
> define an interface that avoids cache misses for random-target
> communication in case 4).
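>
> To make the contrast between 1) and 4) concrete, here is a rough
> sketch of the two internal shapes a collective mem_obj could take;
> the type and field names are invented purely for illustration:
>
>    #include <stdint.h>
>
>    /* Point 1): symmetric allocation succeeded, so one scalar descriptor
>       works for every target -- no per-target load on the critical path. */
>    typedef struct {
>        uint64_t base;   /* same virtual address on every process */
>        uint32_t key;    /* same network key on every process     */
>    } symmetric_memobj;
>
>    /* Point 4): keys (and possibly pointers) differ per node, so the
>       implementation must index a per-rank table -- the unavoidable
>       cache miss for random targets. */
>    typedef struct {
>        uint64_t *base;  /* base[rank] */
>        uint32_t *key;   /* key[rank], e.g. network-assigned rkeys */
>    } table_memobj;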
>
>
> So the question is: are these interfaces causing cache misses that
> could otherwise be avoided? I don’t think so. Do you? If you
> agree, we have concluded the discussion.
>
> --
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
>
>
> From: keith.d.underwood at intel.com
> To: mpi3-rma at lists.mpi-forum.org
> Date: Fri, 4 Sep 2009 08:51:12 -0600
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> Bill’s proposal was for a persistent handle for transmit operations
> that included things like target rank. I think there is some merit
> in that, though we need to evaluate the trade-offs. Unfortunately,
> that does not really do anything for the random communication
> scenario. In the non-random communication scenarios, I’m a lot
> less worried about things like random lookups of local data
> structures getting in the way.
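>
> For reference, a rough sketch of the shape such a persistent,
> target-carrying handle might take; every name below is invented for
> illustration and is not part of any proposal text:
>
>    #include <stddef.h>
>    #include <stdint.h>
>
>    /* Resolve the per-target information once (rank, region, key, route),
>       then reuse it on the fast path for repeated transfers. */
>    typedef struct {
>        int      target_rank;
>        uint64_t remote_base;   /* resolved target region base */
>        uint32_t rkey;          /* plus whatever the NIC wants cached */
>    } rma_bound_handle;
>
>    int RMA_bind(void *mem_obj, int target_rank, rma_bound_handle *h);
>    int RMA_xfer_bound(const rma_bound_handle *h, uint64_t offset,
>                       const void *src, size_t len);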
>
> Bill’s proposal was nothing like MDBind in Portals, and MDBind does
> nothing to help the issue I was concerned about. Specifically,
> MDBind only associates local options and a local memory region with
> a handle. It says nothing about the target node or the target
> memory region. It is the lookup of information associated with the
> target node and target memory region that I am worried about and
> that is addressed by Bill’s proposal for the non-random access case.
>
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org
> ] On Behalf Of Vinod tipparaju
> Sent: Thursday, September 03, 2009 10:39 PM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> I forgot to include an important parameter (communicator) in the
> pseudo-interface below:
> Create_memobj_collective(IN user_ptr, IN size, IN communicator, OUT
> mem_obj)
>
> In addition to this, Bill suggested a Bind interface (something
> similar to MDBind in Portals) that would help reduce latency for
> commonly re-used RMAs.
>
>
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
>
> From: wgropp at illinois.edu
> To: mpi3-rma at lists.mpi-forum.org
> Date: Thu, 3 Sep 2009 13:20:38 -0500
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> Design goal one allows collective creation of objects; it's there
> because many important algorithms don't have collective (over
> MPI_COMM_WORLD) allocation semantics, and a design that *requires*
> collective creation of memory objects will also limit the use of the
> interface.
>
> Bill
>
> On Sep 3, 2009, at 10:49 AM, Underwood, Keith D wrote:
>
> My commentary was on the design goals… if we allow collective
> creation of memory objects, and design goal #1 simply says we don’t
> require it, that may be ok. Design goal #1 could be interpreted to
> mean that you wouldn’t have collective creation in the semantics at
> all. Do you really anticipate one data type for an object that is
> either collectively or non-collectively created?
>
> I strongly disagree with your assertion that you can communicate
> with no cache misses for the non-collectively allocated memory
> object. In a non-collectively allocated case, you will have to keep
> an array of these on every process, right? i.e. one for every
> process you are communicating with? Randomly indexing that array is
> going to pound on your cache.
>
> We need to make sure that we don’t ignore the overhead of having
> multiple communicators and heterogeneity. Yes, I think there are
> ways around this, but we should at least consider what is practical
> and likely rather than just what is possible.
>
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org
> ] On Behalf Of Vinod tipparaju
> Sent: Tuesday, September 01, 2009 10:15 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
>
> You are correct in trying to look at the best possible case and
> estimating cache-misses/performance-bottlenecks. However, I
> personally don't see any difference between this and shmem. When you
> cannot really allocate symmetric memory underneath, the amount of
> bookkeeping is the same in both cases. When there is no
> heterogeneity, the check for this can be disabled at MPI startup.
> When there is heterogeneity, we cannot compare with shmem.
>
> I cannot imagine not having symmetric/collective memory object
> creation to support these RMA interfaces; I think it is a must-have.
> Sorry, I have only been saying we should have these interfaces but
> haven't given any example of this yet. Given how many times this
> same issue is coming up, I will do it now.
>
> Consider the creation interfaces:
> Create_memobj(IN user_ptr, IN size, OUT mem_obj)
> Create_memobj_collective(user_ptr, size, OUT mem_obj)
> Assign_memobj(IN/OUT mem_obj, IN user_address, IN size)
>
> There will be more details on how a mem object that is the result of
> create_memobj on process A gets exchanged with process B. When it
> is exchanged explicitly, the heterogeneity information can be
> created at process B.
>
> Now take the example with a symmetric object:
>
> Process A
>
> myptr = allocate(mysize);
> Create_memobj_collective(myptr,mysize, all_obj);
> Do all kinds of RMA_Xfers
>
> and an example without a symmetric object:
>
> myptr = allocate(mysize);
> Create_memobj(myptr,mysize,my_obj);
> ----exchange objects here----
> Do all kinds of RMA_Xfers
>
> In both cases, I can see being able to communicate without any cache
> misses for mem_obj.
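>
> Spelled out as C, the two flows might look roughly like this;
> mem_obj_t, Create_memobj, and Create_memobj_collective are just the
> pseudo-interfaces above written as prototypes, and the explicit
> exchange step is shown with MPI_Allgather purely for illustration:
>
>    #include <mpi.h>
>    #include <stdint.h>
>    #include <stdlib.h>
>
>    typedef struct { uint64_t base; uint32_t key; } mem_obj_t;  /* placeholder */
>
>    int Create_memobj(void *ptr, size_t size, mem_obj_t *obj);
>    int Create_memobj_collective(void *ptr, size_t size, MPI_Comm comm,
>                                 mem_obj_t *obj);
>
>    /* Collective flow: one symmetric object describes every target. */
>    void collective_flow(size_t mysize, MPI_Comm comm)
>    {
>        void *myptr = malloc(mysize);
>        mem_obj_t all_obj;
>        Create_memobj_collective(myptr, mysize, comm, &all_obj);
>        /* RMA_xfer(..., &all_obj, any_rank, ...); */
>    }
>
>    /* Non-collective flow: each rank exchanges its object, so the caller
>       ends up holding one mem_obj_t per peer. */
>    void independent_flow(size_t mysize, MPI_Comm comm)
>    {
>        int nranks;
>        MPI_Comm_size(comm, &nranks);
>        void *myptr = malloc(mysize);
>        mem_obj_t my_obj, *objs = malloc(nranks * sizeof(mem_obj_t));
>        Create_memobj(myptr, mysize, &my_obj);
>        MPI_Allgather(&my_obj, (int)sizeof(my_obj), MPI_BYTE,
>                      objs, (int)sizeof(my_obj), MPI_BYTE, comm);
>        /* RMA_xfer(..., &objs[dest], dest, ...); */
>    }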
>
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
> From: keith.d.underwood at intel.com
> To: mpi3-rma at lists.mpi-forum.org
> Date: Tue, 1 Sep 2009 09:07:41 -0600
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> If we take the SINGLE_RMA_INTERFACE_DRAFT_PROPOSAL as an example
> and combine it with draft design goal #1, “In order to support RMA
> to arbitrary locations, no constraints on memory, such as symmetric
> allocation or collective window creation, can be required,” we get
> an interesting view on how difficult it can be to get “close to the
> metal”. So, for MPI_RMA_xfer, we have to assume that the
> user has some array of target_mem data items. That means the
> sequence of steps in user space is:
>
> target_mem = ranks[dest];
> MPI_RMA_xfer(… target_mem, dest…);
>
> If we assume that the message sizes are small and the destinations
> randomly selected and the machine is large… every access to ranks is
> a cache miss, and we cannot prevent that by providing fancy
> hardware. This actually leads me to believe that we may need to
> reconsider design goal #1, or at least clarify what it means in a
> way that makes the access more efficient.
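>
> For concreteness, the access pattern I have in mind is roughly the
> following; target_mem_t and MPI_RMA_xfer are only the names from the
> draft proposal, declared here so the loop reads as C:
>
>    #include <stdint.h>
>    #include <stdlib.h>
>
>    typedef struct { uint64_t base; uint32_t key; } target_mem_t;  /* placeholder */
>    void MPI_RMA_xfer(const void *src, size_t len, target_mem_t tm, int dest);
>
>    void random_puts(target_mem_t *ranks, int nranks, long *val, int iters)
>    {
>        for (int i = 0; i < iters; i++) {
>            int dest = rand() % nranks;      /* uniformly random target */
>            target_mem_t tm = ranks[dest];   /* ~1 cache miss per put once
>                                                the table outgrows the cache */
>            MPI_RMA_xfer(val, sizeof *val, tm, dest);
>        }
>    }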
>
> MPI_RMA_xfer itself is no picnic either. If we take draft design
> goal #5, “The RMA model must support non-cache-coherent and
> heterogeneous environments,” then MPI is required to maintain a data
> structure for every rank (ok, it has to do this anyway, but we are
> trying to get close to the metal) and do a lookup into that data
> structure with every MPI_RMA_xfer to find out whether the target
> rank is heterogeneous relative to the local rank – another cache miss.
> Now, nominally, since this is inside MPI, a lower layer could absorb
> that check… or, a given MPI could refuse to support heterogeneity
> or… but, you get the idea.
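>
> A rough sketch of that second lookup, as it might sit inside the
> library (names invented for illustration):
>
>    /* Per-rank state that design goal #5 forces onto the critical path. */
>    typedef struct { unsigned char needs_conversion; /* ... */ } rank_info;
>    extern rank_info *rank_table;          /* one entry per rank */
>
>    static int target_needs_conversion(int dest)
>    {
>        return rank_table[dest].needs_conversion;   /* the second cache miss */
>    }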
>
> So, we’ve got two cache line loads for every transfer. One in the
> application and one in the MPI library. One is impossible to move
> to the hardware and the other is simply very difficult to move.
>
> For a contrast, look at SHMEM. Assume homogeneous, only one
> communicator context, and hardware mapping of ranks to physical
> locations. A shmem_put() of a short item could literally be turned
> into a few instructions and a processor store (on machines that
> supported such things). Personally, I think we will have done well
> if we can get to the point that a reasonable hardware implementation
> can get MPI RMA to within 2x of a reasonable SHMEM implementation.
> I think we have a long way to go to get there.
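>
> For comparison, the SHMEM side at the source level; with symmetric
> addresses and hardware rank mapping, the put below can plausibly
> compile down to little more than an address computation and a store
> (shmem_long_put is the standard SHMEM call; the rest is a trivial
> example):
>
>    #include <shmem.h>
>
>    long counter = 0;   /* symmetric: lives at the same address on every PE */
>
>    void bump_remote(int pe)
>    {
>        long one = 1;
>        shmem_long_put(&counter, &one, 1, pe);   /* one 8-byte put to PE pe */
>    }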
>
> Keith
>
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org
> ] On Behalf Of Vinod tipparaju
> Sent: Tuesday, September 01, 2009 5:23 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> Good points! RMA interfaces should do nothing to prevent utilizing a
> high message rate (or low overhead communication) that the
> underlying hardware may offer. To ensure this happens, there should
> always be an unrestricted path (let's call it that for now; people
> have called it a "thin layer" or "direct access") to the network.
>
> This means that, despite the fact that the RMA interface has features
> that abstract out complexity by providing useful characteristics such
> as ordering and atomicity, it (the RMA interface) should always have
> this unrestricted, close-to-the-hardware path. To achieve this, the
> unrestricted path should not require any bookkeeping (from the
> implementation's perspective) in relation to the feature-rich path,
> or vice versa.
>
> I believe this is what we have demonstrated with the example
> interfaces, hence the null set isn't the case here :-). I will
> distribute an example implementation very soon so people can get a
> feel for it.
>
> ---
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
>
>
> > From: keith.d.underwood at intel.com
> > To: mpi3-rma at lists.mpi-forum.org
> > Date: Mon, 31 Aug 2009 16:17:28 -0600
> > Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
> >
> > There has been stunning silence since this email, so I will go
> > ahead and toss out a thought...
> >
> > In the draft design goals, I don't see two issues that I consider
> > key. The first is "support for high message rate/low overhead
> > communications to random targets". As best I can tell, this is one
> > of the key places where the existing one-sided operations are
> > perceived as falling down for existing customers of SHMEM/PGAS. The
> > second is "elimination of the access epoch requirement". This one
> > may be, um, more controversial, but I believe it is part and parcel
> > with the first one. That is, the first one is not that valuable if
> > the programming model requires an excessive amount of access epoch
> > opens and closes just to force the global visibility of the
> > operations. Unfortunately, the intersection of this solution space
> > with the solution space for the current draft design goal #5
> > (support non-cache-coherent and heterogeneous environments) may be
> > the null set... I will hold out hope that this isn't the case ;-)
> >
> > Keith
> >
> > > -----Original Message-----
> > > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > > bounces at lists.mpi-forum.org] On Behalf Of William Gropp
> > > Sent: Wednesday, August 05, 2009 12:37 PM
> > > To: mpi3-rma at lists.mpi-forum.org
> > > Subject: [Mpi3-rma] MPI3 RMA Design Goals
> > >
> > > I've added versions of the RMA design goals that we discussed at
> > > the Forum meeting last week to the wiki page for our group (
> > > https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/RmaWikiPage
> > > ). This is a draft; let's discuss these. Also, feel free to add to
> > > the discussion, particularly in the background section.
> > >
> > > Bill
> > >
> > > William Gropp
> > > Deputy Director for Research
> > > Institute for Advanced Computing Applications and Technologies
> > > Paul and Cynthia Saylor Professor of Computer Science
> > > University of Illinois Urbana-Champaign
> > >
> > >
> > >
> > >
>
> William Gropp
> Deputy Director for Research
> Institute for Advanced Computing Applications and Technologies
> Paul and Cynthia Saylor Professor of Computer Science
> University of Illinois Urbana-Champaign
>
>
>
>
>
William Gropp
Deputy Director for Research
Institute for Advanced Computing Applications and Technologies
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign