[Mpi3-rma] MPI3 RMA Design Goals

William Gropp wgropp at illinois.edu
Sat Sep 5 05:35:24 CDT 2009


One other possibility is to provide a separate path for "comm_world"  
access - something like the MPI_WIN_WORLD in ticket #139.

I've also added a note to the page to remind readers that requiring a  
feature does not prohibit supporting a specialization.

Bill

On Sep 4, 2009, at 11:11 AM, Underwood, Keith D wrote:

> I agree that the issue is fundamental on some networks, but it is  
> not going to be fundamental across all networks.  I’m more worried  
> about the “not broken” networks ;-)  Actually, I’m more worried  
> about the ability to build a not-broken network for the purpose.     
> And, having a collective allocation fixes most of the issues, but it  
> does leave some open questions.  I would like us to update the  
> design goal to acknowledge that collective allocation is allowed,  
> but not required.
>
> Anyway, as far as what I’m worried about, I’ll use a very  
> specialized example to make the point...
>
> So, I’ll start by assuming shmem, Portals 3.3, a lightweight kernel,  
> and hardware optimized to try to do that.  I apologize to those not  
> versed in Portals ;-)
>
> Let’s say that your shmem uses Portals to expose the entire virtual  
> address space with an ME.  The ME is persistent (never changes) and  
> is the only item on a given portal table entry.  The hardware takes  
> the head of that list and caches it in an associative matching  
> structure.  Now you can process a message from the network every  
> cycle…  Oh, the hardware has to do virtual to physical address  
> translation and it had better do protection based on whether a  
> virtual page is physically backed or not, but, wait, I can constrain  
> the kernel to always guarantee that ;-)
>
> At the transmitter, let’s say that you can push a (PE, portal table  
> entry, offset) tuple as the target address directly to the hardware  
> somehow… the T3E did something like this using E-registers, a  
> centrifuge, and a machine partitioning approach.  Ok, that takes a  
> change to the Portals API, but we did that for Portals 4.  Now you  
> can push messages into the network very quickly too (PE is in the  
> shmem call, portal table entry is a constant, offset is in virtual  
> address space and is symmetric across nodes, so it is one or two  
> arithmetic instructions).
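>
> For concreteness, a minimal sketch of that per-put address math under  
> these assumptions (every name below is hypothetical, not actual Portals  
> or T3E API):
>
>   /* Target of a symmetric put: (PE, portal table entry, offset). */
>   uintptr_t offset = (uintptr_t)dst - (uintptr_t)symmetric_heap_base;
>   push_to_nic(pe,               /* PE comes straight from the shmem call */
>               SHMEM_PT_INDEX,   /* portal table entry: a constant        */
>               offset,           /* one subtract, symmetric across nodes  */
>               src, nbytes);     /* local source buffer and length        */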
>
> And, of course, there are platforms like the Cray X1 that had even  
> better hardware support for this type of operation.
>
> Ok, so, how does the proposal we are discussing for MPI differ?
> 1) There is a communicator context.  How much work this is depends on  
> how much you need from that context.
> 2) There is a “memory object”.  How many of these are there?  Is there  
> anything that strongly encourages the user to keep it relatively  
> small?  If not, how do I build good hardware support?  I can’t have an  
> unlimited number of unique system-wide identifiers.
> 3) We have the nagging issue of non-coherent architectures and  
> heterogeneous architectures.  Ok, that one is probably workable,  
> since vendor-specific MPI implementations may drop both if their  
> architecture is coherent and homogeneous.  Of course, if it causes  
> us to do something weird with the completion semantics (a la the  
> ability to defer all actual data transfer to the end of an access  
> epoch in the current MPI-2 one-sided operations), that could be an issue.
>
> So, the telling issue will be: if you put this proposal on a Cray  
> X1 (does ORNL still have one of those running?), how does it compare  
> to the message rate of shmem on the same platform?  Perhaps doing such  
> a prototype would help us advance the discussion more than  
> anything.  We would have a baseline where we could say “what would  
> it take to get competitive?”.  Unfortunately, I don’t know of many  
> other places that currently support shmem well enough to make a good  
> comparison.
>
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org 
> ] On Behalf Of Vinod tipparaju
> Sent: Friday, September 04, 2009 9:41 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> hence the word "similar" to MDBind.  MDBind can potentially do a lot  
> more as far as preparing the network goes.  Binding is for reducing  
> latency, for no other reason.  I agree that in your example of random  
> target communication, it is not useful.
>
> You are right, and I have always agreed, that when you do need such a  
> table of mem-object data structures, or a table of pointers under the  
> hood of the implementation of collective mem-object data structures,  
> random accesses will incur cache misses and latency.  I get that.  My  
> points are (please let me know if you disagree with any of the bullets):
> 1) the problem doesn't exist if collective memobjs are used and the  
> memobjs can internally allocate symmetric memory (same pointers and  
> same "keys" across the system).
> 2) This is a more fundamental problem associated with whether it is  
> possible to allocate symmetric memory and corresponding symmetric  
> key-ids on networks that require them.
> 3) this problem is the same for shmem and for these interfaces.
> 4) this problem of cache misses for random target communication calls  
> will occur if: a) the implementation of a collective object requires  
> an array of keys or memory region identifiers for the sake of  
> communication and creation of a single key is not possible, b) an  
> array of pointers is required on this system instead of one symmetric  
> pointer, and c) if, say, OFED has different keys on each node  
> and the user cannot pass in the number that he/she wants to be the key  
> value -- there is nothing you, I, or any interface (the "users")  
> can do to fix this.
> To me it seems like we are discussing a case that is a fundamental  
> problem we don’t have control over.  I cannot see how you would define  
> an interface that will not have cache misses for random target  
> communication in case 4).
>
>
> So the question is: are these interfaces causing cache misses that  
> could otherwise be avoided?  I don’t think so.  Do you?  If you  
> agree, we have concluded the discussion.
>
> --
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
>
>
> From: keith.d.underwood at intel.com
> To: mpi3-rma at lists.mpi-forum.org
> Date: Fri, 4 Sep 2009 08:51:12 -0600
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> Bill’s proposal was for a persistent handle for transmit operations  
> that included things like target rank.  I think there is some merit  
> in that, though we need to evaluate the trade-offs.  Unfortunately,  
> that does not really do anything for the random communication  
> scenario.   In the non-random communication scenarios, I’m a lot  
> less worried about things like random lookups of local data  
> structures getting in the way.
>
> Bill’s proposal was nothing like MDBind in Portals, and MDBind does  
> nothing to help the issue I was concerned about.  Specifically,  
> MDBind only associates local options and a local memory region with  
> a handle.  It says nothing about the target node or the target  
> memory region.  It is the lookup of information associated with the  
> target node and target memory region that I am worried about and  
> that is addressed by Bill’s proposal for the non-random access case.
>
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org 
> ] On Behalf Of Vinod tipparaju
> Sent: Thursday, September 03, 2009 10:39 PM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> I forgot to include an important parameter (communicator) in the  
> pseudo interface below:
> Create_memobj_collective(IN user_ptr, IN size, IN communicator, OUT  
> mem_obj)
>
> In addition to this, Bill suggested a Bind interface (something similar  
> to MDBind in Portals) that would help reduce latency for commonly re- 
> used RMAs.
>
>
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
>
> From: wgropp at illinois.edu
> To: mpi3-rma at lists.mpi-forum.org
> Date: Thu, 3 Sep 2009 13:20:38 -0500
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> Design goal one allows collective creation of objects; it's there  
> because many important algorithms don't have collective (over  
> MPI_COMM_WORLD) allocation semantics, and a design that *requires*  
> collective creation of memory objects will also limit the use of the  
> interface.
>
> Bill
>
> On Sep 3, 2009, at 10:49 AM, Underwood, Keith D wrote:
>
> My commentary was on the design goals…  if we allow collective  
> creation of memory objects, and design goal #1 simply says we don’t  
> require it, that may be ok.  Design goal #1 could be interpreted to  
> mean that you wouldn’t have collective creation in the semantic at  
> all.  Do you really anticipate one data type for an object that is  
> either collectively or non-collectively created?
>
> I strongly disagree with your assertion that you can communicate  
> with no cache misses for the non-collectively allocated memory  
> object.  In a non-collectively allocated case, you will have to keep  
> an array of these on every process, right?  i.e. one for every  
> process you are communicating with?  Randomly indexing that array is  
> going to pound on your cache.
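>
> A tiny sketch of the access pattern I mean (the types and the RMA call  
> are hypothetical stand-ins for whatever the proposal ends up naming):
>
>   mem_obj objs[NPROCS];                 /* one per peer, filled in after
>                                            the object exchange          */
>   int dest = pick_random_rank(NPROCS);
>   RMA_xfer(src_buf, nbytes, objs[dest], dest);  /* objs[dest] is a cold
>                                                    cache line           */
>
> For a large process count and a random dest, that objs[dest] load  
> misses the cache on essentially every transfer.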
>
> We need to make sure that we don’t ignore the overhead of having  
> multiple communicators and heterogeneity.  Yes, I think there are  
> ways around this, but we should at least consider what is practical  
> and likely rather than just what is possible.
>
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org 
> ] On Behalf Of Vinod tipparaju
> Sent: Tuesday, September 01, 2009 10:15 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
>
> You are correct in trying to look at the best possible case and  
> estimating cache-misses/performance-bottlenecks.  However, I personally  
> don't see any difference between this and shmem.  When you cannot  
> really allocate symmetric memory underneath, the amount of  
> bookkeeping is the same in both cases.  When there is no heterogeneity,  
> the check for this can be disabled at MPI startup.  When there is  
> heterogeneity, we cannot compare with shmem.
>
> I cannot imagine not having symmetric/collective memory object  
> creation to support these RMA interfaces; I think it is a must-have.  
> Sorry, I have only been saying we should have these interfaces but  
> haven't given any example yet.  Given how many times this same issue  
> has come up, I will do it now.
>
> Consider the creation interfaces:
> Create_memobj(IN user_ptr, IN size, OUT mem_obj)
> Create_memobj_collective(IN user_ptr, IN size, OUT mem_obj)
> Assign_memobj(IN/OUT mem_obj, IN user_address, IN size)
>
> There will be more details on how a mem object that is the result of  
> Create_memobj on process A gets exchanged with process B.  When it  
> is exchanged explicitly, the heterogeneity information can be  
> created at process B.
>
> Now take the example with symmetric object:
>
> Process A
>
> myptr = allocate(mysize);
> Create_memobj_collective(myptr,mysize, all_obj);
> Do all kinds of RMA_Xfers
>
> and an example without symmetric object:
>
> myptr = allocate(mysize);
> Create_memobj(myptr,mysize,my_obj);
>  ----exchange objects here----
> do all kinds of RMA_Xfers
>
> In both cases, I can see being able to communicate without any cache  
> misses for mem_obj.
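>
> As a rough C-flavored rendering of the two flows (the prototypes are  
> hypothetical translations of the pseudo interfaces above, and treating  
> a mem_obj as plain bytes for the exchange is an assumption):
>
>   void *myptr = malloc(mysize);
>   mem_obj all_obj, my_obj;
>   mem_obj *objs = malloc(nprocs * sizeof(mem_obj));
>
>   /* Symmetric flow: one collectively created object covers everyone. */
>   Create_memobj_collective(myptr, mysize, &all_obj);
>
>   /* Non-collective flow: create locally, then exchange explicitly. */
>   Create_memobj(myptr, mysize, &my_obj);
>   MPI_Allgather(&my_obj, sizeof(mem_obj), MPI_BYTE,
>                 objs, sizeof(mem_obj), MPI_BYTE, MPI_COMM_WORLD);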
>
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
> From: keith.d.underwood at intel.com
> To: mpi3-rma at lists.mpi-forum.org
> Date: Tue, 1 Sep 2009 09:07:41 -0600
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> If we take the SINGLE_RMA_INTERFACE_DRAFT_PROPOSAL as an example,  
> and combine it with the draft design goal #1: “In order to support  
> RMA to arbitrary locations, no constraints on memory, such as  
> symmetric allocation or collective window creation, can be required”...
>
> We get an interesting view on how difficult it can be to get “close  
> to the metal”.  So, for MPI_RMA_xfer, we have to assume that the  
> user has some array of target_mem data items.  That means the  
> sequence of steps in user space is:
>
> target_mem = ranks[dest];
> MPI_RMA_xfer(… target_mem, dest…);
>
> If we assume that the message sizes are small and the destinations  
> randomly selected and the machine is large… every access to ranks is  
> a cache miss, and we cannot prevent that by providing fancy  
> hardware.  This actually leads me to believe that we may need to  
> reconsider design goal #1, or at least clarify what it means in a  
> way that makes the access more efficient.
>
> MPI_RMA_xfer itself is no picnic either.  If we take the draft  
> design goal #5: The RMA model must support non-cache-coherent and  
> heterogeneous environments, then MPI is required to maintain a data  
> structure for every rank (ok, it has to do this anyway, but we are  
> trying to get close to the metal) and do a lookup into that data  
> structure with every MPI_RMA_xfer to find out if the target rank is  
> heterogeneous relative to the origin – another cache miss.   
> Now, nominally, since this is inside MPI, a lower layer could absorb  
> that check… or, a given MPI could refuse to support heterogeneity  
> or… but, you get the idea.
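>
> A minimal sketch of that library-side check (all names hypothetical;  
> this is not a real MPI internal):
>
>   /* Inside a hypothetical MPI_RMA_xfer implementation: */
>   rank_info *ri = &comm->ranks[dest];   /* the second cold cache line */
>   if (ri->needs_conversion)             /* heterogeneous target?      */
>       convert_buffer(buf, nbytes, ri);
>   issue_put(ri, target_mem, buf, nbytes);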
>
> So, we’ve got two cache line loads for every transfer.  One in the  
> application and one in the MPI library.  One is impossible to move  
> to the hardware and the other is simply very difficult to move.
>
> For a contrast, look at SHMEM.  Assume homogeneous, only one  
> communicator context, and hardware mapping of ranks to physical  
> locations.  A shmem_put() of a short item could literally be turned  
> into a few instructions and a processor store (on machines that  
> supported such things).  Personally, I think we will have done well  
> if we can get to the point that a reasonable hardware implementation  
> can get MPI RMA to within 2x of a reasonable SHMEM implementation.   
> I think we have a long way to go to get there.
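>
> For reference, the comparison point is the standard SHMEM call (this  
> is the real SGI SHMEM API, shown only to mark the baseline):
>
>   #include <shmem.h>
>   long value = 42;
>   /* With symmetric addressing and a hardware rank-to-node mapping,
>      this can lower to a little address arithmetic plus one
>      store-like network operation. */
>   shmem_long_p(remote_dst, value, pe);  /* put one long to PE 'pe' */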
>
> Keith
>
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org 
> ] On Behalf Of Vinod tipparaju
> Sent: Tuesday, September 01, 2009 5:23 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> Good points!  RMA interfaces should do nothing to prevent utilizing a  
> high message rate (or low overhead communication) that the  
> underlying hardware may offer.  To ensure this happens, there should  
> always be an unrestricted path (let's call it that for now; people  
> have called it a "thin layer" or "direct access") to the network.
>
> This means that, despite the fact that the RMA interface has features  
> that abstract out complexity by providing useful characteristics such  
> as ordering and atomicity, it (the RMA interface) should always have  
> this unrestricted, close-to-the-hardware path.  To achieve this, the  
> unrestricted path should not require any bookkeeping (from the  
> implementation's perspective) in relation to the feature- 
> rich path, or vice versa.
>
> I believe this is what we have demonstrated with the example  
> interfaces, hence the null set isn't the case here :-).  I will  
> distribute an example implementation very soon so people can get a  
> feel for it.
>
> ---
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
>
>
> > From: keith.d.underwood at intel.com
> > To: mpi3-rma at lists.mpi-forum.org
> > Date: Mon, 31 Aug 2009 16:17:28 -0600
> > Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
> >
> > There has been stunning silence since this email, so I will go  
> ahead and toss out a thought...
> >
> > In the draft design goals, I don't see two issues that I see as  
> key. The first is "support for high message rate/low overhead  
> communications to random targets". As best I can tell, this is one  
> of the key places where the existing one-sided operations are  
> perceived as falling down for existing customers of SHMEM/PGAS. The  
> second is "elimination of the access epoch requirement". This one  
> may be, um, more controversial, but I believe it is part and parcel  
> with the first one. That is, the first one is not that valuable if  
> the programming model requires an excessive amount of access epoch  
> opens and closes just to force the global visibility of the  
> operations. Unfortunately, the intersection of this solution space  
> with the solution space for the current draft design goal #5  
> (support non-cache-coherent and heterogeneous environments) may be  
> the null set... I will hold out hope that this isn't the case ;-)
> >
> > Keith
> >
> > > -----Original Message-----
> > > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-
> > > bounces at lists.mpi-forum.org] On Behalf Of William Gropp
> > > Sent: Wednesday, August 05, 2009 12:37 PM
> > > To: mpi3-rma at lists.mpi-forum.org
> > > Subject: [Mpi3-rma] MPI3 RMA Design Goals
> > >
> > > I've added versions of the RMA design goals that we discussed at  
> the
> > > Forum meeting last week to the wiki page for our group (
> > > https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/RmaWikiPage
> > > ). This is a draft; let's discuss these. Also, feel free to add to
> > > the discussion, particularly in the background section.
> > >
> > > Bill
> > >
> > > William Gropp
> > > Deputy Director for Research
> > > Institute for Advanced Computing Applications and Technologies
> > > Paul and Cynthia Saylor Professor of Computer Science
> > > University of Illinois Urbana-Champaign
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > mpi3-rma mailing list
> > > mpi3-rma at lists.mpi-forum.org
> > > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
> >
> > _______________________________________________
> > mpi3-rma mailing list
> > mpi3-rma at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>
> William Gropp
> Deputy Director for Research
> Institute for Advanced Computing Applications and Technologies
> Paul and Cynthia Saylor Professor of Computer Science
> University of Illinois Urbana-Champaign
>
>
>
>
>

William Gropp
Deputy Director for Research
Institute for Advanced Computing Applications and Technologies
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign



