[Mpi3-rma] MPI3 RMA Design Goals

William Gropp wgropp at illinois.edu
Thu Sep 3 13:20:38 CDT 2009


Design goal one still allows collective creation of objects; it's there
because many important algorithms don't have collective (over
MPI_COMM_WORLD) allocation semantics, and a design that *requires*
collective creation of memory objects would limit the use of the
interface.

Bill

On Sep 3, 2009, at 10:49 AM, Underwood, Keith D wrote:

> My commentary was on the design goals…  If we allow collective
> creation of memory objects, and design goal #1 simply says we don't
> require it, that may be ok.  Design goal #1 could be interpreted to
> mean that you wouldn't have collective creation in the semantics at
> all.  Do you really anticipate one data type for an object that is
> either collectively or non-collectively created?
>
> I strongly disagree with your assertion that you can communicate
> with no cache misses for the non-collectively allocated memory
> object.  In the non-collective case, you will have to keep an
> array of these objects on every process, right?  I.e., one for every
> process you are communicating with?  Randomly indexing that array is
> going to pound on your cache.
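>
> To make that concrete, here is a rough sketch of the access pattern
> (mem_obj_t, MPI_RMA_xfer, and random_rank are illustrative names
> only, not proposed calls):
>
>   /* one handle per peer, filled in by an explicit exchange */
>   mem_obj_t *peer_objs = malloc(nprocs * sizeof(mem_obj_t));
>   /* ... all-to-all exchange of locally created objects ... */
>   for (int i = 0; i < niters; i++) {
>       int dest = random_rank(nprocs);
>       /* small messages, random dest, large machine: peer_objs[dest]
>          is almost never resident -- one cache miss per transfer */
>       MPI_RMA_xfer(buf, len, peer_objs[dest], dest);
>   }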
>
> We need to make sure that we don't ignore the overhead of having
> multiple communicators and of heterogeneity.  Yes, I think there are
> ways around this, but we should at least consider what is practical  
> and likely rather than just what is possible.
>
> Keith
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Vinod tipparaju
> Sent: Tuesday, September 01, 2009 10:15 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
>
> You are correct in trying to look at the best possible case and
> estimating cache misses and performance bottlenecks.  However, I
> personally don't see any difference between this and shmem.  When you
> cannot really allocate symmetric memory underneath, the amount of
> bookkeeping is the same in both cases.  When there is no heterogeneity,
> the check for it can be disabled at MPI startup.  When there is
> heterogeneity, we cannot compare with shmem.
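>
> For instance, the disabled check could be as simple as a flag set
> once at startup (purely illustrative):
>
>   /* set once during MPI initialization */
>   static int job_is_homogeneous;
>
>   /* on the transfer path: perfectly predicted when homogeneous */
>   if (!job_is_homogeneous)
>       convert_to_target_rep(buf, len, target);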
>
> I cannot imagine not having symmetric/collective memory object
> creation to support these RMA interfaces; I think it is a must-have.
> Sorry, I have only been saying we should have these interfaces but
> haven't given any example yet.  Given how many times this
> same issue has come up, I will do it now.
>
> Consider the creation interfaces:
> Create_memobj(IN user_ptr, IN size, OUT mem_obj)
> Create_memobj_collective(IN user_ptr, IN size, OUT mem_obj)
> Assign_memobj(IN/OUT mem_obj, IN user_address, IN size)
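>
> In C terms, these might look something like the following (the names
> and types are illustrative only, not a concrete proposal):
>
>   typedef struct mem_obj *mem_obj_t;   /* opaque handle */
>
>   int Create_memobj(void *user_ptr, size_t size, mem_obj_t *mem_obj);
>   int Create_memobj_collective(void *user_ptr, size_t size,
>                                mem_obj_t *mem_obj);
>   int Assign_memobj(mem_obj_t *mem_obj, void *user_address, size_t size);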
>
> There will be more details on how a mem object that is the result of
> Create_memobj on process A gets exchanged with process B.  When it
> is exchanged explicitly, the heterogeneity information can be
> created at process B.
>
> Now take the example with a symmetric object:
>
> Process A
>
> myptr = allocate(mysize);
> Create_memobj_collective(myptr,mysize, all_obj);
> Do all kinds of RMA_Xfers
>
> and an example without a symmetric object:
>
> myptr = allocate(mysize);
> Create_memobj(myptr,mysize,my_obj);
>  ----exchange objects here----
> Do all kinds of RMA_Xfers
>
> In both cases, I can see being able to communicate without any cache  
> misses for mem_obj.
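>
> In rough C, with the same illustrative names as above (exchange_objs
> stands in for whatever explicit exchange mechanism gets defined):
>
>   /* collective creation: one handle covers all peers */
>   void *myptr = malloc(mysize);
>   mem_obj_t all_obj;
>   Create_memobj_collective(myptr, mysize, &all_obj);
>   RMA_Xfer(buf, len, all_obj, dest);
>
>   /* non-collective creation: handles are exchanged explicitly */
>   mem_obj_t my_obj, *peer_objs = malloc(nprocs * sizeof(mem_obj_t));
>   Create_memobj(myptr, mysize, &my_obj);
>   exchange_objs(&my_obj, peer_objs);   /* e.g., an allgather */
>   RMA_Xfer(buf, len, peer_objs[dest], dest);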
>
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
>
>
> From: keith.d.underwood at intel.com
> To: mpi3-rma at lists.mpi-forum.org
> Date: Tue, 1 Sep 2009 09:07:41 -0600
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> If we take the SINGLE_RMA_INTERFACE_DRAFT_PROPOSAL as an example
> and combine it with draft design goal #1 ("In order to support
> RMA to arbitrary locations, no constraints on memory, such as
> symmetric allocation or collective window creation, can be required"),
> we get an interesting view of how difficult it can be to get “close
> to the metal”.  So, for MPI_RMA_xfer, we have to assume that the
> user has some array of target_mem data items.  That means the
> sequence of steps in user space is:
>
> target_mem = ranks[dest];
> MPI_RMA_xfer(… target_mem, dest…);
>
> If we assume that the message sizes are small, the destinations are
> randomly selected, and the machine is large… every access to ranks is
> a cache miss, and we cannot prevent that by providing fancy  
> hardware.  This actually leads me to believe that we may need to  
> reconsider design goal #1, or at least clarify what it means in a  
> way that makes the access more efficient.
>
> MPI_RMA_xfer itself is no picnic either.  If we take the draft
> design goal #5 ("The RMA model must support non-cache-coherent and
> heterogeneous environments"), then MPI is required to maintain a data
> structure for every rank (ok, it has to do this anyway, but we are
> trying to get close to the metal) and do a lookup into that data
> structure on every MPI_RMA_xfer to find out whether the origin is
> heterogeneous relative to the target rank – another cache miss.
> Now, nominally, since this is inside MPI, a lower layer could absorb
> that check… or a given MPI could refuse to support heterogeneity,
> or… but you get the idea.
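>
> In other words, something like this inside the library (the table
> and field names are illustrative only):
>
>   typedef struct { int needs_conversion; /* ... */ } rank_info_t;
>   static rank_info_t *rank_table;   /* one entry per rank */
>
>   int MPI_RMA_xfer(const void *buf, size_t len, mem_obj_t obj,
>                    int target)
>   {
>       /* the lookup in question: indexed by a random target rank,
>          so this load is the second likely cache miss */
>       const void *src = buf;
>       if (rank_table[target].needs_conversion)
>           src = convert_to_target_rep(buf, len, target);
>       return issue_put(src, len, obj, target);
>   }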
>
> So, we've got two cache-line loads for every transfer: one in the
> application and one in the MPI library.  One is impossible to move
> to the hardware, and the other is simply very difficult to move.
>
> For contrast, look at SHMEM.  Assume a homogeneous system, only one
> communicator context, and hardware mapping of ranks to physical
> locations.  A shmem_put() of a short item could literally be turned
> into a few instructions and a processor store (on machines that  
> supported such things).  Personally, I think we will have done well  
> if we can get to the point that a reasonable hardware implementation  
> can get MPI RMA to within 2x of a reasonable SHMEM implementation.   
> I think we have a long way to go to get there.
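>
> For reference, the minimal SHMEM path under those assumptions is the
> standard elemental put; the symmetric heap is what makes the remote
> address computable without any per-target lookup:
>
>   #include <shmem.h>
>
>   long dest_sym;   /* symmetric: same address offset on every PE */
>
>   /* no handle array, no per-rank table: a few instructions and,
>      on suitable hardware, a single processor store */
>   shmem_long_p(&dest_sym, 42L, target_pe);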
>
> Keith
>
>
> From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of Vinod tipparaju
> Sent: Tuesday, September 01, 2009 5:23 AM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
>
> Good points!  RMA interfaces should do nothing to prevent utilizing a
> high message rate (or low-overhead communication) that the
> underlying hardware may offer.  To ensure this happens, there should
> always be an unrestricted path (let's call it that for now; people
> have called it a "thin layer" or "direct access") to the network.
>
> This means that, despite the fact that the RMA interface has features
> that abstract out complexity by providing useful characteristics such
> as ordering and atomicity, it (the RMA interface) should always have
> this unrestricted path, close to the heart of the hardware.  To
> achieve this, the unrestricted path should not require any
> bookkeeping (from an implementation perspective) in relation to the
> feature-rich path, or vice versa.
>
> I believe this is what we have demonstrated with the example
> interfaces, hence the null set isn't the case here :-).  I will
> distribute an example implementation very soon so people can get a
> feel for it.
>
> ---
> Vinod Tipparaju ^ http://ft.ornl.gov/~vinod ^ 1-865-241-1802
>
>
>
> > From: keith.d.underwood at intel.com
> > To: mpi3-rma at lists.mpi-forum.org
> > Date: Mon, 31 Aug 2009 16:17:28 -0600
> > Subject: Re: [Mpi3-rma] MPI3 RMA Design Goals
> >
> > There has been stunning silence since this email, so I will go
> > ahead and toss out a thought...
> >
> > In the draft design goals, I don't see two issues that I see as
> > key.  The first is "support for high message rate / low overhead
> > communications to random targets".  As best I can tell, this is one
> > of the key places where the existing one-sided operations are
> > perceived as falling down for existing customers of SHMEM/PGAS.  The
> > second is "elimination of the access epoch requirement".  This one
> > may be, um, more controversial, but I believe it is part and parcel
> > of the first one.  That is, the first one is not that valuable if
> > the programming model requires an excessive number of access-epoch
> > opens and closes just to force the global visibility of the
> > operations.  Unfortunately, the intersection of this solution space
> > with the solution space for the current draft design goal #5
> > (support non-cache-coherent and heterogeneous environments) may be
> > the null set...  I will hold out hope that this isn't the case ;-)
> >
> > Keith
> >
> > > -----Original Message-----
> > > From: mpi3-rma-bounces at lists.mpi-forum.org [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of William Gropp
> > > Sent: Wednesday, August 05, 2009 12:37 PM
> > > To: mpi3-rma at lists.mpi-forum.org
> > > Subject: [Mpi3-rma] MPI3 RMA Design Goals
> > >
> > > I've added versions of the RMA design goals that we discussed at
> > > the Forum meeting last week to the wiki page for our group (
> > > https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/RmaWikiPage
> > > ). This is a draft; let's discuss these. Also, feel free to add to
> > > the discussion, particularly in the background section.
> > >
> > > Bill
> > >
> > > William Gropp
> > > Deputy Director for Research
> > > Institute for Advanced Computing Applications and Technologies
> > > Paul and Cynthia Saylor Professor of Computer Science
> > > University of Illinois Urbana-Champaign
> > >
> > >
> > >
> > >

William Gropp
Deputy Director for Research
Institute for Advanced Computing Applications and Technologies
Paul and Cynthia Saylor Professor of Computer Science
University of Illinois Urbana-Champaign



