[mpiwg-rma] shared-like access within a node with non-shared windows

Pavan Balaji balaji at mcs.anl.gov
Fri Oct 18 16:40:00 CDT 2013


Ok, can I rephrase what you want as follows --

MPI_WIN_ALLOCATE(info = gimme_shared_memory)

This will return a window where you *might* be able to do direct load/store into some of the remote processes' address spaces (let's call this "direct access memory").

MPI_WIN_SHARED_QUERY will tell the user, through an appropriate error code, whether a given remote process gives you direct access memory.

An MPI implementation is allowed to ignore the info argument, in which case MPI_WIN_SHARED_QUERY will return an error for all target processes.

Does that sound right?

I guess the benefit of this compared to MPI_WIN_ALLOCATE_SHARED + MPI_WIN_CREATE is that the MPI implementation can allocate memory more intelligently.  For example, it might create a symmetric address space across nodes and use shared memory within each node.  This is particularly useful when the allocation sizes on all processes are the same.
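
To make the intended usage concrete, here is a minimal sketch from the user's side.  The "gimme_shared_memory" info key is hypothetical, and the per-target query behavior is the proposed extension rather than anything the standard currently guarantees:

    #include <mpi.h>

    void example(MPI_Comm comm, MPI_Aint bytes)
    {
        MPI_Win  win;
        MPI_Info info;
        void    *base;
        int      nproc;

        MPI_Info_create(&info);
        MPI_Info_set(info, "gimme_shared_memory", "true");  /* hypothetical key */

        MPI_Win_allocate(bytes, 1, info, comm, &base, &win);
        MPI_Info_free(&info);

        /* Let the query return an error code instead of aborting. */
        MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);

        MPI_Comm_size(comm, &nproc);
        for (int rank = 0; rank < nproc; rank++) {
            MPI_Aint rsize;
            int      disp_unit;
            void    *rbase;

            /* Proposed semantics: an error (or size 0 / NULL baseptr) means
             * no direct load/store access to this target. */
            int rc = MPI_Win_shared_query(win, rank, &rsize, &disp_unit, &rbase);
            if (rc == MPI_SUCCESS && rbase != NULL) {
                /* rank's window memory is directly accessible by load/store */
            } else {
                /* fall back to MPI_Put / MPI_Get / MPI_Accumulate */
            }
        }

        MPI_Win_free(&win);
    }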

  -- Pavan

On Oct 18, 2013, at 4:31 PM, Jeff Hammond wrote:

> I believe that my assumptions about address translation are appropriately conservative given the wide portability goals of MPI. Assuming Cray Gemini or similar isn't reasonable. 
> 
> Exclusive lock isn't my primary interest but I'll pay the associated costs as necessary. 
> 
> As discussed with Brian, this is a hint via info. Implementations can skip it if they want. I merely want us to standardize the expanded use of shared_query that allows this to work. 
> 
> Jeff
> 
> Sent from my iPhone
> 
> On Oct 18, 2013, at 4:25 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> 
>> This is only correct if you assume that the remote NIC can't translate a displacement.
>> 
>> Allowing all processes on the node direct access to the window buffer will require us to perform memory barriers in window synchronizations, and it will cause all lock operations that target the same node to block.  I understand the value in providing this usage model, but what will be the performance cost?
>> 
>>  ~Jim.
>> 
>> 
>> On Fri, Oct 18, 2013 at 4:50 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>> It is impossible to do O(1) state with create unless you force the remote side to do all the translation, which precludes RDMA. If you want an RDMA implementation, create requires O(P) state; allocate does not.
>> 
>> I believe all of this was thoroughly discussed when we proposed allocate. 
>> 
>> Sent from my iPhone
>> 
>> On Oct 18, 2013, at 3:16 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>> 
>>> Why is MPI_Win_create not scalable?  There are certainly implementations and use cases (e.g., not using different disp_units) that can avoid O(P) metadata per process.  MPI_Win_allocate can likely avoid it in more cases, but that is not guaranteed either.  It seems like an implementation could leverage the Win_allocate_shared/Win_create combo to achieve the same scaling result as MPI_Win_allocate.
>>> 
>>> If we allow other window types to return pointers through win_shared_query, then we will have to perform memory barriers in all of the RMA synchronization routines all the time.
>>> 
>>>  ~Jim.
>>> 
>>> 
>>> On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>>> Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE is
>>> not scalable.  And it requires two windows instead of one.
>>> 
>>> Brian, Pavan and Xin all seem to agree that this is straightforward to
>>> implement as an optional feature.  We just need to figure out how to
>>> extend the use of MPI_WIN_SHARED_QUERY to enable it.
>>> 
>>> Jeff
>>> 
>>> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>>> > Jeff,
>>> >
>>> > Sorry, I haven't read the whole thread closely, so please ignore me if this
>>> > is nonsense.  Can you get what you want by doing MPI_Win_allocate_shared()
>>> > to create an intranode window, and then pass the buffer allocated by
>>> > MPI_Win_allocate_shared to MPI_Win_create() to create an internode window?
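>>> >
>>> > Something along these lines, as a rough sketch using only standard MPI-3 calls (names and sizes are illustrative):
>>> >
>>> >     #include <mpi.h>
>>> >
>>> >     void combo(MPI_Comm comm, MPI_Aint bytes)
>>> >     {
>>> >         MPI_Comm nodecomm;
>>> >         MPI_Win  shm_win, rma_win;
>>> >         void    *base;
>>> >
>>> >         /* Processes that can share memory with this one. */
>>> >         MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
>>> >                             MPI_INFO_NULL, &nodecomm);
>>> >
>>> >         /* Intranode window: load/store via MPI_Win_shared_query. */
>>> >         MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, nodecomm,
>>> >                                 &base, &shm_win);
>>> >
>>> >         /* Internode window over the same buffer, for Put/Get/Accumulate. */
>>> >         MPI_Win_create(base, bytes, 1, MPI_INFO_NULL, comm, &rma_win);
>>> >
>>> >         /* ... RMA epochs on rma_win; direct load/store within the node ... */
>>> >
>>> >         MPI_Win_free(&rma_win);
>>> >         MPI_Win_free(&shm_win);
>>> >         MPI_Comm_free(&nodecomm);
>>> >     }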
>>> >
>>> >  ~Jim.
>>> >
>>> >
>>> > On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond <jeff.science at gmail.com>
>>> > wrote:
>>> >>
>>> >> Pavan told me that (in MPICH) MPI_Win_allocate is way better than
>>> >> MPI_Win_create because the former allocates the shared memory
>>> >> business.  It was implied that the latter requires more work within
>>> >> the node. (I thought mmap could do the same magic on existing
>>> >> allocations, but that's not really the point here.)
>>> >>
>>> >> But within a node, what's even better than a window allocated with
>>> >> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
>>> >> since the latter permits load-store.  Then I wondered if it would be
>>> >> possible to have both (1) direct load-store access within a node and
>>> >> (2) scalable metadata for windows spanning many nodes.
>>> >>
>>> >> I can get (1) but not (2) by using MPI_Win_allocate_shared and then
>>> >> dropping a second window for the internode part on top of these using
>>> >> MPI_Win_create.  Of course, I can get (2) but not (1) using
>>> >> MPI_Win_allocate.
>>> >>
>>> >> I propose that it be possible to get (1) and (2) by allowing
>>> >> MPI_Win_shared_query to return pointers to shared memory within a node
>>> >> even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
>>> >> When the input argument rank to MPI_Win_shared_query corresponds to
>>> >> memory that is not accessible by load-store, the out arguments size
>>> >> and baseptr are 0 and NULL, respectively.
>>> >>
>>> >> The non-scalable use of this feature would be to loop over all ranks
>>> >> in the group associated with the window and test for baseptr!=NULL,
>>> >> while the scalable use would presumably utilize MPI_Comm_split_type,
>>> >> MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
>>> >> ranks corresponding to the node, hence the ones that might permit
>>> >> direct access.
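>>> >>
>>> >> For concreteness, a sketch of the scalable variant, assuming the proposed extension of MPI_Win_shared_query (this is not valid on a FLAVOR_ALLOCATE window under MPI-3 as written; the function name is illustrative):
>>> >>
>>> >>     #include <mpi.h>
>>> >>     #include <stdlib.h>
>>> >>
>>> >>     void query_node_ranks(MPI_Comm comm, MPI_Win win)
>>> >>     {
>>> >>         MPI_Comm  nodecomm;
>>> >>         MPI_Group nodegroup, wingroup;
>>> >>         int       nnode;
>>> >>
>>> >>         MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
>>> >>                             MPI_INFO_NULL, &nodecomm);
>>> >>         MPI_Comm_size(nodecomm, &nnode);
>>> >>         MPI_Comm_group(nodecomm, &nodegroup);
>>> >>         MPI_Win_get_group(win, &wingroup);
>>> >>
>>> >>         /* Translate node-local ranks into ranks of the window's group. */
>>> >>         int *noderanks = malloc(nnode * sizeof(int));
>>> >>         int *winranks  = malloc(nnode * sizeof(int));
>>> >>         for (int i = 0; i < nnode; i++) noderanks[i] = i;
>>> >>         MPI_Group_translate_ranks(nodegroup, nnode, noderanks,
>>> >>                                   wingroup, winranks);
>>> >>
>>> >>         for (int i = 0; i < nnode; i++) {
>>> >>             MPI_Aint size;
>>> >>             int      disp_unit;
>>> >>             void    *baseptr;
>>> >>             MPI_Win_shared_query(win, winranks[i], &size, &disp_unit, &baseptr);
>>> >>             if (baseptr != NULL) {
>>> >>                 /* direct load/store to this rank's window memory */
>>> >>             }
>>> >>         }
>>> >>
>>> >>         free(noderanks); free(winranks);
>>> >>         MPI_Group_free(&nodegroup);
>>> >>         MPI_Group_free(&wingroup);
>>> >>         MPI_Comm_free(&nodecomm);
>>> >>     }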
>>> >>
>>> >> Comments are appreciated.
>>> >>
>>> >> Jeff
>>> >>
>>> >> --
>>> >> Jeff Hammond
>>> >> jeff.science at gmail.com
>>> >
>>> >
>>> >
>>> 
>>> 
>>> 
>>> --
>>> Jeff Hammond
>>> jeff.science at gmail.com
>> 

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji



