[mpiwg-rma] shared-like access within a node with non-shared windows

Jim Dinan james.dinan at gmail.com
Fri Oct 18 16:25:12 CDT 2013


This is only correct if you assume that the remote NIC can't translate a
displacement.

Allowing all processes on the node direct access to the window buffer will
require us to perform memory barriers in window synchronizations, and it
will cause all lock operations that target the same node to block.  I
understand the value in providing this usage model, but what will be the
performance cost?
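
For concreteness, the cost in question would land inside the synchronization
calls themselves.  A minimal illustration (hypothetical internal helper, not
part of any MPI API) of the kind of ordering point an implementation would
need if node-local peers can load/store the window buffer directly:

    #include <stdatomic.h>

    /* Hypothetical internal helper: at each RMA synchronization point,
     * make this process's stores to the window visible to node-local
     * peers that access it by load/store, and vice versa. */
    static inline void demo_rma_sync_fence(void)
    {
        atomic_thread_fence(memory_order_seq_cst);
    }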

 ~Jim.


On Fri, Oct 18, 2013 at 4:50 PM, Jeff Hammond <jeff.science at gmail.com> wrote:

> It is impossible to get O(1) state with create unless you force the remote
> side to do all the translation, which precludes RDMA.  If you want an RDMA
> implementation, create requires O(P) state.  Allocate does not require it.
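>
> For illustration, the O(P) state with create comes from the origin needing
> every target's base address (and registration key, and possibly disp_unit)
> to issue RDMA without involving the target CPU.  A hypothetical sketch of
> that per-window bookkeeping (all names made up):
>
>     #include <stdint.h>
>
>     /* Hypothetical per-process, per-window state for an RDMA-capable
>      * MPI_Win_create: one entry per rank in the window's group. */
>     struct demo_win_state {
>         uint64_t *base;      /* base[i]: remote address of rank i's buffer */
>         uint32_t *rkey;      /* rkey[i]: RDMA key for rank i's region      */
>         int      *disp_unit; /* disp_unit[i], if they differ per rank      */
>     };
>
>     /* Target address for (rank, disp): base[rank] + disp * disp_unit[rank].
>      * MPI_Win_allocate can arrange the memory so this reduces to O(1)
>      * per-process state (e.g., a symmetric allocation). */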
>
> I believe all of this was thoroughly discussed when we proposed allocate.
>
> Sent from my iPhone
>
> On Oct 18, 2013, at 3:16 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>
> Why is MPI_Win_create not scalable?  There are certainly implementations
> and use cases (e.g. not using different disp_units) that can avoid O(P)
> metadata per process.  It's more likely that MPI_Win_allocate can avoid
> this metadata in more cases, but it's not guaranteed.  It seems like an
> implementation could leverage the Win_allocate_shared/Win_create combo to
> achieve the same scaling result as MPI_Win_allocate.
>
> If we allow other window types to return pointers through
> win_shared_query, then we will have to perform memory barriers in all of
> the RMA synchronization routines all the time.
>
>  ~Jim.
>
>
> On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
>
>> Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE is
>> not scalable.  And it requires two windows instead of one.
>>
>> Brian, Pavan and Xin all seem to agree that this is straightforward to
>> implement as an optional feature.  We just need to figure out how to
>> extend the use of MPI_WIN_SHARED_QUERY to enable it.
>>
>> Jeff
>>
>> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>> > Jeff,
>> >
>> > Sorry, I haven't read the whole thread closely, so please ignore me if
>> > this is nonsense.  Can you get what you want by calling
>> > MPI_Win_allocate_shared() to create an intranode window and then passing
>> > the buffer it allocates to MPI_Win_create() to create an internode
>> > window?
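>> >
>> > For concreteness, a minimal sketch of that two-window combination
>> > (illustrative only; error checking omitted, and the function name and
>> > arguments are made up):
>> >
>> >     #include <mpi.h>
>> >
>> >     /* Shared-memory window for intranode load/store, plus a second
>> >      * window over the same buffer for internode RMA. */
>> >     void demo_two_windows(MPI_Comm comm, MPI_Aint nbytes)
>> >     {
>> >         MPI_Comm nodecomm;
>> >         MPI_Win  shmwin, win;
>> >         void    *buf;
>> >
>> >         MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
>> >                             &nodecomm);
>> >         MPI_Win_allocate_shared(nbytes, 1, MPI_INFO_NULL, nodecomm,
>> >                                 &buf, &shmwin);
>> >
>> >         /* Expose the same buffer to every process for internode RMA. */
>> >         MPI_Win_create(buf, nbytes, 1, MPI_INFO_NULL, comm, &win);
>> >
>> >         /* ... internode RMA through 'win'; node-local peers can get
>> >          * direct pointers via MPI_Win_shared_query on 'shmwin' ... */
>> >
>> >         MPI_Win_free(&win);      /* free before the shared memory is freed */
>> >         MPI_Win_free(&shmwin);
>> >         MPI_Comm_free(&nodecomm);
>> >     }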
>> >
>> >  ~Jim.
>> >
>> >
>> > On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond <jeff.science at gmail.com>
>> > wrote:
>> >>
>> >> Pavan told me that (in MPICH) MPI_Win_allocate is way better than
>> >> MPI_Win_create because the former allocates the shared memory
>> >> business.  It was implied that the latter requires more work within
>> >> the node. (I thought mmap could do the same magic on existing
>> >> allocations, but that's not really the point here.)
>> >>
>> >> But within a node, what's even better than a window allocated with
>> >> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
>> >> since the latter permits load-store.  Then I wondered if it would be
>> >> possible to have both (1) direct load-store access within a node and
>> >> (2) scalable metadata for windows spanning many nodes.
>> >>
>> >> I can get (1) but not (2) by using MPI_Win_allocate_shared and then
>> >> dropping a second window for the internode part on top of these using
>> >> MPI_Win_create.  Of course, I can get (2) but not (1) using
>> >> MPI_Win_allocate.
>> >>
>> >> I propose that it be possible to get (1) and (2) by allowing
>> >> MPI_Win_shared_query to return pointers to shared memory within a node
>> >> even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
>> >> When the input argument rank to MPI_Win_shared_query corresponds to
>> >> memory that is not accessible by load-store, the out arguments size
>> >> and baseptr are 0 and NULL, respectively.
>> >>
>> >> The non-scalable use of this feature would be to loop over all ranks
>> >> in the group associated with the window and test for baseptr!=NULL,
>> >> while the scalable use would presumably utilize MPI_Comm_split_type,
>> >> MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
>> >> ranks corresponding to the node, hence the ones that might permit
>> >> direct access.
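>> >>
>> >> For illustration, a sketch of the scalable pattern under the *proposed*
>> >> semantics (not valid for a FLAVOR_ALLOCATE window under the current
>> >> standard; the function name is made up and error checking is omitted):
>> >>
>> >>     #include <mpi.h>
>> >>     #include <stddef.h>
>> >>
>> >>     void demo_query_node_ranks(MPI_Comm comm, MPI_Win win)
>> >>     {
>> >>         MPI_Comm  nodecomm;
>> >>         MPI_Group wingroup, nodegroup;
>> >>         int       nnode, i;
>> >>
>> >>         /* Ranks on this node, and the window's group for translation. */
>> >>         MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
>> >>                             &nodecomm);
>> >>         MPI_Comm_size(nodecomm, &nnode);
>> >>         MPI_Comm_group(nodecomm, &nodegroup);
>> >>         MPI_Win_get_group(win, &wingroup);
>> >>
>> >>         for (i = 0; i < nnode; i++) {
>> >>             int      winrank, disp_unit;
>> >>             MPI_Aint size;
>> >>             void    *baseptr;
>> >>
>> >>             MPI_Group_translate_ranks(nodegroup, 1, &i, wingroup, &winrank);
>> >>             MPI_Win_shared_query(win, winrank, &size, &disp_unit, &baseptr);
>> >>             if (baseptr != NULL) {
>> >>                 /* Direct load/store access to winrank's window memory. */
>> >>             }
>> >>         }
>> >>
>> >>         MPI_Group_free(&nodegroup);
>> >>         MPI_Group_free(&wingroup);
>> >>         MPI_Comm_free(&nodecomm);
>> >>     }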
>> >>
>> >> Comments are appreciated.
>> >>
>> >> Jeff
>> >>
>> >> --
>> >> Jeff Hammond
>> >> jeff.science at gmail.com
>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>>
>
>
>
>