[mpiwg-rma] shared-like access within a node with non-shared windows
james.dinan at gmail.com
Fri Oct 18 15:16:54 CDT 2013
Why is MPI_Win_create not scalable? There are certainly implementations
and use cases (e.g. not using different disp_units) that can avoid O(P)
metadata per process. It's more likely that MPI_Win_allocate can avoid
these in more cases, but it's not guaranteed. It seems like an
implementation could leverage the Win_allocate_shared/Win_create combo to
achieve the same scaling result as MPI_Win_allocate.
If we allow other window types to return pointers through win_shared_query,
then we will have to perform memory barriers in all of the RMA
synchronization routines all the time.
On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
> Yes, as I said, that's all I can do right now. But MPI_WIN_CREATE is
> not scalable. And it requires two windows instead of one.
> Brian, Pavan and Xin all seem to agree that this is straightforward to
> implement as an optional feature. We just need to figure out how to
> extend the use of MPI_WIN_SHARED_QUERY to enable it.
> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> > Jeff,
> > Sorry, I haven't read the whole thread closely, so please ignore me if this
> > is nonsense. Can you get what you want by doing MPI_Win_allocate_shared()
> > to create an intranode window, and then pass the buffer allocated by
> > MPI_Win_allocate_shared to MPI_Win_create() to create an internode
> > window?
> > ~Jim.
> > On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond <jeff.science at gmail.com>
> > wrote:
> >> Pavan told me that (in MPICH) MPI_Win_allocate is way better than
> >> MPI_Win_create because the former allocates the shared memory
> >> business. It was implied that the latter requires more work within
> >> the node. (I thought mmap could do the same magic on existing
> >> allocations, but that's not really the point here.)
> >> But within a node, what's even better than a window allocated with
> >> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
> >> since the latter permits load-store. Then I wondered if it would be
> >> possible to have both (1) direct load-store access within a node and
> >> (2) scalable metadata for windows spanning many nodes.
> >> I can get (1) but not (2) by using MPI_Win_allocate_shared and then
> >> dropping a second window for the internode part on top of these using
> >> MPI_Win_create. Of course, I can get (2) but not (1) using
> >> MPI_Win_allocate.
> >> I propose that it be possible to get (1) and (2) by allowing
> >> MPI_Win_shared_query to return pointers to shared memory within a node
> >> even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
> >> When the input argument rank to MPI_Win_shared_query corresponds to
> >> memory that is not accessible by load-store, the out arguments size
> >> and baseptr are 0 and NULL, respectively.
> >> The non-scalable use of this feature would be to loop over all ranks
> >> in the group associated with the window and test for baseptr!=NULL,
> >> while the scalable use would presumably utilize MPI_Comm_split_type,
> >> MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
> >> ranks corresponding to the node, hence the ones that might permit
> >> direct access.
> >> Comments are appreciated.
> >> Jeff
> >> --
> >> Jeff Hammond
> >> jeff.science at gmail.com
> >> _______________________________________________
> >> mpiwg-rma mailing list
> >> mpiwg-rma at lists.mpi-forum.org
> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma
> Jeff Hammond
> jeff.science at gmail.com