[mpiwg-rma] shared-like access within a node with non-shared windows

Jim Dinan james.dinan at gmail.com
Fri Oct 18 15:21:34 CDT 2013


This would also change the locking semantics.  If the implementation permits
load/store to the given target, the lock function cannot return until the
lock has been granted.
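
As a rough illustration (a hypothetical, untested sketch; it assumes the
proposed extension where MPI_Win_shared_query returns a usable pointer for a
node-local target of a non-shared-flavor window, and "win"/"target_rank" are
placeholders):

    #include <mpi.h>

    /* Once load/store is permitted, the access that follows the lock can be
     * a plain store that the MPI library never sees, so MPI_Win_lock cannot
     * defer acquiring the lock until the first RMA call. */
    void locked_store(MPI_Win win, int target_rank)
    {
        MPI_Aint size;
        int disp_unit;
        double *remote = NULL;

        MPI_Win_shared_query(win, target_rank, &size, &disp_unit, &remote);

        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win);
        remote[0] = 42.0;   /* direct store, no MPI call involved */
        MPI_Win_unlock(target_rank, win);
    }

With Put/Get only, an implementation can return from MPI_Win_lock immediately
and acquire the lock lazily; with direct stores it has no later hook to do so.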

 ~Jim.


On Fri, Oct 18, 2013 at 4:19 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

>
> A side question --
>
> I was trying to see what semantic benefit this would provide compared to
> two overlapping windows, and the only thing I could come up with was
> exclusive locks.  If we have two windows exclusive locks are not going to
> protect accesses effectively.  Is that the only use case?  What else am I
> missing?  Note that we are restricted to UNIFIED here since that's where
> WIN_ALLOCATE_SHARED is defined.
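>
> To make the exclusive-lock gap concrete, here is a hypothetical sketch;
> win_inter, win_shm, target, and target_base are placeholders for the two
> overlapping windows over the same buffer and the node-local base pointer:
>
>     #include <mpi.h>
>
>     /* Rank A, possibly on another node, updates through the internode
>      * window created with MPI_Win_create. */
>     void update_via_rma(MPI_Win win_inter, int target, double *val)
>     {
>         MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win_inter);
>         MPI_Put(val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win_inter);
>         MPI_Win_unlock(target, win_inter);
>     }
>
>     /* Rank B, on the target's node, updates through the shared-memory
>      * window.  Its exclusive lock is on a different window object, so it
>      * does not conflict with A's lock above and the two updates can race. */
>     void update_via_store(MPI_Win win_shm, int target, double *target_base)
>     {
>         MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win_shm);
>         target_base[0] += 1.0;
>         MPI_Win_unlock(target, win_shm);
>     }
>
> With a single window, the two exclusive locks would serialize these updates.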
>
>   -- Pavan
>
> On Oct 18, 2013, at 3:16 PM, Jim Dinan wrote:
>
> > Why is MPI_Win_create not scalable?  There are certainly implementations
> and use cases (e.g. not using different disp_units) that can avoid O(P)
> metadata per process.  It's more likely that MPI_Win_allocate can avoid
> these in more cases, but it's not guaranteed.  It seems like an
> implementation could leverage the Win_allocate_shared/Win_create combo to
> achieve the same scaling result as MPI_Win_allocate.
> >
> > If we allow other window types to return pointers through
> win_shared_query, then we will have to perform memory barriers in all of
> the RMA synchronization routines all the time.
> >
> >  ~Jim.
> >
> >
> > On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond <jeff.science at gmail.com>
> wrote:
> > Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE is
> > not scalable.  And it requires two windows instead of one.
> >
> > Brian, Pavan and Xin all seem to agree that this is straightforward to
> > implement as an optional feature.  We just need to figure out how to
> > extend the use of MPI_WIN_SHARED_QUERY to enable it.
> >
> > Jeff
> >
> > On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <james.dinan at gmail.com>
> wrote:
> > > Jeff,
> > >
> > > Sorry, I haven't read the whole thread closely, so please ignore me if
> this
> > > is nonsense.  Can you get what you want by doing
> MPI_Win_allocate_shared()
> > > to create an intranode window, and then pass the buffer allocated by
> > > MPI_Win_allocate_shared to MPI_Win_create() to create an internode
> window?
> > >
> > >  ~Jim.
> > >
> > >
> > > On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond <jeff.science at gmail.com>
> > > wrote:
> > >>
> > >> Pavan told me that (in MPICH) MPI_Win_allocate is way better than
> > >> MPI_Win_create because the former allocates the shared memory
> > >> business.  It was implied that the latter requires more work within
> > >> the node. (I thought mmap could do the same magic on existing
> > >> allocations, but that's not really the point here.)
> > >>
> > >> But within a node, what's even better than a window allocated with
> > >> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
> > >> since the latter permits load-store.  Then I wondered if it would be
> > >> possible to have both (1) direct load-store access within a node and
> > >> (2) scalable metadata for windows spanning many nodes.
> > >>
> > >> I can get (1) but not (2) by using MPI_Win_allocate_shared and then
> > >> dropping a second window for the internode part on top of these using
> > >> MPI_Win_create.  Of course, I can get (2) but not (1) using
> > >> MPI_Win_allocate.
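> > >>
> > >> A minimal sketch of that (1)-but-not-(2) combination, assuming UNIFIED
> > >> and omitting error checking (untested; sizes and names are placeholders):
> > >>
> > >>     #include <mpi.h>
> > >>
> > >>     int main(int argc, char **argv)
> > >>     {
> > >>         MPI_Comm node_comm;
> > >>         MPI_Win win_shm, win_inter;
> > >>         double *base;
> > >>         MPI_Aint bytes = 1024 * sizeof(double);
> > >>
> > >>         MPI_Init(&argc, &argv);
> > >>
> > >>         /* Node-local shared-memory window: permits load/store among
> > >>          * ranks on the same node. */
> > >>         MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
> > >>                             MPI_INFO_NULL, &node_comm);
> > >>         MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
> > >>                                 node_comm, &base, &win_shm);
> > >>
> > >>         /* Second window over the same buffer for internode RMA; this
> > >>          * is the part that is not guaranteed to have scalable
> > >>          * metadata. */
> > >>         MPI_Win_create(base, bytes, sizeof(double), MPI_INFO_NULL,
> > >>                        MPI_COMM_WORLD, &win_inter);
> > >>
> > >>         /* ... use win_shm for node-local load/store and win_inter for
> > >>          * Put/Get/Accumulate across nodes ... */
> > >>
> > >>         MPI_Win_free(&win_inter);
> > >>         MPI_Win_free(&win_shm);
> > >>         MPI_Comm_free(&node_comm);
> > >>         MPI_Finalize();
> > >>         return 0;
> > >>     }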
> > >>
> > >> I propose that it be possible to get (1) and (2) by allowing
> > >> MPI_Win_shared_query to return pointers to shared memory within a node
> > >> even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
> > >> When the input argument rank to MPI_Win_shared_query corresponds to
> > >> memory that is not accessible by load-store, the out arguments size
> > >> and baseptr are 0 and NULL, respectively.
> > >>
> > >> The non-scalable use of this feature would be to loop over all ranks
> > >> in the group associated with the window and test for baseptr!=NULL,
> > >> while the scalable use would presumably utilize MPI_Comm_split_type,
> > >> MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
> > >> ranks corresponding to the node, hence the ones that might permit
> > >> direct access.
> > >>
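> > >> A hypothetical sketch of that scalable pattern, assuming the proposed
> > >> MPI_Win_shared_query semantics and that "win" was created with
> > >> MPI_Win_allocate over "comm" (both placeholders):
> > >>
> > >>     #include <mpi.h>
> > >>     #include <stddef.h>
> > >>
> > >>     void find_node_local_bases(MPI_Win win, MPI_Comm comm)
> > >>     {
> > >>         MPI_Comm node_comm;
> > >>         MPI_Group comm_group, node_group;
> > >>         int node_size;
> > >>
> > >>         MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
> > >>                             MPI_INFO_NULL, &node_comm);
> > >>         MPI_Comm_group(comm, &comm_group);
> > >>         MPI_Comm_group(node_comm, &node_group);
> > >>         MPI_Comm_size(node_comm, &node_size);
> > >>
> > >>         for (int i = 0; i < node_size; i++) {
> > >>             int wrank;
> > >>             MPI_Aint size;
> > >>             int disp_unit;
> > >>             double *ptr = NULL;
> > >>
> > >>             /* Map node-local rank i to its rank in the window's
> > >>              * communicator. */
> > >>             MPI_Group_translate_ranks(node_group, 1, &i, comm_group,
> > >>                                       &wrank);
> > >>
> > >>             /* Proposed behavior: a usable pointer only when wrank's
> > >>              * memory is load/store accessible; otherwise size == 0
> > >>              * and ptr == NULL. */
> > >>             MPI_Win_shared_query(win, wrank, &size, &disp_unit, &ptr);
> > >>             if (ptr != NULL) {
> > >>                 /* direct load/store to wrank's window memory is ok */
> > >>             }
> > >>         }
> > >>
> > >>         MPI_Group_free(&node_group);
> > >>         MPI_Group_free(&comm_group);
> > >>         MPI_Comm_free(&node_comm);
> > >>     }
> > >>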
> > >> Comments are appreciated.
> > >>
> > >> Jeff
> > >>
> > >> --
> > >> Jeff Hammond
> > >> jeff.science at gmail.com
> > >
> > >
> > >
> >
> >
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
> >
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
>