[mpiwg-rma] shared-like access within a node with non-shared windows

Jim Dinan james.dinan at gmail.com
Tue Oct 22 12:38:07 CDT 2013


Yes, please go ahead.  It seems like we should schedule a WG meeting
sometime soon to organize ourselves for December.

 ~Jim.


On Tue, Oct 22, 2013 at 11:22 AM, Jeff Hammond <jeff.science at gmail.com> wrote:

> Are we ready to make a ticket for this?  Seems like we have mostly
> converged.
>
> Jeff
>
> On Fri, Oct 18, 2013 at 6:07 PM, Jeff Hammond <jeff.science at gmail.com>
> wrote:
> > I'm okay with that but it has to be clear that this is only the size
> > of the memory accessible by load-store and not the actual size of the
> > window at that rank.
> >
> > Jeff
> >
> >> On Fri, Oct 18, 2013 at 5:40 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> >> You can return a size of 0.
> >>
> >> Jim.
> >>
> >> On Oct 18, 2013 5:48 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
> >>>
> >>> MPI_Win_shared_query always returns a valid base for ranks in
> >>> MPI_COMM_SELF, does it not?
> >>>
> >>> I don't like the error code approach. Can't we make a magic value
> >>> mpi_fortran_sucks_null?
> >>>
> >>> On Oct 18, 2013, at 4:40 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
> >>>
> >>> >
> >>> > Ok, can I rephrase what you want as follows --
> >>> >
> >>> > MPI_WIN_ALLOCATE(info = gimme_shared_memory)
> >>> >
> >>> > This will return a window where you *might* be able to do direct
> >>> > load/store to some of the remote process address spaces (let's call
> >>> > this "direct access memory").
> >>> >
> >>> > MPI_WIN_SHARED_QUERY will tell the user, through an appropriate error
> >>> > code, whether a given remote process gives you direct access memory.
> >>> >
> >>> > An MPI implementation is allowed to ignore the info argument, in
> >>> > which case it will return an error from MPI_WIN_SHARED_QUERY for all
> >>> > target processes.
> >>> >
> >>> > Does that sound right?
> >>> >
> >>> > I guess the benefit of this compared to MPI_WIN_ALLOCATE_SHARED +
> >>> > MPI_WIN_CREATE is that the MPI implementation can better allocate
> >>> > memory.  For example, it might create a symmetric address space
> >>> > across nodes and use shared memory within each node.  This is
> >>> > particularly useful when the allocation sizes on all processes are
> >>> > the same.
> >>> >
> >>> >  -- Pavan
> >>> >
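
A minimal sketch, in C, of the pattern Pavan describes above: the info key
"gimme_shared_memory" is the placeholder name used in this thread (not a
standard key), and having MPI_Win_shared_query report inaccessible targets
through its return code is the proposal under discussion, not standard
MPI-3 behavior.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* hypothetical info key from this thread; implementations may ignore it */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "gimme_shared_memory", "true");

    void *base;
    MPI_Win win;
    MPI_Win_allocate(4096, 1, info, MPI_COMM_WORLD, &base, &win);
    MPI_Info_free(&info);

    /* return error codes instead of aborting so the query can fail cleanly */
    MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);

    int nproc;
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    for (int rank = 0; rank < nproc; rank++) {
        MPI_Aint size;
        int disp_unit;
        void *ptr;
        /* proposed semantics: success means 'rank' gives direct access memory */
        int rc = MPI_Win_shared_query(win, rank, &size, &disp_unit, &ptr);
        if (rc == MPI_SUCCESS)
            printf("rank %d is directly accessible at %p\n", rank, ptr);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Note that Jeff objects to the error-code approach elsewhere in the thread;
the alternative raised there is to return a size of 0 (and a NULL baseptr)
instead.
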
> >>> > On Oct 18, 2013, at 4:31 PM, Jeff Hammond wrote:
> >>> >
> >>> >> I believe that my assumptions about address translation are
> >>> >> appropriately conservative given the wide portability goals of MPI.
> >>> >> Assuming Cray Gemini or similar isn't reasonable.
> >>> >>
> >>> >> Exclusive lock isn't my primary interest but I'll pay the associated
> >>> >> costs as necessary.
> >>> >>
> >>> >> As discussed with Brian, this is a hint via info.  Implementations
> >>> >> can skip it if they want.  I merely want us to standardize the
> >>> >> expanded use of shared_query that allows this to work.
> >>> >>
> >>> >> Jeff
> >>> >>
> >>> >> On Oct 18, 2013, at 4:25 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> >>> >>
> >>> >>> This is only correct if you assume that the remote NIC can't
> >>> >>> translate a displacement.
> >>> >>>
> >>> >>> Allowing all processes on the node direct access to the window
> >>> >>> buffer will require us to perform memory barriers in window
> >>> >>> synchronizations, and it will cause all lock operations that target
> >>> >>> the same node to block.  I understand the value in providing this
> >>> >>> usage model, but what will be the performance cost?
> >>> >>>
> >>> >>> ~Jim.
> >>> >>>
> >>> >>>
> >>> >>> On Fri, Oct 18, 2013 at 4:50 PM, Jeff Hammond
> >>> >>> <jeff.science at gmail.com> wrote:
> >>> >>> It is impossible to do O(1) state with create unless you force the
> >>> >>> remote side to do all the translation, and thus it precludes RDMA.
> >>> >>> If you want an RDMA implementation, create requires O(P) state.
> >>> >>> Allocate does not require it.
> >>> >>>
> >>> >>> I believe all of this was thoroughly discussed when we proposed
> >>> >>> allocate.
> >>> >>>
> >>> >>> On Oct 18, 2013, at 3:16 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> >>> >>>
> >>> >>>> Why is MPI_Win_create not scalable?  There are certainly
> >>> >>>> implementations and use cases (e.g., not using different
> >>> >>>> disp_units) that can avoid O(P) metadata per process.  It's more
> >>> >>>> likely that MPI_Win_allocate can avoid these in more cases, but
> >>> >>>> it's not guaranteed.  It seems like an implementation could
> >>> >>>> leverage the Win_allocate_shared/Win_create combo to achieve the
> >>> >>>> same scaling result as MPI_Win_allocate.
> >>> >>>>
> >>> >>>> If we allow other window types to return pointers through
> >>> >>>> win_shared_query, then we will have to perform memory barriers in
> >>> >>>> all of the RMA synchronization routines all the time.
> >>> >>>>
> >>> >>>> ~Jim.
> >>> >>>>
> >>> >>>>
> >>> >>>> On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond
> >>> >>>> <jeff.science at gmail.com> wrote:
> >>> >>>> Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE
> >>> >>>> is not scalable.  And it requires two windows instead of one.
> >>> >>>>
> >>> >>>> Brian, Pavan, and Xin all seem to agree that this is
> >>> >>>> straightforward to implement as an optional feature.  We just need
> >>> >>>> to figure out how to extend the use of MPI_WIN_SHARED_QUERY to
> >>> >>>> enable it.
> >>> >>>>
> >>> >>>> Jeff
> >>> >>>>
> >>> >>>> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <james.dinan at gmail.com>
> >>> >>>> wrote:
> >>> >>>>> Jeff,
> >>> >>>>>
> >>> >>>>> Sorry, I haven't read the whole thread closely, so please ignore
> >>> >>>>> me if this is nonsense.  Can you get what you want by doing
> >>> >>>>> MPI_Win_allocate_shared() to create an intranode window, and then
> >>> >>>>> pass the buffer allocated by MPI_Win_allocate_shared to
> >>> >>>>> MPI_Win_create() to create an internode window?
> >>> >>>>>
> >>> >>>>> ~Jim.
> >>> >>>>>
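
A minimal sketch, in C, of the two-window combination Jim asks about, using
only standard MPI-3 calls; the 4096-byte size and the lack of error checking
are illustrative.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* communicator of the ranks that can share memory with this process */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    /* intranode window: the buffer lives in shared memory, so node-local
       peers can reach it by load-store via MPI_Win_shared_query */
    MPI_Aint bytes = 4096;   /* illustrative size */
    void *base;
    MPI_Win shm_win;
    MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, nodecomm, &base, &shm_win);

    /* internode window: expose the same buffer to all ranks via RMA */
    MPI_Win win;
    MPI_Win_create(base, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* ... RMA through 'win'; direct load-store within the node through
       pointers obtained from MPI_Win_shared_query on 'shm_win' ... */

    MPI_Win_free(&win);
    MPI_Win_free(&shm_win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}

As Jeff replies, this provides load-store within the node but relies on
MPI_Win_create, which may carry O(P) metadata per process, and it requires
two windows instead of one.
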
> >>> >>>>>
> >>> >>>>> On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond
> >>> >>>>> <jeff.science at gmail.com>
> >>> >>>>> wrote:
> >>> >>>>>>
> >>> >>>>>> Pavan told me that (in MPICH) MPI_Win_allocate is way better
> >>> >>>>>> than MPI_Win_create because the former allocates the shared
> >>> >>>>>> memory business.  It was implied that the latter requires more
> >>> >>>>>> work within the node.  (I thought mmap could do the same magic
> >>> >>>>>> on existing allocations, but that's not really the point here.)
> >>> >>>>>>
> >>> >>>>>> But within a node, what's even better than a window allocated
> >>> >>>>>> with MPI_Win_allocate is a window allocated with
> >>> >>>>>> MPI_Win_allocate_shared, since the latter permits load-store.
> >>> >>>>>> Then I wondered if it would be possible to have both (1) direct
> >>> >>>>>> load-store access within a node and (2) scalable metadata for
> >>> >>>>>> windows spanning many nodes.
> >>> >>>>>>
> >>> >>>>>> I can get (1) but not (2) by using MPI_Win_allocate_shared and
> >>> >>>>>> then dropping a second window for the internode part on top of
> >>> >>>>>> these using MPI_Win_create.  Of course, I can get (2) but not (1)
> >>> >>>>>> using MPI_Win_allocate.
> >>> >>>>>>
> >>> >>>>>> I propose that it be possible to get (1) and (2) by allowing
> >>> >>>>>> MPI_Win_shared_query to return pointers to shared memory within
> >>> >>>>>> a node even if the window has
> >>> >>>>>> MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.  When the input
> >>> >>>>>> argument rank to MPI_Win_shared_query corresponds to memory that
> >>> >>>>>> is not accessible by load-store, the out arguments size and
> >>> >>>>>> baseptr are 0 and NULL, respectively.
> >>> >>>>>>
> >>> >>>>>> The non-scalable use of this feature would be to loop over all
> >>> >>>>>> ranks in the group associated with the window and test for
> >>> >>>>>> baseptr != NULL, while the scalable use would presumably utilize
> >>> >>>>>> MPI_Comm_split_type, MPI_Comm_group, and
> >>> >>>>>> MPI_Group_translate_ranks to determine the list of ranks
> >>> >>>>>> corresponding to the node, hence the ones that might permit
> >>> >>>>>> direct access.
> >>> >>>>>>
> >>> >>>>>> Comments are appreciated.
> >>> >>>>>>
> >>> >>>>>> Jeff
> >>> >>>>>>
> >>> >>>>>> --
> >>> >>>>>> Jeff Hammond
> >>> >>>>>> jeff.science at gmail.com
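
A minimal sketch, in C, of the scalable usage Jeff outlines above, assuming
the proposed (non-standard) extension in which MPI_Win_shared_query on an
allocate-flavor window returns size 0 and a NULL baseptr for ranks that are
not reachable by load-store.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* an ordinary allocate-flavor window over all ranks */
    void *base;
    MPI_Win win;
    MPI_Win_allocate(4096, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);

    /* the ranks that share a node with this process */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    int nnode;
    MPI_Comm_size(nodecomm, &nnode);

    /* translate node-local ranks into ranks of the window's communicator */
    MPI_Group node_group, win_group;
    MPI_Comm_group(nodecomm, &node_group);
    MPI_Comm_group(MPI_COMM_WORLD, &win_group);

    int *node_ranks = malloc(nnode * sizeof(int));
    int *win_ranks  = malloc(nnode * sizeof(int));
    for (int i = 0; i < nnode; i++)
        node_ranks[i] = i;
    MPI_Group_translate_ranks(node_group, nnode, node_ranks,
                              win_group, win_ranks);

    /* query only the ranks that might permit direct load-store; per the
       proposal, other ranks would yield size 0 and a NULL baseptr */
    for (int i = 0; i < nnode; i++) {
        MPI_Aint size;
        int disp_unit;
        void *ptr;
        MPI_Win_shared_query(win, win_ranks[i], &size, &disp_unit, &ptr);
        if (ptr != NULL)
            printf("window rank %d: %ld bytes accessible by load-store\n",
                   win_ranks[i], (long)size);
    }

    free(node_ranks);
    free(win_ranks);
    MPI_Group_free(&node_group);
    MPI_Group_free(&win_group);
    MPI_Comm_free(&nodecomm);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

Restricting the queries to the node-local ranks is what keeps this scalable;
the non-scalable alternative is to loop over every rank in the window's
group and test baseptr.
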
> >>> >>>>>
> >>> >>>>>
> >>> >>>>>
> >>> >>>>
> >>> >>>>
> >>> >>>>
> >>> >>>> --
> >>> >>>> Jeff Hammond
> >>> >>>> jeff.science at gmail.com
> >>> >>>
> >>> >
> >>> > --
> >>> > Pavan Balaji
> >>> > http://www.mcs.anl.gov/~balaji
> >>> >
> >>
> >>
> >
> >
> >
> > --
> > Jeff Hammond
> > jeff.science at gmail.com
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
>