[mpiwg-rma] shared-like access within a node with non-shared windows

Jim Dinan james.dinan at gmail.com
Fri Oct 18 17:40:17 CDT 2013


You can return a size of 0.

Jim.
On Oct 18, 2013 5:48 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:

> MPI_Win_shared_query always returns a valid base for ranks in
> MPI_COMM_SELF, does it not?
>
> I don't like the error code approach. Can't we make a magic value
> mpi_fortran_sucks_null?
>
> Sent from my iPhone
>
> On Oct 18, 2013, at 4:40 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
> >
> > Ok, can I rephrase what you want as follows --
> >
> > MPI_WIN_ALLOCATE(info = gimme_shared_memory)
> >
> > This will return a window, where you *might* be able to do direct
> load/store to some of the remote process address spaces (let's call this
> "direct access memory").
> >
> > MPI_WIN_SHARED_QUERY will tell the user, through an appropriate error
> code, whether a given remote process gives you direct access memory.
> >
> > An MPI implementation is allowed to ignore the info argument, in which
> case it will return an error from MPI_WIN_SHARED_QUERY for all target
> processes.
> >
> > Does that sound right?
> >
> > I guess the benefit of this compared to MPI_WIN_ALLOCATE_SHARED +
> MPI_WIN_CREATE is that the MPI implementation can better allocate memory.
>  For example, it might create symmetric address space across nodes and use
> shared memory within each node.  This is particularly useful when the
> allocation sizes on all processes are the same.
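
A minimal sketch of the usage Pavan outlines, assuming a hypothetical
(non-standard) info key "gimme_shared_memory" and the proposed behavior that
MPI_WIN_SHARED_QUERY returns an error code for targets that do not give
direct access memory:

    #include <mpi.h>
    #include <stdio.h>

    void probe_direct_access(MPI_Comm comm, MPI_Aint my_size)
    {
        MPI_Info info;
        MPI_Win  win;
        void    *mybase;
        int      nproc;

        MPI_Comm_size(comm, &nproc);

        MPI_Info_create(&info);
        /* hypothetical info key from this thread; not standardized */
        MPI_Info_set(info, "gimme_shared_memory", "true");

        MPI_Win_allocate(my_size, 1, info, comm, &mybase, &win);
        MPI_Info_free(&info);

        /* return error codes instead of aborting, so the proposed answer
           from MPI_WIN_SHARED_QUERY can be inspected */
        MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);

        for (int rank = 0; rank < nproc; rank++) {
            MPI_Aint size;
            int      disp_unit;
            void    *baseptr;
            if (MPI_Win_shared_query(win, rank, &size, &disp_unit,
                                     &baseptr) == MPI_SUCCESS) {
                printf("rank %d is directly accessible at %p\n", rank, baseptr);
            } /* else: no direct load/store access to this target */
        }

        MPI_Win_free(&win);
    }
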
> >
> >  -- Pavan
> >
> > On Oct 18, 2013, at 4:31 PM, Jeff Hammond wrote:
> >
> >> I believe that my assumptions about address translation are
> appropriately conservative given the wide portability goals of MPI.
> Assuming Cray Gemini or similar isn't reasonable.
> >>
> >> Exclusive lock isn't my primary interest but I'll pay the associated
> costs as necessary.
> >>
> >> As discussed with Brian, this is a hint via info. Implementations can
> skip it if they want. I merely want us to standardize the expanded use of
> shared_query that allows this to work.
> >>
> >> Jeff
> >>
> >> Sent from my iPhone
> >>
> >> On Oct 18, 2013, at 4:25 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> >>
> >>> This is only correct if you assume that the remote NIC can't translate
> a displacement.
> >>>
> >>> Allowing all processes on the node direct access to the window buffer
> will require us to perform memory barriers in window synchronizations, and
> it will cause all lock operations that target the same node to block.  I
> understand the value in providing this usage model, but what will be the
> performance cost?
> >>>
> >>> ~Jim.
> >>>
> >>>
> >>> On Fri, Oct 18, 2013 at 4:50 PM, Jeff Hammond <jeff.science at gmail.com>
> wrote:
> >>> It is impossible to do O(1) state with create unless you force the
> remote side to do all the translation, which precludes RDMA. If you want
> an RDMA implementation, create requires O(P) state. Allocate does not
> require it.
> >>>
> >>> I believe all of this was thoroughly discussed when we proposed
> allocate.
> >>>
> >>> Sent from my iPhone
> >>>
> >>> On Oct 18, 2013, at 3:16 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> >>>
> >>>> Why is MPI_Win_create not scalable?  There are certainly
> implementations and use cases (e.g. not using different disp_units) that
> can avoid O(P) metadata per process.  MPI_Win_allocate is more likely to
> avoid this metadata, but it's not guaranteed.  It seems like an
> implementation could leverage the Win_allocate_shared/Win_create combo to
> achieve the same scaling result as
> MPI_Win_allocate.
> >>>>
> >>>> If we allow other window types to return pointers through
> win_shared_query, then we will have to perform memory barriers in all of
> the RMA synchronization routines all the time.
> >>>>
> >>>> ~Jim.
> >>>>
> >>>>
> >>>> On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond <jeff.science at gmail.com>
> wrote:
> >>>> Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE is
> >>>> not scalable.  And it requires two windows instead of one.
> >>>>
> >>>> Brian, Pavan and Xin all seem to agree that this is straightforward to
> >>>> implement as an optional feature.  We just need to figure out how to
> >>>> extend the use of MPI_WIN_SHARED_QUERY to enable it.
> >>>>
> >>>> Jeff
> >>>>
> >>>> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <james.dinan at gmail.com>
> wrote:
> >>>>> Jeff,
> >>>>>
> >>>>> Sorry, I haven't read the whole thread closely, so please ignore me
> if this
> >>>>> is nonsense.  Can you get what you want by doing
> MPI_Win_allocate_shared()
> >>>>> to create an intranode window, and then passing the buffer allocated by
> >>>>> MPI_Win_allocate_shared to MPI_Win_create() to create an internode
> window?
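
For reference, a minimal sketch of the combination Jim suggests; the function
name and the disp_unit of 1 are illustrative:

    #include <mpi.h>

    void make_two_windows(MPI_Comm comm, MPI_Aint bytes, void **baseptr,
                          MPI_Win *nodewin, MPI_Win *worldwin)
    {
        MPI_Comm nodecomm;
        int rank;

        MPI_Comm_rank(comm, &rank);

        /* communicator containing the ranks that can share memory */
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                            MPI_INFO_NULL, &nodecomm);

        /* window 1: shared memory, permits load/store within the node */
        MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, nodecomm,
                                baseptr, nodewin);

        /* window 2: expose the same buffer to every rank for RMA */
        MPI_Win_create(*baseptr, bytes, 1, MPI_INFO_NULL, comm, worldwin);

        MPI_Comm_free(&nodecomm);
    }

The cost Jeff objects to above is visible here: one buffer ends up behind two
windows, each with its own synchronization.
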
> >>>>>
> >>>>> ~Jim.
> >>>>>
> >>>>>
> >>>>> On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond <
> jeff.science at gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Pavan told me that (in MPICH) MPI_Win_allocate is way better than
> >>>>>> MPI_Win_create because the former allocates the shared memory
> >>>>>> business.  It was implied that the latter requires more work within
> >>>>>> the node. (I thought mmap could do the same magic on existing
> >>>>>> allocations, but that's not really the point here.)
> >>>>>>
> >>>>>> But within a node, what's even better than a window allocated with
> >>>>>> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
> >>>>>> since the latter permits load-store.  Then I wondered if it would be
> >>>>>> possible to have both (1) direct load-store access within a node and
> >>>>>> (2) scalable metadata for windows spanning many nodes.
> >>>>>>
> >>>>>> I can get (1) but not (2) by using MPI_Win_allocate_shared and then
> >>>>>> dropping a second window for the internode part on top of these
> using
> >>>>>> MPI_Win_create.  Of course, I can get (2) but not (1) using
> >>>>>> MPI_Win_allocate.
> >>>>>>
> >>>>>> I propose that it be possible to get (1) and (2) by allowing
> >>>>>> MPI_Win_shared_query to return pointers to shared memory within a
> node
> >>>>>> even if the window has
> MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
> >>>>>> When the input argument rank to MPI_Win_shared_query corresponds to
> >>>>>> memory that is not accessible by load-store, the out arguments size
> >>>>>> and baseptr are 0 and NULL, respectively.
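
A minimal sketch of the proposed semantics (not part of MPI-3.0 as
published); the helper name is illustrative, and the synchronization calls an
application would need around the store and the MPI_Put are omitted:

    #include <mpi.h>

    void put_or_store(MPI_Win win, int target_rank, double value)
    {
        MPI_Aint size;
        int      disp_unit;
        double  *remote;

        /* proposed: on an MPI_WIN_FLAVOR_ALLOCATE window, returns
           size == 0 and baseptr == NULL for ranks whose memory is not
           load/store accessible */
        MPI_Win_shared_query(win, target_rank, &size, &disp_unit, &remote);

        if (remote != NULL) {
            remote[0] = value;   /* direct store within the node */
        } else {
            /* fall back to RMA for targets without direct access */
            MPI_Put(&value, 1, MPI_DOUBLE, target_rank, 0,
                    1, MPI_DOUBLE, win);
        }
    }
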
> >>>>>>
> >>>>>> The non-scalable use of this feature would be to loop over all ranks
> >>>>>> in the group associated with the window and test for baseptr!=NULL,
> >>>>>> while the scalable use would presumably utilize MPI_Comm_split_type,
> >>>>>> MPI_Comm_group and MPI_Group_translate_ranks to determine the list
> of
> >>>>>> ranks corresponding to the node, hence the ones that might permit
> >>>>>> direct access.
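
A minimal sketch of the scalable enumeration described above, assuming the
window was created over comm; the helper name is illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    /* Returns the number of ranks on this node and, in *node_ranks_out,
       their ranks in comm.  The caller frees the array; only these ranks
       need to be passed to MPI_Win_shared_query. */
    int ranks_on_my_node(MPI_Comm comm, int **node_ranks_out)
    {
        MPI_Comm  nodecomm;
        MPI_Group nodegroup, commgroup;
        int       nnode, *local, *translated;

        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);
        MPI_Comm_group(nodecomm, &nodegroup);
        MPI_Comm_group(comm, &commgroup);

        MPI_Group_size(nodegroup, &nnode);
        local      = malloc(nnode * sizeof(int));
        translated = malloc(nnode * sizeof(int));
        for (int i = 0; i < nnode; i++) local[i] = i;

        /* map node-local ranks to ranks in comm */
        MPI_Group_translate_ranks(nodegroup, nnode, local,
                                  commgroup, translated);

        MPI_Group_free(&nodegroup);
        MPI_Group_free(&commgroup);
        MPI_Comm_free(&nodecomm);
        free(local);

        *node_ranks_out = translated;
        return nnode;
    }
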
> >>>>>>
> >>>>>> Comments are appreciated.
> >>>>>>
> >>>>>> Jeff
> >>>>>>
> >>>>>> --
> >>>>>> Jeff Hammond
> >>>>>> jeff.science at gmail.com
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Jeff Hammond
> >>>> jeff.science at gmail.com
> >
> > --
> > Pavan Balaji
> > http://www.mcs.anl.gov/~balaji
> >
>

