[mpiwg-rma] shared-like access within a node with non-shared windows

Jeff Hammond jeff.science at gmail.com
Tue Oct 22 13:16:06 CDT 2013


https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/397

On Tue, Oct 22, 2013 at 12:38 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> Yes, please go ahead.  It seems like we should schedule a WG meeting some
> time soon, to organize ourselves for December.
>
>  ~Jim.
>
>
> On Tue, Oct 22, 2013 at 11:22 AM, Jeff Hammond <jeff.science at gmail.com>
> wrote:
>>
>> Are we ready to make a ticket for this?  Seems like we have mostly
>> converged.
>>
>> Jeff
>>
>> On Fri, Oct 18, 2013 at 6:07 PM, Jeff Hammond <jeff.science at gmail.com>
>> wrote:
>> > I'm okay with that but it has to be clear that this is only the size
>> > of the memory accessible by load-store and not the actual size of the
>> > window at that rank.
>> >
>> > Jeff
>> >
>> > On Fri, Oct 18, 2013 at 5:40 PM, Jim Dinan <james.dinan at gmail.com>
>> > wrote:
>> >> You can return a size of 0.
>> >>
>> >> Jim.
>> >>
>> >> On Oct 18, 2013 5:48 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>> >>>
>> >>> MPI_Win_shared_query always returns a valid base for ranks in
>> >>> MPI_COMM_SELF, does it not?
>> >>>
>> >>> I don't like the error code approach. Can't we make a magic value
>> >>> mpi_fortran_sucks_null?
>> >>>
>> >>> Sent from my iPhone
>> >>>
>> >>> On Oct 18, 2013, at 4:40 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>> >>>
>> >>> >
>> >>> > Ok, can I rephrase what you want as follows --
>> >>> >
>> >>> > MPI_WIN_ALLOCATE(info = gimme_shared_memory)
>> >>> >
>> >>> > This will return a window, where you *might* be able to do direct
>> >>> > load/store to some of the remote process address spaces (let's call
>> >>> > this "direct access memory").
>> >>> >
>> >>> > MPI_WIN_SHARED_QUERY will tell the user, through an appropriate error
>> >>> > code, whether a given remote process gives you direct access memory.
>> >>> >
>> >>> > An MPI implementation is allowed to ignore the info argument, in which
>> >>> > case it will give an error from MPI_WIN_SHARED_QUERY for all target
>> >>> > processes.
>> >>> >
>> >>> > Does that sound right?
>> >>> >
>> >>> > I guess the benefit of this compared to MPI_WIN_ALLOCATE_SHARED +
>> >>> > MPI_WIN_CREATE is that the MPI implementation can better allocate
>> >>> > memory.  For example, it might create symmetric address space across
>> >>> > nodes and use shared memory within each node.  This is particularly
>> >>> > useful when the allocation sizes on all processes are the same.
>> >>> >
>> >>> >  -- Pavan
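[A minimal C sketch of the usage described above.  The info key
"gimme_shared_memory" is just the placeholder used in this thread, not a
standardized key, and the error-return behavior of MPI_WIN_SHARED_QUERY on an
ALLOCATE-flavor window is the proposal under discussion, not current MPI
semantics.]

    /* Hypothetical usage of the proposed interface: the info key and the
     * error-code behavior of MPI_Win_shared_query on an ALLOCATE-flavor
     * window are the proposal under discussion, not existing semantics. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int me, np;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "gimme_shared_memory", "true"); /* placeholder key */

        double *mybase;
        MPI_Win win;
        MPI_Win_allocate(1024 * sizeof(double), sizeof(double), info,
                         MPI_COMM_WORLD, &mybase, &win);

        /* Let errors return so the "no direct access" case can be tested. */
        MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);

        for (int r = 0; r < np; r++) {
            MPI_Aint rsize;
            int rdisp;
            double *rbase;
            int rc = MPI_Win_shared_query(win, r, &rsize, &rdisp, &rbase);
            if (rc == MPI_SUCCESS)
                printf("%d: direct access to rank %d at %p\n",
                       me, r, (void *)rbase);
            /* otherwise rank r must be reached with MPI_Put/MPI_Get */
        }

        MPI_Info_free(&info);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }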
>> >>> >
>> >>> > On Oct 18, 2013, at 4:31 PM, Jeff Hammond wrote:
>> >>> >
>> >>> >> I believe that my assumptions about address translation are
>> >>> >> appropriately conservative given the wide portability goals of MPI.
>> >>> >> Assuming Cray Gemini or similar isn't reasonable.
>> >>> >>
>> >>> >> Exclusive lock isn't my primary interest, but I'll pay the associated
>> >>> >> costs as necessary.
>> >>> >>
>> >>> >> As discussed with Brian, this is a hint via info.  Implementations can
>> >>> >> skip it if they want.  I merely want us to standardize the expanded
>> >>> >> use of shared_query that allows this to work.
>> >>> >>
>> >>> >> Jeff
>> >>> >>
>> >>> >> Sent from my iPhone
>> >>> >>
>> >>> >> On Oct 18, 2013, at 4:25 PM, Jim Dinan <james.dinan at gmail.com>
>> >>> >> wrote:
>> >>> >>
>> >>> >>> This is only correct if you assume that the remote NIC can't
>> >>> >>> translate a displacement.
>> >>> >>>
>> >>> >>> Allowing all processes on the node direct access to the window buffer
>> >>> >>> will require us to perform memory barriers in window synchronizations,
>> >>> >>> and it will cause all lock operations that target the same node to
>> >>> >>> block.  I understand the value in providing this usage model, but what
>> >>> >>> will be the performance cost?
>> >>> >>>
>> >>> >>> ~Jim.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Oct 18, 2013 at 4:50 PM, Jeff Hammond
>> >>> >>> <jeff.science at gmail.com>
>> >>> >>> wrote:
>> >>> >>> It is impossible to do O(1) state with create unless you force the
>> >>> >>> remote side to do all the translation, and thus it precludes RDMA.
>> >>> >>> If you want an RDMA implementation, create requires O(P) state.
>> >>> >>> Allocate does not require it.
>> >>> >>>
>> >>> >>> I believe all of this was thoroughly discussed when we proposed
>> >>> >>> allocate.
>> >>> >>>
>> >>> >>> Sent from my iPhone
>> >>> >>>
>> >>> >>> On Oct 18, 2013, at 3:16 PM, Jim Dinan <james.dinan at gmail.com>
>> >>> >>> wrote:
>> >>> >>>
>> >>> >>>> Why is MPI_Win_create not scalable?  There are certainly
>> >>> >>>> implementations and use cases (e.g. not using different disp_units)
>> >>> >>>> that can avoid O(P) metadata per process.  It's more likely that
>> >>> >>>> MPI_Win_allocate can avoid these in more cases, but it's not
>> >>> >>>> guaranteed.  It seems like an implementation could leverage the
>> >>> >>>> Win_allocate_shared/Win_create combo to achieve the same scaling
>> >>> >>>> result as MPI_Win_allocate.
>> >>> >>>>
>> >>> >>>> If we allow other window types to return pointers through
>> >>> >>>> win_shared_query, then we will have to perform memory barriers in all
>> >>> >>>> of the RMA synchronization routines all the time.
>> >>> >>>>
>> >>> >>>> ~Jim.
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond
>> >>> >>>> <jeff.science at gmail.com> wrote:
>> >>> >>>> Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE
>> >>> >>>> is not scalable.  And it requires two windows instead of one.
>> >>> >>>>
>> >>> >>>> Brian, Pavan, and Xin all seem to agree that this is straightforward
>> >>> >>>> to implement as an optional feature.  We just need to figure out how
>> >>> >>>> to extend the use of MPI_WIN_SHARED_QUERY to enable it.
>> >>> >>>>
>> >>> >>>> Jeff
>> >>> >>>>
>> >>> >>>> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan
>> >>> >>>> <james.dinan at gmail.com>
>> >>> >>>> wrote:
>> >>> >>>>> Jeff,
>> >>> >>>>>
>> >>> >>>>> Sorry, I haven't read the whole thread closely, so please ignore
>> >>> >>>>> me if this is nonsense.  Can you get what you want by doing
>> >>> >>>>> MPI_Win_allocate_shared() to create an intranode window, and then
>> >>> >>>>> pass the buffer allocated by MPI_Win_allocate_shared to
>> >>> >>>>> MPI_Win_create() to create an internode window?
>> >>> >>>>>
>> >>> >>>>> ~Jim.
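[For reference, a minimal sketch of the combination Jim suggests, using
standard MPI-3 calls; the buffer size and element type are arbitrary
placeholders.]

    /* Sketch of the Win_allocate_shared + Win_create combination: allocate
     * node-local shared memory, then expose the same buffer internode. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Communicator of the ranks that share memory with me. */
        MPI_Comm nodecomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);

        /* Intranode window: neighbors on the node obtain load/store
         * pointers via MPI_Win_shared_query on shmwin. */
        MPI_Aint bytes = 1024 * sizeof(double);
        double *base;
        MPI_Win shmwin, win;
        MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                                nodecomm, &base, &shmwin);

        /* Internode window over the same buffer, used for RMA by everyone. */
        MPI_Win_create(base, bytes, sizeof(double), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        /* ... RMA epochs on win; load/store within the node via shmwin ... */

        MPI_Win_free(&win);
        MPI_Win_free(&shmwin);
        MPI_Comm_free(&nodecomm);
        MPI_Finalize();
        return 0;
    }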
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond
>> >>> >>>>> <jeff.science at gmail.com>
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>> >>>>>> Pavan told me that (in MPICH) MPI_Win_allocate is way better than
>> >>> >>>>>> MPI_Win_create because the former allocates the shared memory
>> >>> >>>>>> business.  It was implied that the latter requires more work within
>> >>> >>>>>> the node.  (I thought mmap could do the same magic on existing
>> >>> >>>>>> allocations, but that's not really the point here.)
>> >>> >>>>>>
>> >>> >>>>>> But within a node, what's even better than a window allocated with
>> >>> >>>>>> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
>> >>> >>>>>> since the latter permits load-store.  Then I wondered if it would be
>> >>> >>>>>> possible to have both (1) direct load-store access within a node and
>> >>> >>>>>> (2) scalable metadata for windows spanning many nodes.
>> >>> >>>>>>
>> >>> >>>>>> I can get (1) but not (2) by using MPI_Win_allocate_shared and then
>> >>> >>>>>> dropping a second window for the internode part on top of these
>> >>> >>>>>> using MPI_Win_create.  Of course, I can get (2) but not (1) using
>> >>> >>>>>> MPI_Win_allocate.
>> >>> >>>>>>
>> >>> >>>>>> I propose that it be possible to get (1) and (2) by allowing
>> >>> >>>>>> MPI_Win_shared_query to return pointers to shared memory within a
>> >>> >>>>>> node even if the window has
>> >>> >>>>>> MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.  When the input
>> >>> >>>>>> argument rank to MPI_Win_shared_query corresponds to memory that is
>> >>> >>>>>> not accessible by load-store, the output arguments size and baseptr
>> >>> >>>>>> are 0 and NULL, respectively.
>> >>> >>>>>>
>> >>> >>>>>> The non-scalable use of this feature would be to loop over all
>> >>> >>>>>> ranks in the group associated with the window and test for
>> >>> >>>>>> baseptr != NULL, while the scalable use would presumably utilize
>> >>> >>>>>> MPI_Comm_split_type, MPI_Comm_group, and MPI_Group_translate_ranks
>> >>> >>>>>> to determine the list of ranks corresponding to the node, hence the
>> >>> >>>>>> ones that might permit direct access.
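[A sketch of that scalable pattern, under the proposed semantics.  It assumes
the window spans MPI_COMM_WORLD; the function name query_node_neighbors is
just for illustration.]

    /* Sketch of the scalable usage: find the world ranks on my node and
     * query only those, relying on the proposed size=0/baseptr=NULL return
     * for ranks that are not load/store accessible. */
    #include <mpi.h>
    #include <stdlib.h>

    void query_node_neighbors(MPI_Win win)
    {
        MPI_Comm nodecomm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);

        int nnode;
        MPI_Comm_size(nodecomm, &nnode);

        MPI_Group nodegrp, worldgrp;
        MPI_Comm_group(nodecomm, &nodegrp);
        MPI_Comm_group(MPI_COMM_WORLD, &worldgrp);

        int *noderanks  = malloc(nnode * sizeof(int));
        int *worldranks = malloc(nnode * sizeof(int));
        for (int i = 0; i < nnode; i++)
            noderanks[i] = i;

        /* Translate node-local ranks to ranks in the window's group. */
        MPI_Group_translate_ranks(nodegrp, nnode, noderanks,
                                  worldgrp, worldranks);

        for (int i = 0; i < nnode; i++) {
            MPI_Aint size;
            int disp_unit;
            double *baseptr;
            MPI_Win_shared_query(win, worldranks[i], &size, &disp_unit,
                                 &baseptr);
            if (baseptr != NULL) {
                /* rank worldranks[i]'s window memory is directly accessible */
            }
        }

        free(noderanks);
        free(worldranks);
        MPI_Group_free(&nodegrp);
        MPI_Group_free(&worldgrp);
        MPI_Comm_free(&nodecomm);
    }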
>> >>> >>>>>>
>> >>> >>>>>> Comments are appreciated.
>> >>> >>>>>>
>> >>> >>>>>> Jeff
>> >>> >>>>>>
>> >>> >>>>>> --
>> >>> >>>>>> Jeff Hammond
>> >>> >>>>>> jeff.science at gmail.com
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> --
>> >>> >>>> Jeff Hammond
>> >>> >>>> jeff.science at gmail.com
>> >>> >>>>
>> >>> >>>
>> >>> >>>
>> >>> >
>> >>> > --
>> >>> > Pavan Balaji
>> >>> > http://www.mcs.anl.gov/~balaji
>> >>> >
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Jeff Hammond
>> > jeff.science at gmail.com
>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>
>
>



-- 
Jeff Hammond
jeff.science at gmail.com


