[mpiwg-rma] shared-like access within a node with non-shared windows

Jeff Hammond jeff.science at gmail.com
Fri Oct 18 18:07:00 CDT 2013


I'm okay with that, but it has to be clear that this is only the size
of the memory accessible by load-store and not the actual size of the
window at that rank.

Jeff

On Fri, Oct 18, 2013 at 5:40 PM, Jim Dinan <james.dinan at gmail.com> wrote:
> You can return a size of 0.
>
> Jim.
>
> On Oct 18, 2013 5:48 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>>
>> MPI_WIN_SHARED_QUERY always returns a valid base for ranks in
>> MPI_COMM_SELF, does it not?
>>
>> I don't like the error code approach. Can't we make a magic value
>> mpi_fortran_sucks_null?
>>
>> Sent from my iPhone
>>
>> On Oct 18, 2013, at 4:40 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>
>> >
>> > Ok, can I rephrase what you want as follows --
>> >
>> > MPI_WIN_ALLOCATE(info = gimme_shared_memory)
>> >
>> > This will return a window where you *might* be able to do direct
>> > load/store to some of the remote processes' address spaces (let's call this
>> > "direct access memory").
>> >
>> > MPI_WIN_SHARED_QUERY will tell the user, through an appropriate error
>> > code, whether a given remote process gives you direct access memory.
>> >
>> > An MPI implementation is allowed to ignore the info argument, in which
>> > case it will return an error from MPI_WIN_SHARED_QUERY for all target
>> > processes.
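>> >
>> > In code, the usage I have in mind looks roughly like this (a sketch
>> > only: "gimme_shared_memory" is just the placeholder key from this
>> > thread, the shared-query behavior is the proposal rather than MPI-3,
>> > and nbytes, comm, and target are assumed to be defined):
>> >
>> >   MPI_Info info;
>> >   MPI_Info_create(&info);
>> >   MPI_Info_set(info, "gimme_shared_memory", "true");   /* hint only */
>> >
>> >   double   *mybase;
>> >   MPI_Win   win;
>> >   MPI_Win_allocate(nbytes, sizeof(double), info, comm, &mybase, &win);
>> >   MPI_Info_free(&info);
>> >
>> >   /* so the query can report failure instead of aborting */
>> >   MPI_Win_set_errhandler(win, MPI_ERRORS_RETURN);
>> >
>> >   MPI_Aint  tsize;
>> >   int       tdisp;
>> >   double   *tbase;
>> >   int err = MPI_Win_shared_query(win, target, &tsize, &tdisp, &tbase);
>> >   if (err == MPI_SUCCESS) {
>> >       /* target is "direct access memory": load/store through tbase */
>> >   } else {
>> >       /* no direct access to target: fall back to MPI_Put/MPI_Get */
>> >   }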
>> >
>> > Does that sound right?
>> >
>> > I guess the benefit of this compared to MPI_WIN_ALLOCATE_SHARED +
>> > MPI_WIN_CREATE is that the MPI implementation can better allocate memory.
>> > For example, it might create a symmetric address space across nodes and use
>> > shared memory within each node.  This is particularly useful when the
>> > allocation sizes on all processes are the same.
>> >
>> >  -- Pavan
>> >
>> > On Oct 18, 2013, at 4:31 PM, Jeff Hammond wrote:
>> >
>> >> I believe that my assumptions about address translation are
>> >> appropriately conservative given the wide portability goals of MPI. Assuming
>> >> Cray Gemini or similar isn't reasonable.
>> >>
>> >> Exclusive lock isn't my primary interest but I'll pay the associated
>> >> costs as necessary.
>> >>
>> >> As discussed with Brian, this is a hint via info. Implementations can
>> >> skip it if they want. I merely want us to standardize the expanded use of
>> >> shared_query that allows this to work.
>> >>
>> >> Jeff
>> >>
>> >> Sent from my iPhone
>> >>
>> >> On Oct 18, 2013, at 4:25 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>> >>
>> >>> This is only correct if you assume that the remote NIC can't translate
>> >>> a displacement.
>> >>>
>> >>> Allowing all processes on the node direct access to the window buffer
>> >>> will require us to perform memory barriers in window synchronizations, and
>> >>> it will cause all lock operations that target the same node to block.  I
>> >>> understand the value in providing this usage model, but what will be the
>> >>> performance cost?
>> >>>
>> >>> ~Jim.
>> >>>
>> >>>
>> >>> On Fri, Oct 18, 2013 at 4:50 PM, Jeff Hammond <jeff.science at gmail.com>
>> >>> wrote:
>> >>> It is impossible to do O(1) state with create unless you force the
>> >>> remote side to do all the translation, and thus it precludes RDMA. If you
>> >>> want an RDMA implementation, create requires O(P) state. Allocate does not
>> >>> require it.
>> >>>
>> >>> I believe all of this was thoroughly discussed when we proposed
>> >>> allocate.
>> >>>
>> >>> Sent from my iPhone
>> >>>
>> >>> On Oct 18, 2013, at 3:16 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>> >>>
>> >>>> Why is MPI_Win_create not scalable?  There are certainly
>> >>>> implementations and use cases (e.g., not using different disp_units) that can
>> >>>> avoid O(P) metadata per process.  It's more likely that MPI_Win_allocate can
>> >>>> avoid this in more cases, but it's not guaranteed.  It seems like an
>> >>>> implementation could leverage the Win_allocate_shared/Win_create combo to
>> >>>> achieve the same scaling result as MPI_Win_allocate.
>> >>>>
>> >>>> If we allow other window types to return pointers through
>> >>>> win_shared_query, then we will have to perform memory barriers in all of the
>> >>>> RMA synchronization routines all the time.
>> >>>>
>> >>>> ~Jim.
>> >>>>
>> >>>>
>> >>>> On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond
>> >>>> <jeff.science at gmail.com> wrote:
>> >>>> Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE is
>> >>>> not scalable.  And it requires two windows instead of one.
>> >>>>
>> >>>> Brian, Pavan and Xin all seem to agree that this is straightforward
>> >>>> to
>> >>>> implement as an optional feature.  We just need to figure out how to
>> >>>> extend the use of MPI_WIN_SHARED_QUERY to enable it.
>> >>>>
>> >>>> Jeff
>> >>>>
>> >>>> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <james.dinan at gmail.com>
>> >>>> wrote:
>> >>>>> Jeff,
>> >>>>>
>> >>>>> Sorry, I haven't read the whole thread closely, so please ignore me
>> >>>>> if this
>> >>>>> is nonsense.  Can you get what you want by doing
>> >>>>> MPI_Win_allocate_shared()
>> >>>>> to create an intranode window, and then pass the buffer allocated by
>> >>>>> MPI_Win_allocate_shared to MPI_Win_create() to create an internode
>> >>>>> window?
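>> >>>>>
>> >>>>> A sketch of what I mean (assuming a buffer of nbytes bytes of
>> >>>>> doubles; the variable names are made up):
>> >>>>>
>> >>>>>   /* intranode window: gives load-store access within the node */
>> >>>>>   MPI_Comm nodecomm;
>> >>>>>   MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
>> >>>>>                       MPI_INFO_NULL, &nodecomm);
>> >>>>>   double  *base;
>> >>>>>   MPI_Win  shm_win;
>> >>>>>   MPI_Win_allocate_shared(nbytes, sizeof(double), MPI_INFO_NULL,
>> >>>>>                           nodecomm, &base, &shm_win);
>> >>>>>
>> >>>>>   /* internode window: expose the same buffer to all processes */
>> >>>>>   MPI_Win win;
>> >>>>>   MPI_Win_create(base, nbytes, sizeof(double), MPI_INFO_NULL,
>> >>>>>                  MPI_COMM_WORLD, &win);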
>> >>>>>
>> >>>>> ~Jim.
>> >>>>>
>> >>>>>
>> >>>>> On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond
>> >>>>> <jeff.science at gmail.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>> Pavan told me that (in MPICH) MPI_Win_allocate is way better than
>> >>>>>> MPI_Win_create because the former allocates the shared memory
>> >>>>>> business.  It was implied that the latter requires more work within
>> >>>>>> the node. (I thought mmap could do the same magic on existing
>> >>>>>> allocations, but that's not really the point here.)
>> >>>>>>
>> >>>>>> But within a node, what's even better than a window allocated with
>> >>>>>> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
>> >>>>>> since the latter permits load-store.  Then I wondered if it would
>> >>>>>> be
>> >>>>>> possible to have both (1) direct load-store access within a node
>> >>>>>> and
>> >>>>>> (2) scalable metadata for windows spanning many nodes.
>> >>>>>>
>> >>>>>> I can get (1) but not (2) by using MPI_Win_allocate_shared and then
>> >>>>>> dropping a second window for the internode part on top of these
>> >>>>>> using
>> >>>>>> MPI_Win_create.  Of course, I can get (2) but not (1) using
>> >>>>>> MPI_Win_allocate.
>> >>>>>>
>> >>>>>> I propose that it be possible to get (1) and (2) by allowing
>> >>>>>> MPI_Win_shared_query to return pointers to shared memory within a
>> >>>>>> node
>> >>>>>> even if the window has
>> >>>>>> MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
>> >>>>>> When the input argument rank to MPI_Win_shared_query corresponds to
>> >>>>>> memory that is not accessible by load-store, the out arguments size
>> >>>>>> and baseptr are 0 and NULL, respectively.
>> >>>>>>
>> >>>>>> The non-scalable use of this feature would be to loop over all
>> >>>>>> ranks
>> >>>>>> in the group associated with the window and test for baseptr!=NULL,
>> >>>>>> while the scalable use would presumably utilize
>> >>>>>> MPI_Comm_split_type,
>> >>>>>> MPI_Comm_group and MPI_Group_translate_ranks to determine the list
>> >>>>>> of
>> >>>>>> ranks corresponding to the node, hence the ones that might permit
>> >>>>>> direct access.
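>> >>>>>>
>> >>>>>> A sketch of the scalable version (assuming the proposed extension to
>> >>>>>> MPI_Win_shared_query; 'win' is the MPI_Win_allocate window from above
>> >>>>>> and <stdlib.h> is available for malloc):
>> >>>>>>
>> >>>>>>   MPI_Comm nodecomm;
>> >>>>>>   MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
>> >>>>>>                       MPI_INFO_NULL, &nodecomm);
>> >>>>>>   int nnode;
>> >>>>>>   MPI_Comm_size(nodecomm, &nnode);
>> >>>>>>
>> >>>>>>   MPI_Group nodegroup, wingroup;
>> >>>>>>   MPI_Comm_group(nodecomm, &nodegroup);
>> >>>>>>   MPI_Win_get_group(win, &wingroup);
>> >>>>>>
>> >>>>>>   /* translate node-local ranks into ranks of the window's group */
>> >>>>>>   int *noderanks = malloc(nnode * sizeof(int));
>> >>>>>>   int *winranks  = malloc(nnode * sizeof(int));
>> >>>>>>   for (int i = 0; i < nnode; i++) noderanks[i] = i;
>> >>>>>>   MPI_Group_translate_ranks(nodegroup, nnode, noderanks,
>> >>>>>>                             wingroup, winranks);
>> >>>>>>
>> >>>>>>   /* only these ranks can possibly return a non-NULL baseptr */
>> >>>>>>   for (int i = 0; i < nnode; i++) {
>> >>>>>>       MPI_Aint  sz;
>> >>>>>>       int       du;
>> >>>>>>       double   *ptr;
>> >>>>>>       MPI_Win_shared_query(win, winranks[i], &sz, &du, &ptr);
>> >>>>>>       if (ptr != NULL) {
>> >>>>>>           /* direct load-store access to rank winranks[i] */
>> >>>>>>       }
>> >>>>>>   }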
>> >>>>>>
>> >>>>>> Comments are appreciated.
>> >>>>>>
>> >>>>>> Jeff
>> >>>>>>
>> >>>>>> --
>> >>>>>> Jeff Hammond
>> >>>>>> jeff.science at gmail.com
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Jeff Hammond
>> >>>> jeff.science at gmail.com
>> >>>>
>> >>>
>> >
>> > --
>> > Pavan Balaji
>> > http://www.mcs.anl.gov/~balaji
>> >
>
>



-- 
Jeff Hammond
jeff.science at gmail.com


