[mpiwg-rma] shared-like access within a node with non-shared windows

Jeff Hammond jeff.science at gmail.com
Tue Oct 22 10:22:15 CDT 2013


Are we ready to make a ticket for this?  Seems like we have mostly converged.

Jeff

On Fri, Oct 18, 2013 at 6:07 PM, Jeff Hammond <jeff.science at gmail.com> wrote:
> I'm okay with that but it has to be clear that this is only the size
> of the memory accessible by load-store and not the actual size of the
> window at that rank.
>
> Jeff
>
> On Fri, Oct 18, 2013 at 5:40 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>> You can return a size of 0.
>>
>> Jim.
>>
>> On Oct 18, 2013 5:48 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>>>
>>> MPI_Win_shared_query always returns a valid base for ranks in
>>> MPI_COMM_SELF, does it not?
>>>
>>> I don't like the error code approach. Can't we make a magic value
>>> mpi_fortran_sucks_null?
>>>
>>> Sent from my iPhone
>>>
>>> On Oct 18, 2013, at 4:40 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>>>
>>> >
>>> > Ok, can I rephrase what you want as follows --
>>> >
>>> > MPI_WIN_ALLOCATE(info = gimme_shared_memory)
>>> >
>>> > This will return a window, where you *might* be able to do direct
>>> > load/store to some of the remote process address spaces (let's call this
>>> > "direct access memory").
>>> >
>>> > MPI_WIN_SHARED_QUERY will tell the user, through an appropriate error
>>> > code, whether a given remote process gives you direct access memory.
>>> >
>>> > An MPI implementation is allowed to ignore the info argument, in which
>>> > case it will return an error from MPI_WIN_SHARED_QUERY for all target
>>> > processes.
>>> >
>>> > Does that sound right?
>>> >
>>> > I guess the benefit of this compared to MPI_WIN_ALLOCATE_SHARED +
>>> > MPI_WIN_CREATE is that the MPI implementation can better allocate memory.
>>> > For example, it might create a symmetric address space across nodes and
>>> > use shared memory within each node.  This is particularly useful when the
>>> > allocation sizes on all processes are the same.
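
A minimal sketch of the usage described above, assuming the placeholder
info key from this thread ("gimme_shared_memory" is not a standardized
key) and the proposed MPI_WIN_SHARED_QUERY behavior on an ALLOCATE-flavor
window, which is not current MPI-3.0 semantics:

    #include <mpi.h>

    /* Allocate a window while hinting that node-local memory should be
     * cross-mapped for load/store.  The info key name is hypothetical. */
    void allocate_with_hint(MPI_Comm comm, MPI_Aint bytes,
                            double **mybase, MPI_Win *win)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "gimme_shared_memory", "true");
        MPI_Win_allocate(bytes, sizeof(double), info, comm, mybase, win);
        MPI_Info_free(&info);
        /* Needed so the error-code approach below is observable. */
        MPI_Win_set_errhandler(*win, MPI_ERRORS_RETURN);
    }

    /* Returns 1 if 'target' exposes direct access memory to the caller,
     * per the proposed extension of MPI_Win_shared_query. */
    int have_direct_access(MPI_Win win, int target, double **target_base)
    {
        MPI_Aint tsize;
        int      tdisp;
        int err = MPI_Win_shared_query(win, target, &tsize, &tdisp,
                                       target_base);
        /* A failure (or size 0 / NULL base, in the alternative proposal)
         * means: fall back to MPI_Put/MPI_Get for this target. */
        return (err == MPI_SUCCESS && *target_base != NULL);
    }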
>>> >
>>> >  -- Pavan
>>> >
>>> > On Oct 18, 2013, at 4:31 PM, Jeff Hammond wrote:
>>> >
>>> >> I believe that my assumptions about address translation are
>>> >> appropriately conservative given the wide portability goals of MPI. Assuming
>>> >> Cray Gemini or similar isn't reasonable.
>>> >>
>>> >> Exclusive lock isn't my primary interest but I'll pay the associated
>>> >> costs as necessary.
>>> >>
>>> >> As discussed with Brian, this is a hint via info. Implementations can
>>> >> skip it if they want. I merely want us to standardize the expanded use of
>>> >> shared_query that allows this to work.
>>> >>
>>> >> Jeff
>>> >>
>>> >> Sent from my iPhone
>>> >>
>>> >> On Oct 18, 2013, at 4:25 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>>> >>
>>> >>> This is only correct if you assume that the remote NIC can't translate
>>> >>> a displacement.
>>> >>>
>>> >>> Allowing all processes on the node direct access to the window buffer
>>> >>> will require us to perform memory barriers in window synchronizations, and
>>> >>> it will cause all lock operations that target the same node to block.  I
>>> >>> understand the value in providing this usage model, but what will be the
>>> >>> performance cost?
>>> >>>
>>> >>> ~Jim.
>>> >>>
>>> >>>
>>> >>> On Fri, Oct 18, 2013 at 4:50 PM, Jeff Hammond <jeff.science at gmail.com>
>>> >>> wrote:
>>> >>> It is impossible to do O(1) state with create unless you force the
>>> >>> remote side to do all the translation, which precludes RDMA. If you
>>> >>> want an RDMA implementation, create requires O(P) state. Allocate does
>>> >>> not require it.
>>> >>>
>>> >>> I believe all of this was thoroughly discussed when we proposed
>>> >>> allocate.
>>> >>>
>>> >>> Sent from my iPhone
>>> >>>
>>> >>> On Oct 18, 2013, at 3:16 PM, Jim Dinan <james.dinan at gmail.com> wrote:
>>> >>>
>>> >>>> Why is MPI_Win_create not scalable?  There are certainly
>>> >>>> implementations and use cases (e.g., not using different disp_units) that
>>> >>>> can avoid O(P) metadata per process.  MPI_Win_allocate is more likely to
>>> >>>> avoid that metadata, but it's not guaranteed.  It seems like an
>>> >>>> implementation could leverage the Win_allocate_shared/Win_create combo to
>>> >>>> achieve the same scaling result as MPI_Win_allocate.
>>> >>>>
>>> >>>> If we allow other window types to return pointers through
>>> >>>> win_shared_query, then we will have to perform memory barriers in all of the
>>> >>>> RMA synchronization routines all the time.
>>> >>>>
>>> >>>> ~Jim.
>>> >>>>
>>> >>>>
>>> >>>> On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond
>>> >>>> <jeff.science at gmail.com> wrote:
>>> >>>> Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE is
>>> >>>> not scalable, and that approach requires two windows instead of one.
>>> >>>>
>>> >>>> Brian, Pavan and Xin all seem to agree that this is straightforward
>>> >>>> to implement as an optional feature.  We just need to figure out how
>>> >>>> to extend the use of MPI_WIN_SHARED_QUERY to enable it.
>>> >>>>
>>> >>>> Jeff
>>> >>>>
>>> >>>> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <james.dinan at gmail.com>
>>> >>>> wrote:
>>> >>>>> Jeff,
>>> >>>>>
>>> >>>>> Sorry, I haven't read the whole thread closely, so please ignore me
>>> >>>>> if this is nonsense.  Can you get what you want by calling
>>> >>>>> MPI_Win_allocate_shared() to create an intranode window, and then
>>> >>>>> passing the buffer allocated by MPI_Win_allocate_shared() to
>>> >>>>> MPI_Win_create() to create an internode window?
>>> >>>>>
>>> >>>>> ~Jim.
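
In code, the combination Jim suggests might look roughly like the sketch
below (the double element type and the 'bytes' argument are arbitrary
choices; the second window is exactly the extra MPI_Win_create whose
metadata cost the rest of the thread debates):

    #include <mpi.h>

    void make_combo_windows(MPI_Aint bytes, double **mybase,
                            MPI_Comm *nodecomm,
                            MPI_Win *shm_win, MPI_Win *rma_win)
    {
        /* Group the ranks that can share memory. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, nodecomm);

        /* Intranode window: node-local ranks can obtain each other's
         * base pointers via MPI_Win_shared_query and use load/store. */
        MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                                *nodecomm, mybase, shm_win);

        /* Internode window over the same buffer: MPI_Put/MPI_Get/etc.
         * for ranks on other nodes. */
        MPI_Win_create(*mybase, bytes, sizeof(double), MPI_INFO_NULL,
                       MPI_COMM_WORLD, rma_win);
    }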
>>> >>>>>
>>> >>>>>
>>> >>>>> On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond
>>> >>>>> <jeff.science at gmail.com>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>> Pavan told me that (in MPICH) MPI_Win_allocate is way better than
>>> >>>>>> MPI_Win_create because the former allocates the shared memory
>>> >>>>>> business.  It was implied that the latter requires more work within
>>> >>>>>> the node. (I thought mmap could do the same magic on existing
>>> >>>>>> allocations, but that's not really the point here.)
>>> >>>>>>
>>> >>>>>> But within a node, what's even better than a window allocated with
>>> >>>>>> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
>>> >>>>>> since the latter permits load-store.  Then I wondered if it would be
>>> >>>>>> possible to have both (1) direct load-store access within a node and
>>> >>>>>> (2) scalable metadata for windows spanning many nodes.
>>> >>>>>>
>>> >>>>>> I can get (1) but not (2) by using MPI_Win_allocate_shared and then
>>> >>>>>> dropping a second window for the internode part on top of these
>>> >>>>>> allocations using MPI_Win_create.  Of course, I can get (2) but not
>>> >>>>>> (1) using MPI_Win_allocate.
>>> >>>>>>
>>> >>>>>> I propose that it be possible to get (1) and (2) by allowing
>>> >>>>>> MPI_Win_shared_query to return pointers to shared memory within a
>>> >>>>>> node even if the window has
>>> >>>>>> MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.  When the input
>>> >>>>>> argument rank to MPI_Win_shared_query corresponds to memory that is
>>> >>>>>> not accessible by load-store, the output arguments size and baseptr
>>> >>>>>> are 0 and NULL, respectively.
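
A sketch of the proposed MPI_Win_shared_query semantics on a window with
create flavor MPI_WIN_FLAVOR_ALLOCATE (to be clear, this is the proposal,
not MPI-3.0 behavior):

    #include <mpi.h>

    /* Returns 1 if 'rank' is load/store accessible from the caller.
     * Under the proposal, *rsize is the size of the accessible memory,
     * which need not equal the total window size at 'rank'. */
    int can_load_store(MPI_Win win, int rank,
                       double **rbase, MPI_Aint *rsize)
    {
        int rdisp;
        MPI_Win_shared_query(win, rank, rsize, &rdisp, rbase);
        return (*rbase != NULL);   /* NULL base and size 0: no direct access */
    }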
>>> >>>>>>
>>> >>>>>> The non-scalable use of this feature would be to loop over all
>>> >>>>>> ranks in the group associated with the window and test for
>>> >>>>>> baseptr != NULL, while the scalable use would presumably utilize
>>> >>>>>> MPI_Comm_split_type, MPI_Comm_group and MPI_Group_translate_ranks
>>> >>>>>> to determine the list of ranks corresponding to the node, hence the
>>> >>>>>> ones that might permit direct access.
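
The scalable pattern described above, in rough code (the communicator
name 'wincomm' is hypothetical and stands for whatever communicator the
window was created over):

    #include <mpi.h>
    #include <stdlib.h>

    /* Translate the node-local ranks into the window's group so that only
     * those ranks need to be probed with MPI_Win_shared_query. */
    int *node_ranks_in_window(MPI_Comm wincomm, int *nnode)
    {
        MPI_Comm  nodecomm;
        MPI_Group win_group, node_group;

        MPI_Comm_split_type(wincomm, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &nodecomm);
        MPI_Comm_group(wincomm, &win_group);
        MPI_Comm_group(nodecomm, &node_group);
        MPI_Group_size(node_group, nnode);

        int *nranks = malloc(*nnode * sizeof(int));
        int *wranks = malloc(*nnode * sizeof(int));
        for (int i = 0; i < *nnode; i++) nranks[i] = i;

        /* wranks[i] is the window-group rank of the i-th process on this
         * node; these are the only candidates for direct access. */
        MPI_Group_translate_ranks(node_group, *nnode, nranks,
                                  win_group, wranks);

        free(nranks);
        MPI_Group_free(&win_group);
        MPI_Group_free(&node_group);
        MPI_Comm_free(&nodecomm);
        return wranks;
    }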
>>> >>>>>>
>>> >>>>>> Comments are appreciated.
>>> >>>>>>
>>> >>>>>> Jeff
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> Jeff Hammond
>>> >>>>>> jeff.science at gmail.com
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Jeff Hammond
>>> >>>> jeff.science at gmail.com
>>> >>>
>>> >
>>> > --
>>> > Pavan Balaji
>>> > http://www.mcs.anl.gov/~balaji
>>> >
>>
>>
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com



-- 
Jeff Hammond
jeff.science at gmail.com


