<div dir="ltr">Yes, please go ahead.  It seems like we should schedule a WG meeting some time soon, to organize ourselves for December.<div><br></div><div> ~Jim.</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">

On Tue, Oct 22, 2013 at 11:22 AM, Jeff Hammond <span dir="ltr"><<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Are we ready to make a ticket for this?  Seems like we have mostly converged.<br>

<span class="HOEnZb"><font color="#888888"><br>

Jeff<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Fri, Oct 18, 2013 at 6:07 PM, Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>> wrote:<br>

> I'm okay with that but it has to be clear that this is only the size<br>

> of the memory accessible by load-store and not the actual size of the<br>

> window at that rank.<br>

><br>

> Jeff<br>

><br>

> On Fri, Oct 18, 2013 at 5:40 PM, Jim Dinan <<a href="mailto:james.dinan@gmail.com">james.dinan@gmail.com</a>> wrote:<br>

>> You can return a size of 0.<br>

>><br>

>> Jim.<br>

>><br>

>> On Oct 18, 2013 5:48 PM, "Jeff Hammond" <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>> wrote:<br>

>>><br>

>>> MPI_Win_shared_query always returns a valid base for ranks in<br>

>>> MPI_COMM_SELF, does it not?<br>

>>><br>

>>> I don't like the error code approach. Can't we make a magic value<br>

>>> mpi_fortran_sucks_null?<br>

>>><br>

>>> Sent from my iPhone<br>

>>><br>

>>> On Oct 18, 2013, at 4:40 PM, Pavan Balaji <<a href="mailto:balaji@mcs.anl.gov">balaji@mcs.anl.gov</a>> wrote:<br>

>>><br>

>>> ><br>

>>> > Ok, can I rephrase what you want as follows --<br>

>>> ><br>

>>> > MPI_WIN_ALLOCATE(info = gimme_shared_memory)<br>

>>> ><br>

>>> > This will return a window, where you *might* be able to do direct<br>

>>> > load/store to some of the remote process address spaces (let's call this<br>

>>> > "direct access memory").<br>

>>> ><br>

>>> > MPI_WIN_SHARED_QUERY will tell the user, through an appropriate error<br>

>>> > code, as to whether a given remote process gives you direct access memory.<br>

>>> ><br>

>>> > An MPI implementation is allowed to ignore the info argument, in which<br>

>>> > case, it will give an error for MPI_WIN_SHARED_QUERY for all target<br>

>>> > processes.<br>

>>> ><br>

>>> > Does that sound right?<br>

>>> ><br>

>>> > I guess the benefit of this compared to MPI_WIN_ALLOCATE_SHARED +<br>

>>> > MPI_WIN_CREATE is that the MPI implementation can better allocate memory.<br>

>>> > For example, it might create symmetric address space across nodes and use<br>

>>> > shared memory within each node.  This is particularly useful when the<br>

>>> > allocation size on all processes is the same.<br>

>>> ><br>

>>> >  -- Pavan<br>

>>> ><br>

>>> > On Oct 18, 2013, at 4:31 PM, Jeff Hammond wrote:<br>

>>> ><br>

>>> >> I believe that my assumptions about address translation are<br>

>>> >> appropriately conservative given the wide portability goals of MPI. Assuming<br>

>>> >> Cray Gemini or similar isn't reasonable.<br>

>>> >><br>

>>> >> Exclusive lock isn't my primary interest but I'll pay the associated<br>

>>> >> costs as necessary.<br>

>>> >><br>

>>> >> As discussed with Brian, this is a hint via info. Implementations can<br>

>>> >> skip it if they want. I merely want us to standardize the expanded use of<br>

>>> >> shared_query that allows this to work.<br>

>>> >><br>

>>> >> Jeff<br>

>>> >><br>

>>> >> Sent from my iPhone<br>

>>> >><br>

>>> >> On Oct 18, 2013, at 4:25 PM, Jim Dinan <<a href="mailto:james.dinan@gmail.com">james.dinan@gmail.com</a>> wrote:<br>

>>> >><br>

>>> >>> This is only correct if you assume that the remote NIC can't translate<br>

>>> >>> a displacement.<br>

>>> >>><br>

>>> >>> Allowing all processes on the node direct access to the window buffer<br>

>>> >>> will require us to perform memory barriers in window synchronizations, and<br>

>>> >>> it will cause all lock operations that target the same node to block.  I<br>

>>> >>> understand the value in providing this usage model, but what will be the<br>

>>> >>> performance cost?<br>

>>> >>><br>

>>> >>> ~Jim.<br>

>>> >>><br>

>>> >>><br>

>>> >>> On Fri, Oct 18, 2013 at 4:50 PM, Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>><br>

>>> >>> wrote:<br>

>>> >>> It is impossible to do O(1) state with create unless you force the<br>

>>> >>> remote side to do all the translation, and thus it precludes RDMA. If you<br>

>>> >>> want an RDMA impl, create requires O(P) state. Allocate does not require it.<br>
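To make the O(P) versus O(1) argument concrete, here is a small plain-C model (not MPI code) of how an origin might compute a remote address before issuing an RDMA operation. All struct and function names are hypothetical. It assumes that a create-style window lets every rank expose a different base address and disp_unit, and that an allocate-style window happened to be placed at the same virtual address with the same disp_unit on every rank, which an implementation may or may not manage to do. If the target side can translate a displacement itself, as noted elsewhere in the thread, the per-rank table is not needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Create-style window: buffers are user-supplied, so each target may
 * differ. The origin keeps one entry per rank, i.e. O(P) state. */
typedef struct {
    int        nranks;
    uintptr_t *base;       /* base[r]: remote base address of rank r */
    int       *disp_unit;  /* disp_unit[r]: displacement unit of rank r */
} create_win;

/* Allocate-style window under the symmetric-allocation assumption: one
 * base and one disp_unit describe every target, i.e. O(1) state. */
typedef struct {
    uintptr_t base;
    int       disp_unit;
} symmetric_win;

/* Origin-side translation that needs the per-rank table. */
static uintptr_t create_remote_addr(const create_win *w, int target,
                                    uint64_t disp)
{
    assert(target >= 0 && target < w->nranks);
    return w->base[target] + (uintptr_t)disp * (uintptr_t)w->disp_unit[target];
}

/* Origin-side translation that is independent of the target rank. */
static uintptr_t symmetric_remote_addr(const symmetric_win *w, int target,
                                       uint64_t disp)
{
    (void)target;  /* the address is the same for every target */
    return w->base + (uintptr_t)disp * (uintptr_t)w->disp_unit;
}

/* Bytes of translation state one origin holds for a create-style window. */
static size_t create_state_bytes(int nranks)
{
    return (size_t)nranks * (sizeof(uintptr_t) + sizeof(int));
}

/* Bytes of translation state one origin holds for a symmetric window. */
static size_t symmetric_state_bytes(void)
{
    return sizeof(symmetric_win);
}
```

In this model the create-style state grows linearly with the number of ranks, while the symmetric state stays constant regardless of how many ranks share the window.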

>>> >>><br>

>>> >>> I believe all of this was thoroughly discussed when we proposed<br>

>>> >>> allocate.<br>

>>> >>><br>

>>> >>> Sent from my iPhone<br>

>>> >>><br>

>>> >>> On Oct 18, 2013, at 3:16 PM, Jim Dinan <<a href="mailto:james.dinan@gmail.com">james.dinan@gmail.com</a>> wrote:<br>

>>> >>><br>

>>> >>>> Why is MPI_Win_create not scalable?  There are certainly<br>

>>> >>>> implementations and use cases (e.g. not using different disp_units) that can<br>

>>> >>>> avoid O(P) metadata per process.  It's more likely that MPI_Win_allocate can<br>

>>> >>>> avoid these in more cases, but it's not guaranteed.  It seems like an<br>

>>> >>>> implementation could leverage the Win_allocate_shared/Win_create combo to<br>

>>> >>>> achieve the same scaling result as MPI_Win_allocate.<br>

>>> >>>><br>

>>> >>>> If we allow other window types to return pointers through<br>

>>> >>>> win_shared_query, then we will have to perform memory barriers in all of the<br>

>>> >>>> RMA synchronization routines all the time.<br>

>>> >>>><br>

>>> >>>> ~Jim.<br>

>>> >>>><br>

>>> >>>><br>

>>> >>>> On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond<br>

>>> >>>> <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>> wrote:<br>

>>> >>>> Yes, as I said, that's all I can do right now.  But MPI_WIN_CREATE is<br>

>>> >>>> not scalable.  And it requires two windows instead of one.<br>

>>> >>>><br>

>>> >>>> Brian, Pavan and Xin all seem to agree that this is straightforward<br>

>>> >>>> to<br>

>>> >>>> implement as an optional feature.  We just need to figure out how to<br>

>>> >>>> extend the use of MPI_WIN_SHARED_QUERY to enable it.<br>

>>> >>>><br>

>>> >>>> Jeff<br>

>>> >>>><br>

>>> >>>> On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <<a href="mailto:james.dinan@gmail.com">james.dinan@gmail.com</a>><br>

>>> >>>> wrote:<br>

>>> >>>>> Jeff,<br>

>>> >>>>><br>

>>> >>>>> Sorry, I haven't read the whole thread closely, so please ignore me<br>

>>> >>>>> if this<br>

>>> >>>>> is nonsense.  Can you get what you want by doing<br>

>>> >>>>> MPI_Win_allocate_shared()<br>

>>> >>>>> to create an intranode window, and then pass the buffer allocated by<br>

>>> >>>>> MPI_Win_allocate_shared to MPI_Win_create() to create an internode<br>

>>> >>>>> window?<br>

>>> >>>>><br>

>>> >>>>> ~Jim.<br>

>>> >>>>><br>

>>> >>>>><br>

>>> >>>>> On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond<br>

>>> >>>>> <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>><br>

>>> >>>>> wrote:<br>

>>> >>>>>><br>

>>> >>>>>> Pavan told me that (in MPICH) MPI_Win_allocate is way better than<br>

>>> >>>>>> MPI_Win_create because the former allocates the shared memory<br>

>>> >>>>>> business.  It was implied that the latter requires more work within<br>

>>> >>>>>> the node. (I thought mmap could do the same magic on existing<br>

>>> >>>>>> allocations, but that's not really the point here.)<br>

>>> >>>>>><br>

>>> >>>>>> But within a node, what's even better than a window allocated with<br>

>>> >>>>>> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,<br>

>>> >>>>>> since the latter permits load-store.  Then I wondered if it would<br>

>>> >>>>>> be<br>

>>> >>>>>> possible to have both (1) direct load-store access within a node<br>

>>> >>>>>> and<br>

>>> >>>>>> (2) scalable metadata for windows spanning many nodes.<br>

>>> >>>>>><br>

>>> >>>>>> I can get (1) but not (2) by using MPI_Win_allocate_shared and then<br>

>>> >>>>>> dropping a second window for the internode part on top of these<br>

>>> >>>>>> using<br>

>>> >>>>>> MPI_Win_create.  Of course, I can get (2) but not (1) using<br>

>>> >>>>>> MPI_Win_allocate.<br>

>>> >>>>>><br>

>>> >>>>>> I propose that it be possible to get (1) and (2) by allowing<br>

>>> >>>>>> MPI_Win_shared_query to return pointers to shared memory within a<br>

>>> >>>>>> node<br>

>>> >>>>>> even if the window has<br>

>>> >>>>>> MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.<br>

>>> >>>>>> When the input argument rank to MPI_Win_shared_query corresponds to<br>

>>> >>>>>> memory that is not accessible by load-store, the out arguments size<br>

>>> >>>>>> and baseptr are 0 and NULL, respectively.<br>

>>> >>>>>><br>

>>> >>>>>> The non-scalable use of this feature would be to loop over all<br>

>>> >>>>>> ranks<br>

>>> >>>>>> in the group associated with the window and test for baseptr!=NULL,<br>

>>> >>>>>> while the scalable use would presumably utilize<br>

>>> >>>>>> MPI_Comm_split_type,<br>

>>> >>>>>> MPI_Comm_group and MPI_Group_translate_ranks to determine the list<br>

>>> >>>>>> of<br>

>>> >>>>>> ranks corresponding to the node, hence the ones that might permit<br>

>>> >>>>>> direct access.<br>
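A minimal sketch of the two usage patterns described above, written as a plain-C toy model rather than real MPI code. The names `toy_shared_query`, `node_of`, `fake_memory`, and `P` are hypothetical stand-ins: `toy_shared_query` mimics the proposed behavior (size 0 and a NULL base pointer for ranks that are not load-store accessible), and the caller-supplied list of node ranks plays the role of what MPI_Comm_split_type, MPI_Comm_group and MPI_Group_translate_ranks would provide in a real program.

```c
#include <assert.h>
#include <stddef.h>

#define P 8                      /* hypothetical number of ranks in the window */

/* Hypothetical node layout: ranks 0-3 on node 0, ranks 4-7 on node 1. */
static const int node_of[P] = {0, 0, 0, 0, 1, 1, 1, 1};

static char fake_memory[P][16];  /* stand-in for each rank's window memory */
static int  query_calls = 0;     /* counts how many queries were issued */

/* Toy model of the proposed query: yields a base pointer and size only
 * when `target` is on the same node as `me`; otherwise NULL and 0. */
static void toy_shared_query(int me, int target, size_t *size, void **baseptr)
{
    assert(me >= 0 && me < P && target >= 0 && target < P);
    query_calls++;
    if (node_of[me] == node_of[target]) {
        *size = sizeof fake_memory[target];
        *baseptr = fake_memory[target];
    } else {
        *size = 0;
        *baseptr = NULL;
    }
}

/* Non-scalable pattern: probe every rank in the window, O(P) queries. */
static int count_direct_all(int me)
{
    int n = 0;
    for (int r = 0; r < P; r++) {
        size_t size;
        void *base;
        toy_shared_query(me, r, &size, &base);
        if (base != NULL) n++;
    }
    return n;
}

/* Scalable pattern: probe only the ranks already known to share the node. */
static int count_direct_node(int me, const int *node_ranks, int n_node)
{
    int n = 0;
    for (int i = 0; i < n_node; i++) {
        size_t size;
        void *base;
        toy_shared_query(me, node_ranks[i], &size, &base);
        if (base != NULL) n++;
    }
    return n;
}
```

Both functions find the same set of directly accessible ranks. The difference is only in how many queries are issued, which is what matters once the window spans many nodes.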

>>> >>>>>><br>

>>> >>>>>> Comments are appreciated.<br>

>>> >>>>>><br>

>>> >>>>>> Jeff<br>

>>> >>>>>><br>

>>> >>>>>> --<br>

>>> >>>>>> Jeff Hammond<br>

>>> >>>>>> <a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>

>>> >>>>>> _______________________________________________<br>

>>> >>>>>> mpiwg-rma mailing list<br>

>>> >>>>>> <a href="mailto:mpiwg-rma@lists.mpi-forum.org">mpiwg-rma@lists.mpi-forum.org</a><br>

>>> >>>>>> <a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma" target="_blank">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma</a><br>

>>> >>>>><br>

>>> >>>>><br>

>>> >>>>><br>


>>> >>>><br>

>>> >>>><br>

>>> >>>><br>

>>> >>>> --<br>

>>> >>>> Jeff Hammond<br>

>>> >>>> <a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>


>>> >>><br>



>>> ><br>

>>> > --<br>

>>> > Pavan Balaji<br>

>>> > <a href="http://www.mcs.anl.gov/~balaji" target="_blank">http://www.mcs.anl.gov/~balaji</a><br>

>>> ><br>



>><br>

>><br>


><br>

><br>

><br>

> --<br>

> Jeff Hammond<br>

> <a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>

<br>

<br>

<br>

--<br>

Jeff Hammond<br>

<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>


</div></div></blockquote></div><br></div>