<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div>It is impossible to do O(1) state with Win_create unless you force the remote side to do all of the address translation, which precludes RDMA. If you want an RDMA implementation, Win_create requires O(P) state; Win_allocate does not.</div>
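<div><br></div><div>To make the O(P) argument concrete, here is an illustrative sketch (not taken from any particular MPI implementation; the struct and field names are invented) of the per-target metadata an origin typically has to cache in order to issue RDMA directly against a Win_create window:</div><div><pre>
#include &lt;stdint.h&gt;

/* Illustrative only: with MPI_Win_create, every target may have a
 * different base address, disp_unit, and memory-registration key, so an
 * origin that issues RDMA itself caches one entry per target -- O(P)
 * state per process.  With MPI_Win_allocate the library controls the
 * allocation and can often use a symmetric base address plus a single
 * registration, avoiding this table. */
typedef struct {
    void     *base;       /* target's window base address */
    int       disp_unit;  /* target's displacement unit   */
    uint64_t  rkey;       /* hypothetical RDMA remote key */
} win_target_info_t;      /* one entry per target => O(P) */
</pre></div>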
<div><br></div><div>I believe all of this was thoroughly discussed when we proposed allocate. <br><br>Sent from my iPhone</div><div><br>On Oct 18, 2013, at 3:16 PM, Jim Dinan <<a href="mailto:james.dinan@gmail.com">james.dinan@gmail.com</a>> wrote:<br>
<br></div><blockquote type="cite"><div><div dir="ltr">Why is MPI_Win_create not scalable? There are certainly implementations and use cases (e.g. all processes using the same disp_unit) that can avoid O(P) metadata per process. MPI_Win_allocate is more likely to avoid that metadata, but it's not guaranteed to. It seems like an implementation could leverage the Win_allocate_shared/Win_create combo to achieve the same scaling result as MPI_Win_allocate.<div>
<div><br></div><div>If we allow other window types to return pointers through win_shared_query, then we will have to perform memory barriers in all of the RMA synchronization routines all the time.</div><div><br></div><div>
~Jim.</div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Oct 18, 2013 at 3:39 PM, Jeff Hammond <span dir="ltr"><<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Yes, as I said, that's all I can do right now. But MPI_WIN_CREATE is<br>
not scalable. And it requires two windows instead of one.<br>
<br>
Brian, Pavan and Xin all seem to agree that this is straightforward to<br>
implement as an optional feature. We just need to figure out how to<br>
extend the use of MPI_WIN_SHARED_QUERY to enable it.<br>
<span class="HOEnZb"><font color="#888888"><br>
Jeff<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On Fri, Oct 18, 2013 at 2:35 PM, Jim Dinan <<a href="mailto:james.dinan@gmail.com">james.dinan@gmail.com</a>> wrote:<br>
> Jeff,<br>
><br>
> Sorry, I haven't read the whole thread closely, so please ignore me if this<br>
> is nonsense. Can you get what you want by using MPI_Win_allocate_shared()<br>
> to create an intranode window, and then passing the buffer it allocates<br>
> to MPI_Win_create() to create an internode window?<br>
><br>
> ~Jim.<br>
><br>
><br>
> On Sat, Oct 12, 2013 at 3:49 PM, Jeff Hammond <<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>><br>
> wrote:<br>
>><br>
>> Pavan told me that (in MPICH) MPI_Win_allocate is way better than<br>
>> MPI_Win_create because the former allocates the shared memory<br>
>> business. It was implied that the latter requires more work within<br>
>> the node. (I thought mmap could do the same magic on existing<br>
>> allocations, but that's not really the point here.)<br>
>><br>
>> But within a node, what's even better than a window allocated with<br>
>> MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,<br>
>> since the latter permits load-store. Then I wondered if it would be<br>
>> possible to have both (1) direct load-store access within a node and<br>
>> (2) scalable metadata for windows spanning many nodes.<br>
>><br>
>> I can get (1) but not (2) by using MPI_Win_allocate_shared and then<br>
>> dropping a second window for the internode part on top of that memory using<br>
>> MPI_Win_create. Of course, I can get (2) but not (1) using<br>
>> MPI_Win_allocate.<br>
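<br>
A minimal sketch of that combination, assuming the window is to span MPI_COMM_WORLD; the size and variable names are illustrative:<br>
<pre>
#include &lt;mpi.h&gt;

int main(int argc, char **argv)
{
    MPI_Comm nodecomm;
    MPI_Win  shm_win, rma_win;
    void    *baseptr;
    MPI_Aint bytes = 1048576;   /* 1 MiB, illustrative */

    MPI_Init(&argc, &argv);

    /* ranks that can share memory with this one (i.e., the node) */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    /* (1) intranode window: direct load/store among ranks on the node */
    MPI_Win_allocate_shared(bytes, 1, MPI_INFO_NULL, nodecomm,
                            &baseptr, &shm_win);

    /* (2) internode window over the same buffer: Put/Get/Accumulate
     *     from any rank in MPI_COMM_WORLD */
    MPI_Win_create(baseptr, bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                   &rma_win);

    /* ... RMA epochs on rma_win; load/store through pointers obtained
     *     from MPI_Win_shared_query on shm_win ... */

    MPI_Win_free(&rma_win);
    MPI_Win_free(&shm_win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}
</pre>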
>><br>
>> I propose that it be possible to get (1) and (2) by allowing<br>
>> MPI_Win_shared_query to return pointers to shared memory within a node<br>
>> even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.<br>
>> When the input argument rank to MPI_Win_shared_query corresponds to<br>
>> memory that is not accessible by load-store, the out arguments size<br>
>> and baseptr are 0 and NULL, respectively.<br>
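<br>
A sketch of how the proposed behavior might look from the caller's side; 'win' is assumed to be a window with create flavor MPI_WIN_FLAVOR_ALLOCATE and 'target_rank' any rank in its group. This shows the proposed semantics, not what MPI-3.0 currently allows:<br>
<pre>
MPI_Aint  rsize;
int       rdisp;
void     *rptr;

/* Hypothetical: under the proposal this also works on an allocated
 * (non-shared-flavor) window. */
MPI_Win_shared_query(win, target_rank, &rsize, &rdisp, &rptr);

if (rptr != NULL) {
    /* target_rank is load/store accessible (same node): use the pointer */
} else {
    /* rsize == 0 and rptr == NULL: fall back to MPI_Put / MPI_Get */
}
</pre>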
>><br>
>> The non-scalable use of this feature would be to loop over all ranks<br>
>> in the group associated with the window and test for baseptr!=NULL,<br>
>> while the scalable use would presumably utilize MPI_Comm_split_type,<br>
>> MPI_Comm_group and MPI_Group_translate_ranks to determine the list of<br>
>> ranks corresponding to the node, hence the ones that might permit<br>
>> direct access.<br>
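<br>
A sketch of that scalable pattern, reusing 'win' from the sketch above and assuming the proposed extension; it also assumes &lt;stdlib.h&gt; for malloc and omits error checks and frees:<br>
<pre>
MPI_Comm  nodecomm;
MPI_Group wingroup, nodegroup;
int       nlocal;

MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                    MPI_INFO_NULL, &nodecomm);
MPI_Comm_size(nodecomm, &nlocal);
MPI_Comm_group(nodecomm, &nodegroup);
MPI_Win_get_group(win, &wingroup);

int *node_ranks = malloc(nlocal * sizeof(int));  /* 0..nlocal-1 in nodecomm */
int *win_ranks  = malloc(nlocal * sizeof(int));  /* same ranks in win's group */
for (int i = 0; i &lt; nlocal; i++)
    node_ranks[i] = i;

/* ranks not present in the window's group come back as MPI_UNDEFINED */
MPI_Group_translate_ranks(nodegroup, nlocal, node_ranks,
                          wingroup, win_ranks);

/* Only win_ranks[0..nlocal-1] can possibly return a non-NULL baseptr
 * from MPI_Win_shared_query, so those are the only ranks to query. */
</pre>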
>><br>
>> Comments are appreciated.<br>
>><br>
>> Jeff<br>
>><br>
>> --<br>
>> Jeff Hammond<br>
>> <a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>
><br>
><br>
><br>
<br>
<br>
<br>
--<br>
Jeff Hammond<br>
<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>
</div></div></blockquote></div><br></div>
</div></blockquote><blockquote type="cite"><div><span>_______________________________________________</span><br><span>mpiwg-rma mailing list</span><br><span><a href="mailto:mpiwg-rma@lists.mpi-forum.org">mpiwg-rma@lists.mpi-forum.org</a></span><br>
<span><a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma</a></span></div></blockquote></body></html>