Jeff,

Just to add more information to the O(N) argument: on InfiniBand, I implemented dynamic windows so that registration information is exchanged on demand. The information is therefore maintained only for the processes you actually communicate with.
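A rough sketch of what such on-demand exchange can look like (not the actual implementation; fetch_registration_from_peer is a hypothetical stand-in for whatever out-of-band exchange the transport provides):

    /* Registration info is cached per peer and fetched only the first
     * time a peer is targeted, so memory scales with the number of
     * peers actually used rather than with the communicator size. */
    #include <stdint.h>
    #include <stdlib.h>

    typedef struct peer_reg {
        int              rank;
        uint32_t         rkey;        /* remote registration key */
        uint64_t         remote_base; /* peer's exposed base address */
        struct peer_reg *next;
    } peer_reg_t;

    /* hypothetical out-of-band exchange with one peer */
    extern void fetch_registration_from_peer(int rank, uint32_t *rkey,
                                             uint64_t *base);

    static peer_reg_t *reg_cache; /* one entry per peer actually touched */

    const peer_reg_t *get_peer_registration(int rank)
    {
        for (peer_reg_t *p = reg_cache; p != NULL; p = p->next)
            if (p->rank == rank)
                return p;
        /* first RMA operation targeting this peer: exchange info now */
        peer_reg_t *p = malloc(sizeof(*p));
        p->rank = rank;
        fetch_registration_from_peer(rank, &p->rkey, &p->remote_base);
        p->next = reg_cache;
        reg_cache = p;
        return p;
    }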
Sreeram Potluri

On Fri, Sep 23, 2011 at 10:19 AM, Jeff Hammond <jhammond@alcf.anl.gov> wrote:
This is one case where I do not care about progress :-) I just want
to avoid deadlock. I'd be happy to enqueue all window creation calls
and have a single collective that would let me fire all of them at
once. I get that this is not really MPI style, but no one has
convinced me that there is a portable, scalable way to solve this
problem.

I can solve the O(N) metadata problem on BG; it just requires more
work. Almasi has already done this; it just isn't in MPI. The
motivation for putting windows on small groups instead of world is
that small groups are sufficient. Why should I have to allocate a bloody
window on world just to get around the blocking, collective nature of
window creation and the potential for deadlock that results?

Jeff
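Something like the following illustrates the usage being asked for (purely hypothetical: MPI has no MPI_Win_icreate; the name and signature are made up only to show the start-everything-then-complete-together pattern):

    #include <mpi.h>

    void create_overlapping_windows(void *bufs[], MPI_Aint sizes[],
                                    MPI_Comm comms[], MPI_Win wins[],
                                    MPI_Request reqs[], int nwin)
    {
        for (int i = 0; i < nwin; i++) {
            /* hypothetical nonblocking variant of MPI_Win_create */
            MPI_Win_icreate(bufs[i], sizes[i], 1, MPI_INFO_NULL,
                            comms[i], &wins[i], &reqs[i]);
        }
        /* complete all creations at once; no global ordering of the
         * (possibly overlapping) groups is required */
        MPI_Waitall(nwin, reqs, MPI_STATUSES_IGNORE);
    }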
On Fri, Sep 23, 2011 at 1:27 AM, Barrett, Brian W <bwbarre@sandia.gov> wrote:
> In the Portals 4 implementation of dynamic windows, if the communicator is MPI_COMM_WORLD (or has the same group as MPI_COMM_WORLD), there is no O(N) metadata for dynamic windows when the hardware supports a rational memory registration strategy (and we have NIC designs where this is the case). There is a single list entry and memory descriptor on each process, which shares a pre-defined portal table index. Over other networks there may be some scaling of resources, but there are networks where that is not the case.
>
> There will be some scaling of resources with the size of the group as sub-communicators are created, due to the need to map ranks to endpoints; this is a problem with subgroups in general, not with windows or communicators specifically.
>
> I think adding non-blocking window creation is a bad idea; there are many synchronization points in creating a window, and I'm not happy with the concept of sticking that in a thread to meet the progress rules.
>
> Brian
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
> ________________________________________
> From: mpi3-rma-bounces@lists.mpi-forum.org [mpi3-rma-bounces@lists.mpi-forum.org] on behalf of Jeff Hammond [jhammond@alcf.anl.gov]
> Sent: Thursday, September 22, 2011 4:08 PM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] nonblocking MPI_Win_create etc.?
>
> The reason to put windows on subgroups is to avoid the O(N) metadata
> in the window associated with registered memory. For example, on BGP
> a window has an O(N) allocation for DCMF memregions. In the code my
> friend develops, N=300000 on comm_world but N<200 on a subgroup. He
> is at the limit of available memory, which is what motivated the use
> case for subgroup windows in the first place.
>
> I do not see how one can avoid O(N) metadata with
> MPI_Win_create_dynamic on comm_world in the general case, unless one
> completely abandons RDMA. How exactly does registered memory become
> visible when the user calls MPI_Win_attach?
>
> Jeff
>
> On Thu, Sep 22, 2011 at 4:58 PM, Rajeev Thakur <thakur@mcs.anl.gov> wrote:
>> In the new RMA, he could just call MPI_Win_create_dynamic once on comm_world and then locally attach memory to it using MPI_Win_attach. (And avoid using fence synchronization.)
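A minimal sketch of that pattern with the new RMA interface (buffer name and size are made up for illustration; targets would still need the attached address, e.g. via MPI_Get_address and an explicit exchange):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* one collective creation, once, on comm_world */
        MPI_Win win;
        MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* later, purely locally: expose a buffer through the window;
         * no collective call over any subgroup is needed */
        MPI_Aint size = 1024 * sizeof(double);
        double *buf = malloc(size);
        MPI_Win_attach(win, buf, size);

        /* ... RMA epochs here (e.g. passive-target MPI_Win_lock),
         * avoiding fence synchronization as suggested ... */

        MPI_Win_detach(win, buf);
        free(buf);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }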
>>
>> Rajeev
>>
>> On Sep 22, 2011, at 4:25 PM, Jeff Hammond wrote:
>>
>>> I work with someone who has a use case for nonblocking window creation
>>> because he can get into a deadlock situation unless he does a lot of
>>> bookkeeping. He's creating windows on subgroups of world that can
>>> (will) overlap. In order to prevent deadlock, he will have to do a
>>> global collective and figure out how to order all of the window
>>> creation calls so that they do not deadlock; or, in the case where that
>>> requires solving an NP-hard problem (it smells like the scheduling
>>> problem to me) or requires too much storage to be practical (he works
>>> at Juelich and regularly runs on 72 racks in VN mode), he will have to
>>> serialize window creation globally.
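An illustration of the hazard (hypothetical ranks P, Q, R and pairwise subcommunicators commPQ, commQR, commRP, each containing two of the three ranks): because window creation is collective and blocking, processes that disagree on the creation order can form a wait cycle.

    #include <mpi.h>

    void deadlock_prone(int rank, int P, int Q, int R,
                        MPI_Comm commPQ, MPI_Comm commQR, MPI_Comm commRP)
    {
        static char buf1[4096], buf2[4096];
        MPI_Win w1, w2;

        if (rank == P) {        /* P waits for Q in commPQ ...           */
            MPI_Win_create(buf1, sizeof(buf1), 1, MPI_INFO_NULL, commPQ, &w1);
            MPI_Win_create(buf2, sizeof(buf2), 1, MPI_INFO_NULL, commRP, &w2);
        } else if (rank == Q) { /* ... while Q waits for R in commQR ... */
            MPI_Win_create(buf1, sizeof(buf1), 1, MPI_INFO_NULL, commQR, &w1);
            MPI_Win_create(buf2, sizeof(buf2), 1, MPI_INFO_NULL, commPQ, &w2);
        } else if (rank == R) { /* ... and R waits for P in commRP.      */
            MPI_Win_create(buf1, sizeof(buf1), 1, MPI_INFO_NULL, commRP, &w1);
            MPI_Win_create(buf2, sizeof(buf2), 1, MPI_INFO_NULL, commQR, &w2);
        }
    }

A consistent global ordering of the creations removes the cycle, which is exactly the bookkeeping described above.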
>>>
>>> Nonblocking window creation and a waitall would solve this problem.
>>>
>>> Thoughts? I wonder whether the semantics of nonblocking collectives -
>>> which do not have tags - are even sufficient in the general case.
>>>
>>> Jeff
>>>
--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond@alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/index.php/User:Jhammond

_______________________________________________
mpi3-rma mailing list
mpi3-rma@lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma