Jeff, <div><br></div><div>Just to add more info on the O(N) argument: on IB, I implemented dynamic windows and exchange registration information on demand, so the information is maintained only for the processes you actually communicate with. </div>
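<div><br></div><div>A minimal Python sketch of that on-demand pattern, for illustration only and not the actual IB code: <code>RegistrationCache</code>, <code>fetch_registration</code>, and <code>fake_fetch</code> are hypothetical names, and the (address, key) pair is made up. The point is that the table grows with the number of peers actually contacted, not with the size of the communicator.</div>
<pre>
```python
class RegistrationCache:
    """Per-process cache of remote registration info, filled lazily."""

    def __init__(self, fetch_registration):
        # fetch_registration(rank) stands in for the network exchange
        # with the target process; it is a hypothetical callback.
        self._fetch = fetch_registration
        self._entries = {}

    def lookup(self, rank):
        # Exchange registration info only on first communication with `rank`;
        # later accesses reuse the cached entry.
        if rank not in self._entries:
            self._entries[rank] = self._fetch(rank)
        return self._entries[rank]

    def __len__(self):
        return len(self._entries)


def fake_fetch(rank):
    # Placeholder for a real exchange: returns a made-up (address, key) pair.
    return (0x1000 + rank, rank * 7)


cache = RegistrationCache(fake_fetch)
for target in (3, 42, 3, 42, 7):
    cache.lookup(target)
print(len(cache))
```
</pre>
<div>Here five lookups touch three distinct ranks, so the cache holds three entries regardless of how large the group of the window is.</div>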

<div><br></div><div>Sreeram Potluri<br><br><div class="gmail_quote">On Fri, Sep 23, 2011 at 10:19 AM, Jeff Hammond <span dir="ltr">&lt;<a href="mailto:jhammond@alcf.anl.gov">jhammond@alcf.anl.gov</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">This is one case where I do not care about progress :-)  I just want<br>

to avoid deadlock.  I'd be happy to enqueue all window creation calls<br>

and have a single collective that would let me fire all of them at<br>

once.  I get that this is not really MPI style, but no one has<br>

convinced me that there is a portable, scalable way to solve this<br>

problem.<br>

<br>

I can solve the O(N) metadata problem on BG, it just requires more<br>

work.  Almasi has already done this, it just isn't in MPI.  The<br>

motivation for putting windows on small groups instead of world is<br>

that this is sufficient.  Why should I have to allocate a bloody<br>

window on world just to get around the blocking collective nature of<br>

window creation and the deadlock-ability that results?<br>

<font color="#888888"><br>

Jeff<br>

</font><div><div></div><div class="h5"><br>

On Fri, Sep 23, 2011 at 1:27 AM, Barrett, Brian W &lt;<a href="mailto:bwbarre@sandia.gov">bwbarre@sandia.gov</a>&gt; wrote:<br>

> In the Portals 4 implementation of dynamic windows, if the comm is MPI_COMM_WORLD (or has the same group as MPI_COMM_WORLD), there is no O(N) metadata with dynamic when the hardware supports a rational memory registration strategy (and we have NIC designs where this is the case).  There is a single list entry and memory descriptor on each process, which shares a pre-defined portal table index.  Over other networks, there may be some scaling of resources, but there are networks where that is not the case.<br>


><br>

> There will be some scaling of resources with size of group as sub-communicators are created, due to the need to map ranks to endpoints; this is a problem with subgroups in general and not windows or communicators specifically.<br>


><br>

> I think adding non-blocking window create is a bad idea; there are many synchronization points in creating a window, and I'm not happy with the concept of sticking that in a thread to meet the progress rules.<br>


><br>

> Brian<br>

><br>

> --<br>

>  Brian W. Barrett<br>

>  Scalable System Software Group<br>

>  Sandia National Laboratories<br>

> ________________________________________<br>

> From: <a href="mailto:mpi3-rma-bounces@lists.mpi-forum.org">mpi3-rma-bounces@lists.mpi-forum.org</a> [<a href="mailto:mpi3-rma-bounces@lists.mpi-forum.org">mpi3-rma-bounces@lists.mpi-forum.org</a>] on behalf of Jeff Hammond [<a href="mailto:jhammond@alcf.anl.gov">jhammond@alcf.anl.gov</a>]<br>


> Sent: Thursday, September 22, 2011 4:08 PM<br>

> To: MPI 3.0 Remote Memory Access working group<br>

> Subject: Re: [Mpi3-rma] nonblocking MPI_Win_create etc.?<br>

><br>

> The reason to put windows on subgroups is to avoid the O(N) metadata<br>

> in the window associated with registered memory.  For example, on BGP<br>

> a window has an O(N) allocation for DCMF memregions.  In the code my<br>

> friend develops, N=300000 on comm_world but N&lt;200 on a subgroup.  He<br>

> is at the limit of available memory, which is what motivated the use<br>

> case for subgroup windows in the first place.<br>

><br>

> I do not see how one can avoid O(N) metadata with<br>

> MPI_Win_create_dynamic on comm_world in the general case, unless one<br>

> completely abandons RDMA.  How exactly does registered memory become<br>

> visible when the user calls MPI_Win_attach?<br>

><br>

> Jeff<br>

><br>

> On Thu, Sep 22, 2011 at 4:58 PM, Rajeev Thakur &lt;<a href="mailto:thakur@mcs.anl.gov">thakur@mcs.anl.gov</a>&gt; wrote:<br>

>> In the new RMA, he could just call MPI_Win_create_dynamic once on comm_world and then locally attach memory to it using MPI_Win_attach. (And avoid using fence synchronization.)<br>

>><br>

>> Rajeev<br>

>><br>

>> On Sep 22, 2011, at 4:25 PM, Jeff Hammond wrote:<br>

>><br>

>>> I work with someone who has a use case for nonblocking window creation<br>

>>> because he can get into a deadlock situation unless he does a lot of<br>

>>> bookkeeping.  He's creating windows on subgroups of world that can<br>

>>> (will) overlap.  In order to prevent deadlock, he will have to do a<br>

>>> global collective and figure out how to order all of the window<br>

>>> creation calls so that they do not deadlock, or in the case where that<br>

>>> requires solving an NP-hard problem (it smells like the scheduling<br>

>>> problem to me) or requires too much storage to be practical (he works<br>

>>> at Juelich and regularly runs on 72 racks in VN mode), he will have to<br>

>>> serialize window creation globally.<br>

>>><br>

>>> Nonblocking window creation and a waitall solves this problem.<br>

>>><br>

>>> Thoughts?  I wonder if the semantics of nonblocking collectives -<br>

>>> which do not have tags - are even sufficient in the general case.<br>

>>><br>

>>> Jeff<br>

>>><br>

>>> --<br>

>>> Jeff Hammond<br>

>>> Argonne Leadership Computing Facility<br>

>>> University of Chicago Computation Institute<br>

>>> <a href="mailto:jhammond@alcf.anl.gov">jhammond@alcf.anl.gov</a> / <a href="tel:%28630%29%20252-5381" value="+16302525381">(630) 252-5381</a><br>

>>> <a href="http://www.linkedin.com/in/jeffhammond" target="_blank">http://www.linkedin.com/in/jeffhammond</a><br>

>>> <a href="https://wiki.alcf.anl.gov/index.php/User:Jhammond" target="_blank">https://wiki.alcf.anl.gov/index.php/User:Jhammond</a><br>

>>> _______________________________________________<br>

>>> mpi3-rma mailing list<br>

>>> <a href="mailto:mpi3-rma@lists.mpi-forum.org">mpi3-rma@lists.mpi-forum.org</a><br>

>>> <a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma" target="_blank">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma</a><br>


<br>

<br>

<br>

</div></div>--<br>

<div><div></div><div class="h5">Jeff Hammond<br>


</div></div></blockquote></div><br></div>