[Mpi3-rma] nonblocking MPI_Win_create etc.?
Jeff Hammond
jhammond at alcf.anl.gov
Fri Sep 23 02:19:53 CDT 2011
This is one case where I do not care about progress :-) I just want
to avoid deadlock. I'd be happy to enqueue all window creation calls
and have a single collective that would let me fire all of them at
once. I get that this is not really MPI style, but no one has
convinced me that there is a portable, scalable way to solve this
problem.
I can solve the O(N) metadata problem on BG, it just requires more
work. Almasi has already done this, it just isn't in MPI. The
motivation for putting windows on small groups instead of world is
that this is sufficient. Why should I have to allocate a bloody
window on world just to get around the blocking collective nature of
window creation and the deadlock-ability that results?
Jeff
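
The deadlock described in this thread can be illustrated with a small simulation. This is a minimal sketch, not MPI code: it assumes each rank issues blocking window-creation collectives in a fixed order, and that a collective on a group completes only when every member of that group has reached it as its next call. The names `simulate`, `groups`, `bad_order`, and `good_order` are hypothetical, chosen only for illustration.

```python
from collections import deque


def simulate(call_order, groups):
    """Model blocking collective window creation on overlapping groups.

    call_order: rank -> list of group names, in the order that rank calls
                the (blocking) create.
    groups:     group name -> set of member ranks.
    Returns (completed, deadlocked): the group names in completion order,
    and whether any rank is left stuck with calls it can never finish.
    """
    pending = {rank: deque(calls) for rank, calls in call_order.items()}
    completed = []
    while True:
        progressed = False
        for name in sorted(groups):
            members = groups[name]
            # A collective completes only if it is the next call on every member.
            if all(pending[r] and pending[r][0] == name for r in members):
                for r in members:
                    pending[r].popleft()
                completed.append(name)
                progressed = True
        if not progressed:
            break
    deadlocked = any(pending.values())
    return completed, deadlocked


# Three ranks, three pairwise-overlapping subgroups (illustrative values).
groups = {"A": {0, 1}, "B": {1, 2}, "C": {0, 2}}

# Each rank picks its own order: every rank waits on a partner that is
# blocked in a different collective, so nothing can complete.
bad_order = {0: ["A", "C"], 1: ["B", "A"], 2: ["C", "B"]}

# All ranks follow one globally agreed order (alphabetical by group name).
good_order = {0: ["A", "C"], 1: ["A", "B"], 2: ["B", "C"]}

if __name__ == "__main__":
    print(simulate(bad_order, groups))
    print(simulate(good_order, groups))
```

With these three pairwise-overlapping groups, the inconsistent order stalls before any window is created, while the globally agreed order lets every creation complete. Establishing such an order across all ranks is the global bookkeeping step that the original request hopes to avoid.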
On Fri, Sep 23, 2011 at 1:27 AM, Barrett, Brian W <bwbarre at sandia.gov> wrote:
> In the Portals 4 implementation of dynamic windows, if the comm is MPI_COMM_WORLD (or has the same group as MPI_COMM_WORLD), there is no O(N) metadata with dynamic when the hardware supports a rational memory registration strategy (and we have NIC designs where this is the case). There is a single list entry and memory descriptor on each process, which shares a pre-defined portal table index. Over other networks, there may be some scaling of resources, but there are networks where that is not the case.
>
> There will be some scaling of resources with size of group as sub-communicators are created, due to the need to map ranks to endpoints; this is a problem with subgroups in general and not windows or communicators specifically.
>
> I think adding non-blocking window create is a bad idea; there are many synchronization points in creating a window, and I'm not happy with the concept of sticking that in a thread to meet the progress rules.
>
> Brian
>
> --
> Brian W. Barrett
> Scalable System Software Group
> Sandia National Laboratories
> ________________________________________
> From: mpi3-rma-bounces at lists.mpi-forum.org [mpi3-rma-bounces at lists.mpi-forum.org] on behalf of Jeff Hammond [jhammond at alcf.anl.gov]
> Sent: Thursday, September 22, 2011 4:08 PM
> To: MPI 3.0 Remote Memory Access working group
> Subject: Re: [Mpi3-rma] nonblocking MPI_Win_create etc.?
>
> The reason to put windows on subgroups is to avoid the O(N) metadata
> in the window associated with registered memory. For example, on BGP
> a window has an O(N) allocation for DCMF memregions. In the code my
> friend develops, N=300000 on comm_world but N<200 on a subgroup. He
> is at the limit of available memory, which is what motivated the use
> case for subgroup windows in the first place.
>
> I do not see how one can avoid O(N) metadata with
> MPI_Win_create_dynamic on comm_world in the general case, unless one
> completely abandons RDMA. How exactly does registered memory become
> visible when the user calls MPI_Win_attach?
>
> Jeff
>
> On Thu, Sep 22, 2011 at 4:58 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
>> In the new RMA, he could just call MPI_Win_create_dynamic once on comm_world and then locally attach memory to it using MPI_Win_attach. (And avoid using fence synchronization.)
>>
>> Rajeev
>>
>> On Sep 22, 2011, at 4:25 PM, Jeff Hammond wrote:
>>
>>> I work with someone who has a use case for nonblocking window creation
>>> because he can get into a deadlock situation unless he does a lot of
>>> bookkeeping. He's creating windows on subgroups of world that can
>>> (will) overlap. In order to prevent deadlock, he will have to do a
>>> global collective and figure out how to order all of the window
>>> creation calls so that they do not deadlock, or in the case where that
>>> requires solving an NP-hard problem (it smells like the scheduling
>>> problem to me) or requires too much storage to be practical (he works
>>> at Juelich and regularly runs on 72 racks in VN mode), he will have to
>>> serialize window creation globally.
>>>
>>> Nonblocking window creation and a waitall would solve this problem.
>>>
>>> Thoughts? I wonder if the semantics of nonblocking collectives -
>>> which do not have tags - are even sufficient in the general case.
>>>
>>> Jeff
>>>
>>> --
>>> Jeff Hammond
>>> Argonne Leadership Computing Facility
>>> University of Chicago Computation Institute
>>> jhammond at alcf.anl.gov / (630) 252-5381
>>> http://www.linkedin.com/in/jeffhammond
>>> https://wiki.alcf.anl.gov/index.php/User:Jhammond
>>> _______________________________________________
>>> mpi3-rma mailing list
>>> mpi3-rma at lists.mpi-forum.org
>>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/index.php/User:Jhammond