[Mpi3-rma] nonblocking MPI_Win_create etc.?

Sreeram Potluri potluri at cse.ohio-state.edu
Fri Sep 23 04:15:14 CDT 2011


Jeff,

Just to add more info for the O(N) argument: on IB, I implemented dynamic
windows and exchange registration information on demand, so the
information is only maintained for the processes you actually communicate
with.
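
Roughly, the flow looks like this (a pseudo-C sketch; cache_lookup,
cache_insert, and exchange_rkey_with stand in for our IB-specific
internals and are not real MPI calls):

    /* Fetch a target's registration info the first time we talk to it;
     * afterwards it comes from a per-window cache, so metadata exists
     * only for the targets this process actually communicates with. */
    remote_region_t *get_remote_region(win_t *win, int target_rank)
    {
        remote_region_t *reg = cache_lookup(win->reg_cache, target_rank);
        if (reg == NULL) {
            /* On-demand exchange over a control channel: base address,
             * length, and rkey of the target's attached memory. */
            reg = exchange_rkey_with(win, target_rank);
            cache_insert(win->reg_cache, target_rank, reg);
        }
        return reg;
    }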

Sreeram Potluri

On Fri, Sep 23, 2011 at 10:19 AM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:

> This is one case where I do not care about progress :-)  I just want
> to avoid deadlock.  I'd be happy to enqueue all window creation calls
> and have a single collective that would let me fire all of them at
> once.  I get that this is not really MPI style, but no one has
> convinced me that there is a portable, scalable way to solve this
> problem.
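>
> Concretely, something like this is what I have in mind (MPIX_Win_icreate
> is hypothetical, and NWIN, base, size, and subcomm are placeholders; only
> the shape of the API matters):
>
>     /* Start all subgroup window creations without blocking, then
>      * complete them together; no global ordering of the calls is
>      * needed because nothing blocks until the waitall. */
>     MPI_Request req[NWIN];
>     MPI_Win     win[NWIN];
>     for (int i = 0; i < NWIN; i++)
>         MPIX_Win_icreate(base[i], size[i], disp_unit, MPI_INFO_NULL,
>                          subcomm[i], &win[i], &req[i]);
>     MPI_Waitall(NWIN, req, MPI_STATUSES_IGNORE);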
>
> I can solve the O(N) metadata problem on BG; it just requires more
> work.  Almasi has already done this; it just isn't in MPI.  The
> motivation for putting windows on small groups instead of world is
> that this is sufficient.  Why should I have to allocate a bloody
> window on world just to get around the blocking collective nature of
> window creation and the deadlock-ability that results?
>
> Jeff
>
> On Fri, Sep 23, 2011 at 1:27 AM, Barrett, Brian W <bwbarre at sandia.gov> wrote:
> > In the Portals 4 implementation of dynamic windows, if the comm is
> > MPI_COMM_WORLD (or has the same group as MPI_COMM_WORLD), there is no
> > O(N) metadata for dynamic windows when the hardware supports a rational
> > memory registration strategy (and we have NIC designs where this is the
> > case).  There is a single list entry and memory descriptor on each
> > process, which share a pre-defined portal table index.  Over other
> > networks, there may be some scaling of resources, but there are
> > networks where that is not the case.
> >
> > There will be some scaling of resources with the size of the group as
> > sub-communicators are created, due to the need to map ranks to
> > endpoints; this is a problem with subgroups in general, not with
> > windows or communicators specifically.
> >
> > I think adding non-blocking window create is a bad idea; there are
> > many synchronization points in creating a window, and I'm not happy
> > with the concept of sticking that in a thread to meet the progress
> > rules.
> >
> > Brian
> >
> > --
> >  Brian W. Barrett
> >  Scalable System Software Group
> >  Sandia National Laboratories
> > ________________________________________
> > From: mpi3-rma-bounces at lists.mpi-forum.org
> > [mpi3-rma-bounces at lists.mpi-forum.org] on behalf of Jeff Hammond
> > [jhammond at alcf.anl.gov]
> > Sent: Thursday, September 22, 2011 4:08 PM
> > To: MPI 3.0 Remote Memory Access working group
> > Subject: Re: [Mpi3-rma] nonblocking MPI_Win_create etc.?
> >
> > The reason to put windows on subgroups is to avoid the O(N) metadata
> > in the window associated with registered memory.  For example, on BGP
> > a window has an O(N) allocation for DCMF memregions.  In the code my
> > friend develops, N=300000 on comm_world but N<200 on a subgroup.  He
> > is at the limit of available memory, which is what motivated the use
> > case for subgroup windows in the first place.
> >
> > I do not see how one can avoid O(N) metadata with
> > MPI_Win_create_dynamic on comm_world in the general case, unless one
> > completely abandons RDMA.  How exactly does registered memory become
> > visible when the user calls MPI_Win_attach?
> >
> > Jeff
> >
> > On Thu, Sep 22, 2011 at 4:58 PM, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
> >> In the new RMA, he could just call MPI_Win_create_dynamic once on
> >> comm_world and then locally attach memory to it using MPI_Win_attach.
> >> (And avoid using fence synchronization.)
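> >>
> >> That is, roughly (using the proposed MPI-3 calls; buf and nbytes are
> >> placeholders for whatever memory he wants to expose):
> >>
> >>     MPI_Win win;
> >>     MPI_Aint disp;
> >>     MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
> >>     MPI_Win_attach(win, buf, nbytes);   /* local, not collective */
> >>     MPI_Get_address(buf, &disp);
> >>     /* send disp to the origin; it serves as the target displacement
> >>      * for MPI_Put/MPI_Get on this window */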
> >>
> >> Rajeev
> >>
> >> On Sep 22, 2011, at 4:25 PM, Jeff Hammond wrote:
> >>
> >>> I work with someone who has a use case for nonblocking window
> >>> creation, because he can get into a deadlock situation unless he does
> >>> a lot of bookkeeping.  He's creating windows on subgroups of world
> >>> that can (will) overlap.  To prevent deadlock, he has to do a global
> >>> collective and figure out how to order all of the window creation
> >>> calls so that they do not deadlock; where that ordering requires
> >>> solving an NP-hard problem (it smells like a scheduling problem to
> >>> me) or requires too much storage to be practical (he works at Juelich
> >>> and regularly runs on 72 racks in VN mode), he has to serialize
> >>> window creation globally.
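> >>>
> >>> A minimal picture of the hazard, with three overlapping
> >>> subcommunicators commAB = {0,1}, commBC = {1,2}, commCA = {2,0}
> >>> (names are placeholders), given that window creation is collective
> >>> and may synchronize:
> >>>
> >>>     /* Each rank starts with a different window, so if the creation
> >>>      * calls synchronize there is a circular wait: 0 waits for 1 in
> >>>      * AB, 1 waits for 2 in BC, 2 waits for 0 in CA. */
> >>>     if (rank == 0) {
> >>>         MPI_Win_create(buf, n, 1, MPI_INFO_NULL, commAB, &winAB);
> >>>         MPI_Win_create(buf, n, 1, MPI_INFO_NULL, commCA, &winCA);
> >>>     } else if (rank == 1) {
> >>>         MPI_Win_create(buf, n, 1, MPI_INFO_NULL, commBC, &winBC);
> >>>         MPI_Win_create(buf, n, 1, MPI_INFO_NULL, commAB, &winAB);
> >>>     } else if (rank == 2) {
> >>>         MPI_Win_create(buf, n, 1, MPI_INFO_NULL, commCA, &winCA);
> >>>         MPI_Win_create(buf, n, 1, MPI_INFO_NULL, commBC, &winBC);
> >>>     }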
> >>>
> >>> Nonblocking window creation plus a waitall would solve this problem.
> >>>
> >>> Thoughts?  I wonder whether the semantics of nonblocking collectives,
> >>> which do not have tags, are even sufficient in the general case.
> >>>
> >>> Jeff
> >>>
> >>> --
> >>> Jeff Hammond
> >>> Argonne Leadership Computing Facility
> >>> University of Chicago Computation Institute
> >>> jhammond at alcf.anl.gov / (630) 252-5381
> >>> http://www.linkedin.com/in/jeffhammond
> >>> https://wiki.alcf.anl.gov/index.php/User:Jhammond
> >>
> >
> >
> >
> > --
> > Jeff Hammond
> > Argonne Leadership Computing Facility
> > University of Chicago Computation Institute
> > jhammond at alcf.anl.gov / (630) 252-5381
> > http://www.linkedin.com/in/jeffhammond
> > https://wiki.alcf.anl.gov/index.php/User:Jhammond
> >
>
>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/index.php/User:Jhammond
>
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>