[mpiwg-rma] [EXTERNAL] shared-like access within a node with non-shared windows

maik peterson maikpeterson at googlemail.com
Thu Oct 17 04:31:05 CDT 2013


>>I use MPI RMA every day and it is the basis for NWChem on Blue Gene/Q
right now.  MPI-RMA was the basis for many successful scientific
simulations on Cray XE6 and Blue Gene/P with NWChem as well.

There is also http://pubs.acs.org/doi/abs/10.1021/ct200371n, which was
an application in biochemistry; we later rewrote that code to use
MPI-RMA directly rather than through ARMCI-MPI, which made it much
more efficient.<<

Believe me, it can be done much better without MPI-RMA on both platforms.
It is really sad that people still waste their time on developments like
that. Wake up!


2013/10/15 Jeff Hammond <jeff.science at gmail.com>

> I use MPI RMA every day and it is the basis for NWChem on Blue Gene/Q
> right now.  MPI-RMA was the basis for many successful scientific
> simulations on Cray XE6 and Blue Gene/P with NWChem as well.
>
> There is also http://pubs.acs.org/doi/abs/10.1021/ct200371n, which was
> an application in biochemistry; we later rewrote that code to use
> MPI-RMA directly rather than through ARMCI-MPI, which made it much
> more efficient.
>
> It seems from
> http://lists.openfabrics.org/pipermail/ewg/2013-May/017872.html
> that you are affiliated with the GPI project.  It is sad to see that
> you're allergic to empirical evidence and resort to belligerent
> nonsense in an attempt to make your project relevant.
>
> This is at least the third time you've trolled this list
> [http://lists.mpi-forum.org/mpiwg-rma/2012/09/0861.php,
> http://lists.mpi-forum.org/mpiwg-rma/2013/06/1070.php].
> Please cease and desist immediately.
>
> Jeff
>
> On Tue, Oct 15, 2013 at 6:13 AM, maik peterson
> <maikpeterson at googlemail.com> wrote:
> > All this MPI_Win_X stuff has no benefit in practice. Why do you care?
> > No one is using it, mp.
> >
> >
> >
> > 2013/10/14 Jeff Hammond <jeff.science at gmail.com>
> >>
> >> On Mon, Oct 14, 2013 at 4:40 PM, Barrett, Brian W <bwbarre at sandia.gov>
> >> wrote:
> >> > On 10/12/13 1:49 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
> >> >
> >> >>Pavan told me that (in MPICH) MPI_Win_allocate is way better than
> >> >>MPI_Win_create because the former allocates the shared memory
> >> >>business.  It was implied that the latter requires more work within
> >> >>the node. (I thought mmap could do the same magic on existing
> >> >>allocations, but that's not really the point here.)
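> >> >>
> >> >>(For reference, the difference is just who supplies the buffer; a
> >> >>rough, untested sketch, assuming comm and n are already set up and
> >> >>headers/error checks are omitted:)
> >> >>
> >> >>  /* MPI_Win_allocate: the library picks the memory, so it is free
> >> >>     to back the window with node-level shared memory. */
> >> >>  double *abuf;
> >> >>  MPI_Win awin;
> >> >>  MPI_Win_allocate(n * sizeof(double), sizeof(double), MPI_INFO_NULL,
> >> >>                   comm, &abuf, &awin);
> >> >>
> >> >>  /* MPI_Win_create: the user supplies the memory, so the library has
> >> >>     to cope with whatever buffer it is handed. */
> >> >>  double *cbuf = malloc(n * sizeof(double));
> >> >>  MPI_Win cwin;
> >> >>  MPI_Win_create(cbuf, n * sizeof(double), sizeof(double),
> >> >>                 MPI_INFO_NULL, comm, &cwin);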
> >> >
> >> > Mmap unfortunately does no such magic.  In OMPI, the current design
> >> > will use XPMEM to do that magic for WIN_CREATE, or only create
> >> > shared memory windows when using MPI_WIN_ALLOCATE{_SHARED}.
> >>
> >> Okay, it seems Blue Gene/Q is the only awesome machine that allows for
> >> interprocess load-store for free (and I mean actually free, not just
> >> "free").
> >>
> >> >>But within a node, what's even better than a window allocated with
> >> >>MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
> >> >>since the latter permits load-store.  Then I wondered if it would be
> >> >>possible to have both (1) direct load-store access within a node and
> >> >>(2) scalable metadata for windows spanning many nodes.
> >> >>
> >> >>I can get (1) but not (2) by using MPI_Win_allocate_shared and then
> >> >>dropping a second window for the internode part on top of these using
> >> >>MPI_Win_create.  Of course, I can get (2) but not (1) using
> >> >>MPI_Win_allocate.
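> >> >>
> >> >>Roughly, the (1)-but-not-(2) combination looks like this (untested
> >> >>sketch; comm is the full communicator, n the per-process element
> >> >>count, headers and error checks omitted):
> >> >>
> >> >>  /* node-local window: permits load-store via MPI_Win_shared_query */
> >> >>  MPI_Comm nodecomm;
> >> >>  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
> >> >>                      &nodecomm);
> >> >>  double *base;
> >> >>  MPI_Win shm_win;
> >> >>  MPI_Win_allocate_shared(n * sizeof(double), sizeof(double),
> >> >>                          MPI_INFO_NULL, nodecomm, &base, &shm_win);
> >> >>
> >> >>  /* internode window layered on the same memory: this is the part
> >> >>     that drags in the non-scalable WIN_CREATE metadata */
> >> >>  MPI_Win global_win;
> >> >>  MPI_Win_create(base, n * sizeof(double), sizeof(double),
> >> >>                 MPI_INFO_NULL, comm, &global_win);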
> >> >>
> >> >>I propose that it be possible to get (1) and (2) by allowing
> >> >>MPI_Win_shared_query to return pointers to shared memory within a node
> >> >>even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
> >> >>When the input argument rank to MPI_Win_shared_query corresponds to
> >> >>memory that is not accessible by load-store, the out arguments size
> >> >>and baseptr are 0 and NULL, respectively.
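> >> >>
> >> >>Under that proposal, the user-side check would be something like this
> >> >>(a sketch of the proposed semantics, not what MPI-3 guarantees today):
> >> >>
> >> >>  MPI_Aint qsize;
> >> >>  int qdisp;
> >> >>  double *qptr;
> >> >>  MPI_Win_shared_query(win, target, &qsize, &qdisp, &qptr);
> >> >>  if (qptr != NULL) {
> >> >>      qptr[0] = 42.0;   /* same node: direct load-store */
> >> >>  } else {
> >> >>      /* different node: fall back to MPI_Put/MPI_Get/MPI_Accumulate */
> >> >>  }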
> >> >
> >> > I like the concept and can see its usefulness.  One concern I have is
> >> > that there is some overhead when doing native RDMA implementations of
> >> > windows if I'm combining that with shared memory semantics.  For
> >> > example, imagine a network that provides fast atomics by having a
> >> > NIC-side cache that's non-coherent with the processor caches.  I can
> >> > flush that cache at the right times with the current interface, but
> >> > that penalty is pretty small because "the right times" is pretty
> >> > small.  With two levels of communication, the number of times that
> >> > cache needs to be flushed is increased, adding some small amount of
> >> > overhead.
> >>
> >> How is this non-coherent NIC-side cache consistent with the UNIFIED
> >> model, which is the only case in which shared-memory window semantics
> >> are defined?  I am looking for a shortcut to the behavior of
> >> overlapping windows where one of the windows is a shared-memory
> >> window, so this is constrained to the UNIFIED model.
> >>
> >> > I think that overhead's OK if we have a way to request that specific
> >> > behavior, rather than asking after the fact if you can get shared
> >> > pointers out of a multi-node window.
> >>
> >> If there is a need to specify this, then an info key is sufficient,
> >> no?  I would imagine some implementations provide it at no additional
> >> cost and thus don't need the info key.
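> >>
> >> E.g., with a made-up key name purely for illustration (nothing like it
> >> is standardized; bytes, comm, base and win as usual):
> >>
> >>   MPI_Info info;
> >>   MPI_Info_create(&info);
> >>   /* hypothetical key requesting node-level shared-query support */
> >>   MPI_Info_set(info, "node_shared_query", "true");
> >>   MPI_Win_allocate(bytes, 1, info, comm, &base, &win);
> >>   MPI_Info_free(&info);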
> >>
> >> >>The non-scalable use of this feature would be to loop over all ranks
> >> >>in the group associated with the window and test for baseptr!=NULL,
> >> >>while the scalable use would presumably utilize MPI_Comm_split_type,
> >> >>MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
> >> >>ranks corresponding to the node, hence the ones that might permit
> >> >>direct access.
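> >> >>
> >> >>The scalable variant would look roughly like this (untested sketch;
> >> >>wincomm is the communicator the window was created over, headers and
> >> >>error checks omitted):
> >> >>
> >> >>  MPI_Comm nodecomm;
> >> >>  MPI_Comm_split_type(wincomm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
> >> >>                      &nodecomm);
> >> >>
> >> >>  MPI_Group win_group, node_group;
> >> >>  MPI_Comm_group(wincomm, &win_group);
> >> >>  MPI_Comm_group(nodecomm, &node_group);
> >> >>
> >> >>  int nnode;
> >> >>  MPI_Comm_size(nodecomm, &nnode);
> >> >>
> >> >>  /* translate node-local ranks 0..nnode-1 into ranks of the window */
> >> >>  int *nranks = malloc(nnode * sizeof(int));
> >> >>  int *wranks = malloc(nnode * sizeof(int));
> >> >>  for (int i = 0; i < nnode; i++) nranks[i] = i;
> >> >>  MPI_Group_translate_ranks(node_group, nnode, nranks,
> >> >>                            win_group, wranks);
> >> >>  /* wranks[] now lists the only ranks worth passing to
> >> >>     MPI_Win_shared_query */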
> >> >
> >> > This brings up another question... 0 is already a valid size.  What
> >> > do we do with Fortran for your proposed case?
> >>
> >> I don't see what size has to do with this, but Pavan also pointed out
> >> that Fortran is a problem.  Thus, my second suggestion would become a
> >> requirement for usage, i.e. the user is only permitted to use
> >> win_shared_query on ranks in the communicator returned by
> >> MPI_Comm_split_type(type=SHARED).
> >>
> >> Jeff
> >>
> >> --
> >> Jeff Hammond
> >> jeff.science at gmail.com
> >> _______________________________________________
> >> mpiwg-rma mailing list
> >> mpiwg-rma at lists.mpi-forum.org
> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma
> >
> >
> >
> > _______________________________________________
> > mpiwg-rma mailing list
> > mpiwg-rma at lists.mpi-forum.org
> > http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma
>
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> _______________________________________________
> mpiwg-rma mailing list
> mpiwg-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma
>