[mpiwg-rma] [EXTERNAL] shared-like access within a node with non-shared windows

Barrett, Brian W bwbarre at sandia.gov
Mon Oct 14 16:40:16 CDT 2013


On 10/12/13 1:49 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:

>Pavan told me that (in MPICH) MPI_Win_allocate is way better than
>MPI_Win_create because the former allocates the shared memory
>business.  It was implied that the latter requires more work within
>the node. (I thought mmap could do the same magic on existing
>allocations, but that's not really the point here.)

Mmap unfortunately does no such magic.  In OMPI, the current design
will use XPMEM (when available) to do that magic for WIN_CREATE, and
otherwise only creates shared memory windows for
MPI_WIN_ALLOCATE{_SHARED}.
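
To make the distinction concrete, here's a minimal sketch (error
checking omitted); the difference is just who provides the memory:

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      MPI_Aint bytes = 1024 * sizeof(double);

      /* WIN_CREATE: the caller provides the buffer, so the library
         can't place it in shared memory without kernel help such as
         XPMEM. */
      double *buf = malloc(bytes);
      MPI_Win cwin;
      MPI_Win_create(buf, bytes, sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &cwin);

      /* WIN_ALLOCATE: the library provides the buffer, so it is free
         to carve it out of a node-wide shared segment. */
      double *abuf;
      MPI_Win awin;
      MPI_Win_allocate(bytes, sizeof(double), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &abuf, &awin);

      MPI_Win_free(&awin);
      MPI_Win_free(&cwin);
      free(buf);
      MPI_Finalize();
      return 0;
  }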

>But within a node, what's even better than a window allocated with
>MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
>since the latter permits load-store.  Then I wondered if it would be
>possible to have both (1) direct load-store access within a node and
>(2) scalable metadata for windows spanning many nodes.
>
>I can get (1) but not (2) by using MPI_Win_allocate_shared and then
>dropping a second window for the internode part on top of these using
>MPI_Win_create.  Of course, I can get (2) but not (1) using
>MPI_Win_allocate.
>
>I propose that it be possible to get (1) and (2) by allowing
>MPI_Win_shared_query to return pointers to shared memory within a node
>even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
>When the input argument rank to MPI_Win_shared_query corresponds to
>memory that is not accessible by load-store, the out arguments size
>and baseptr are 0 and NULL, respectively.
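
To spell out that layered workaround in code (a rough sketch, assuming
a 1024-element window; untested):

  #include <mpi.h>

  /* (1) without (2): node-local load-store via allocate_shared, plus
     internode RMA via a second, created window over the same memory. */
  void make_layered_windows(MPI_Win *shwin, MPI_Win *win, double **base)
  {
      MPI_Comm nodecomm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &nodecomm);

      MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                              MPI_INFO_NULL, nodecomm, base, shwin);

      /* The second window is where the non-scalable O(nprocs)
         metadata comes from. */
      MPI_Win_create(*base, 1024 * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, win);

      MPI_Comm_free(&nodecomm);
  }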

I like the concept and can see its usefulness.  One concern I have is
that combining a native RDMA implementation of windows with shared
memory semantics adds overhead.  For example, imagine a network that
provides fast atomics by having a NIC-side cache that's non-coherent
with the processor caches.  I can flush that cache at the right times
with the current interface, and that penalty is pretty small because
"the right times" are few.  With two levels of communication, the
number of times that cache needs to be flushed increases, adding some
small amount of overhead.

I think that overhead's ok if we have a way to request that specific
behavior, rather than asking after the fact whether you can get shared
pointers out of a multi-node window.
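
For concreteness, requesting it up front might look like the sketch
below; the info key is purely hypothetical, and the zero-size/NULL
convention is the proposed behavior, not anything the standard
guarantees today:

  #include <mpi.h>
  #include <stdio.h>

  void query_peer(int peer_rank)
  {
      /* HYPOTHETICAL info key opting in to shared pointers on a
         multi-node window; nothing like it exists today. */
      MPI_Info info;
      MPI_Info_create(&info);
      MPI_Info_set(info, "same_node_ptrs", "true");

      double *base;
      MPI_Win win;
      MPI_Win_allocate(1024 * sizeof(double), sizeof(double), info,
                       MPI_COMM_WORLD, &base, &win);
      MPI_Info_free(&info);

      /* Proposed semantics: an on-node peer yields a load-store
         pointer; an off-node peer yields size 0 and baseptr NULL. */
      MPI_Aint size;
      int disp_unit;
      double *peer = NULL;
      MPI_Win_shared_query(win, peer_rank, &size, &disp_unit, &peer);
      if (peer != NULL)
          printf("rank %d is on-node\n", peer_rank);

      MPI_Win_free(&win);
  }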

>The non-scalable use of this feature would be to loop over all ranks
>in the group associated with the window and test for baseptr!=NULL,
>while the scalable use would presumably utilize MPI_Comm_split_type,
>MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
>ranks corresponding to the node, hence the ones that might permit
>direct access.
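
The scalable variant would look roughly like this sketch, which just
computes the list of on-node ranks in the window's communicator:

  #include <mpi.h>
  #include <stdlib.h>

  /* Return the ranks of 'comm' that share a node with the caller:
     split off the node-local communicator, then translate its ranks
     back into comm's group.  Caller frees the result. */
  int *node_ranks(MPI_Comm comm, int *nnode)
  {
      MPI_Comm nodecomm;
      MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &nodecomm);

      MPI_Group g, ng;
      MPI_Comm_group(comm, &g);
      MPI_Comm_group(nodecomm, &ng);
      MPI_Group_size(ng, nnode);

      int *local = malloc(*nnode * sizeof(int));
      int *global = malloc(*nnode * sizeof(int));
      for (int i = 0; i < *nnode; i++)
          local[i] = i;
      MPI_Group_translate_ranks(ng, *nnode, local, g, global);

      free(local);
      MPI_Group_free(&ng);
      MPI_Group_free(&g);
      MPI_Comm_free(&nodecomm);
      return global;
  }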

This brings up another question: 0 is already a valid size, so in C it
is the NULL baseptr that disambiguates.  What do we do in Fortran for
your proposed case?

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories