[mpiwg-rma] [EXTERNAL] shared-like access within a node with non-shared windows

Mon Oct 14 16:59:49 CDT 2013

On Mon, Oct 14, 2013 at 4:40 PM, Barrett, Brian W <bwbarre at sandia.gov> wrote:
> On 10/12/13 1:49 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>
>>Pavan told me that (in MPICH) MPI_Win_allocate is way better than
>>MPI_Win_create because the former allocates the shared memory
>>business.  It was implied that the latter requires more work within
>>the node. (I thought mmap could do the same magic on existing
>>allocations, but that's not really the point here.)
>
> Mmap unfortunately does no such magic.  In OMPI, the current design will
> use XPMEM to do that magic for WIN_CREATE, or only create shared memory
> windows when using MPI_WIN_ALLOCATE{_SHARED}.

Okay, it seems Blue Gene/Q is the only awesome machine that allows for
interprocess load-store for free (and not even "for 'free'").

>>But within a node, what's even better than a window allocated with
>>MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
>>since the latter permits load-store.  Then I wondered if it would be
>>possible to have both (1) direct load-store access within a node and
>>(2) scalable metadata for windows spanning many nodes.
>>
>>I can get (1) but not (2) by using MPI_Win_allocate_shared and then
>>dropping a second window for the internode part on top of these using
>>MPI_Win_create.  Of course, I can get (2) but not (1) using
>>MPI_Win_allocate.
>>
>>I propose that it be possible to get (1) and (2) by allowing
>>MPI_Win_shared_query to return pointers to shared memory within a node
>>even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
>>When the input argument rank to MPI_Win_shared_query corresponds to
>>memory that is not accessible by load-store, the out arguments size
>>and baseptr are 0 and NULL, respectively.
>
> I like the concept and can see its usefulness.  One concern I have is
> that there is some overhead when doing native RDMA implementations of
> windows if I'm combining that with shared memory semantics.  For example,
> imagine a network that provides fast atomics by having a NIC-side cache
> that's non-coherent with the processor caches.  I can flush that cache at
> the right times with the current interface, but that penalty is pretty
> small because "the right times" is pretty small.  With two levels of
> communication, the number of times that cache needs to be flushed is
> increased, adding some small amount of overhead.

How is this non-coherent NIC-side cache consistent with the UNIFIED
model, which is the only case in which shared-memory window semantics
are defined?  I am looking for a shortcut to the behavior of
overlapping windows where one of the windows is a shared-memory
window, so this is constrained to the UNIFIED model.

> I think that overhead's ok if we have a way to request that specific
> behavior, rather than asking after the fact if you can get shared pointers
> out of a multi-node window.

If there is a need to specify this, then an info key is sufficient,
no?  I would imagine some implementations provide it at no additional
cost and thus don't need the info key.

>>The non-scalable use of this feature would be to loop over all ranks
>>in the group associated with the window and test for baseptr!=NULL,
>>while the scalable use would presumably utilize MPI_Comm_split_type,
>>MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
>>ranks corresponding to the node, hence the ones that might permit
>>direct access.
>
> This brings up another question...  0 is already a valid size.  What do we
> do with FORTRAN for your proposed case?

I don't see what size has to do with this, but Pavan also pointed out
that Fortran is a problem.  Thus, my second suggestion would become a
requirement for usage, i.e. the user is only permitted to use
MPI_Win_shared_query on ranks in the communicator returned by
MPI_Comm_split_type with split_type=MPI_COMM_TYPE_SHARED.

Jeff

-- 
Jeff Hammond
jeff.science at gmail.com