[mpiwg-rma] [EXTERNAL] shared-like access within a node with non-shared windows

Tue Oct 15 06:24:53 CDT 2013

I use MPI RMA every day and it is the basis for NWChem on Blue Gene/Q
right now.  MPI-RMA was the basis for many successful scientific
simulations on Cray XE6 and Blue Gene/P with NWChem as well.

There is also http://pubs.acs.org/doi/abs/10.1021/ct200371n, which was
an application in biochemistry; we later rewrote that code to use
MPI-RMA directly rather than through ARMCI-MPI, which made it much
more efficient.

It seems from http://lists.openfabrics.org/pipermail/ewg/2013-May/017872.html
that you are affiliated with the GPI project.  It is sad to see that
you're allergic to empirical evidence and resort to belligerent
nonsense in an attempt to make your project relevant.

This is at least the third time you've trolled this list
[http://lists.mpi-forum.org/mpiwg-rma/2012/09/0861.php,
http://lists.mpi-forum.org/mpiwg-rma/2013/06/1070.php].
Please cease and desist immediately.

Jeff

On Tue, Oct 15, 2013 at 6:13 AM, maik peterson
<maikpeterson at googlemail.com> wrote:
> All these MPI_Win_X stuff has no benefit in practice. why do you care ? no
> one is using it, mp.
>
>
>
> 2013/10/14 Jeff Hammond <jeff.science at gmail.com>
>>
>> On Mon, Oct 14, 2013 at 4:40 PM, Barrett, Brian W <bwbarre at sandia.gov>
>> wrote:
>> > On 10/12/13 1:49 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>> >
>> >>Pavan told me that (in MPICH) MPI_Win_allocate is way better than
>> >>MPI_Win_create because the former allocates the shared memory
>> >>business.  It was implied that the latter requires more work within
>> >>the node. (I thought mmap could do the same magic on existing
>> >>allocations, but that's not really the point here.)
>> >
>> > Mmap unfortunately does no such magic.  In OMPI, the current design will
>> > use XPMEM to do that magic for WIN_CREATE, or only create shared memory
>> > windows when using MPI_WIN_ALLOCATE{_SHARED}.
>>
>> Okay, it seems Blue Gene/Q is the only awesome machine that allows for
>> interprocess load-store for free (and not even "for 'free'").
>>
>> >>But within a node, what's even better than a window allocated with
>> >>MPI_Win_allocate is a window allocated with MPI_Win_allocate_shared,
>> >>since the latter permits load-store.  Then I wondered if it would be
>> >>possible to have both (1) direct load-store access within a node and
>> >>(2) scalable metadata for windows spanning many nodes.
>> >>
>> >>I can get (1) but not (2) by using MPI_Win_allocate_shared and then
>> >>dropping a second window for the internode part on top of these using
>> >>MPI_Win_create.  Of course, I can get (2) but not (1) using
>> >>MPI_Win_allocate.
>> >>
>> >>I propose that it be possible to get (1) and (2) by allowing
>> >>MPI_Win_shared_query to return pointers to shared memory within a node
>> >>even if the window has MPI_WIN_CREATE_FLAVOR=MPI_WIN_FLAVOR_ALLOCATE.
>> >>When the input argument rank to MPI_Win_shared_query corresponds to
>> >>memory that is not accessible by load-store, the out arguments size
>> >>and baseptr are 0 and NULL, respectively.
>> >
>> > I like the concept and can see its usefulness.  One concern I have is
>> > that there is some overhead when doing native RDMA implementations of
>> > windows if I'm combining that with shared memory semantics.  For
>> > example, imagine a network that provides fast atomics by having a
>> > NIC-side cache that's non-coherent with the processor caches.  I can
>> > flush that cache at the right times with the current interface, but
>> > that penalty is pretty small because the number of "right times" is
>> > pretty small.  With two levels of communication, the number of times
>> > that cache needs to be flushed is increased, adding some small amount
>> > of overhead.
>>
>> How is this non-coherent NIC-side cache consistent with the UNIFIED
>> model, which is the only case in which shared-memory window semantics
>> are defined?  I am looking for a shortcut to the behavior of
>> overlapping windows where one of the windows is a shared-memory
>> window, so this is constrained to the UNIFIED model.
>>
>> > I think that overhead's ok if we have a way to request that specific
>> > behavior, rather than asking after the fact if you can get shared
>> > pointers out of a multi-node window.
>>
>> If there is a need to specify this, then an info key is sufficient,
>> no?  I would imagine some implementations provide it at no additional
>> cost and thus don't need the info key.
>>
>> >>The non-scalable use of this feature would be to loop over all ranks
>> >>in the group associated with the window and test for baseptr!=NULL,
>> >>while the scalable use would presumably utilize MPI_Comm_split_type,
>> >>MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
>> >>ranks corresponding to the node, hence the ones that might permit
>> >>direct access.
>> >
>> > This brings up another question...  0 is already a valid size.  What do we
>> > do with FORTRAN for your proposed case?
>>
>> I don't see what size has to do with this, but Pavan also pointed out
>> that Fortran is a problem.  Thus, my second suggestion would become a
>> requirement for usage, i.e. the user is only permitted to use
>> win_shared_query on ranks in the communicator returned by
>> MPI_Comm_split_type(type=SHARED).
>>
>> Jeff
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> _______________________________________________
>> mpiwg-rma mailing list
>> mpiwg-rma at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma

-- 
Jeff Hammond
jeff.science at gmail.com