[mpiwg-rma] [EXTERNAL] shared-like access within a node with non-shared windows

Jeff Hammond jeff.science at gmail.com
Tue Oct 15 10:24:07 CDT 2013

On Tue, Oct 15, 2013 at 10:01 AM, Barrett, Brian W <bwbarre at sandia.gov> wrote:
> On 10/14/13 5:59 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>>On Mon, Oct 14, 2013 at 4:40 PM, Barrett, Brian W <bwbarre at sandia.gov> wrote:
>>> On 10/12/13 1:49 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>>> I like the concept and can see its usefulness.  One concern I have is
>>> that there is some overhead when doing native RDMA implementations of
>>> windows if I'm combining that with shared memory semantics.  For example,
>>> imagine a network that provides fast atomics by having a NIC-side cache
>>> that's non-coherent with the processor caches.  I can flush that cache at
>>> the right times with the current interface, but that penalty is pretty
>>> small because "the right times" is pretty small.  With two levels of
>>> communication, the number of times that cache needs to be flushed is
>>> increased, adding some small amount of overhead.
>>How is this non-coherent NIC-side cache consistent with the UNIFIED
>>model, which is the only case in which shared-memory window semantics
>>are defined?  I am looking for a shortcut to the behavior of
>>overlapping windows where one of the windows is a shared-memory
>>window, so this is constrained to the UNIFIED model.
> You're right, my bad.

I'm framing this email :-)

>>> I think that overhead's ok if we have a way to request that specific
>>> behavior, rather than asking after the fact if you can get shared
>>> out of a multi-node window.
>>If there is a need to specify this, then an info key is sufficient,
>>no?  I would imagine some implementations provide it at no additional
>>cost and thus don't need the info key.
> Yes, I think an info key is sufficient.  My point is that I think you
> should have to request the dual-mode behavior, rather than it being the
> default.  Otherwise, the implementation is left guessing about whether the
> user is later going to try to use the shared memory features or not.

Yeah, I agree.  I expect SHMEM and ARMCI to use this but I imagine
most applications using MPI-3 directly won't implement the intranode
optimization (given how many other optimizations most codes fail to

>>>>The non-scalable use of this feature would be to loop over all ranks
>>>>in the group associated with the window and test for baseptr!=NULL,
>>>>while the scalable use would presumably utilize MPI_Comm_split_type,
>>>>MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
>>>>ranks corresponding to the node, hence the ones that might permit
>>>>direct access.
>>> This brings up another question...  0 is already a valid size.  What do we
>>> do with FORTRAN for your proposed case?
>>I don't see what size has to do with this, but Pavan also pointed out
>>that Fortran is a problem.  Thus, my second suggestion would become a
>>requirement for usage, i.e. the user is only permitted to use
>>win_shared_query on ranks in the communicator returned by
> If size=0 wasn't valid, you could use that as a token to say there's
> nothing useful at this process.  But size=0 is valid, so we need another
> token.  And, just to be clear, Fortran sucks.

Okay, I'll leave it up to e.g. Bill to decide what sort of magic value
- new or old - is appropriate for this.  I really do not want to add a
new function that adds a flag to indicate "yes, you can use this


Jeff Hammond
jeff.science at gmail.com
