[mpiwg-rma] [EXTERNAL] shared-like access within a node with non-shared windows

Barrett, Brian W bwbarre at sandia.gov
Tue Oct 15 10:01:01 CDT 2013


On 10/14/13 5:59 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:

>On Mon, Oct 14, 2013 at 4:40 PM, Barrett, Brian W <bwbarre at sandia.gov>
>wrote:
>> On 10/12/13 1:49 PM, "Jeff Hammond" <jeff.science at gmail.com> wrote:
>>
>> I like the concept and can see its usefulness.  One concern I have is
>> that there is some overhead when doing native RDMA implementations of
>> windows if I'm combining that with shared memory semantics.  For example,
>> imagine a network that provides fast atomics by having a NIC-side cache
>> that's non-coherent with the processor caches.  I can flush that cache at
>> the right times with the current interface, but that penalty is pretty
>> small because "the right times" is pretty small.  With two levels of
>> communication, the number of times that cache needs to be flushed is
>> increased, adding some small amount of overhead.
>
>How is this non-coherent NIC-side cache consistent with the UNIFIED
>model, which is the only case in which shared-memory window semantics
>are defined?  I am looking for a shortcut to the behavior of
>overlapping windows where one of the windows is a shared-memory
>window, so this is constrained to the UNIFIED model.

You're right, my bad.

>> I think that overhead's ok if we have a way to request that specific
>> behavior, rather than asking after the fact if you can get shared
>> pointers out of a multi-node window.
>
>If there is a need to specify this, then an info key is sufficient,
>no?  I would imagine some implementations provide it at no additional
>cost and thus don't need the info key.

Yes, I think an info key is sufficient.  My point is that I think you
should have to request the dual-mode behavior, rather than it being the
default.  Otherwise, the implementation is left guessing about whether the
user is later going to try to use the shared memory features or not.

>>>The non-scalable use of this feature would be to loop over all ranks
>>>in the group associated with the window and test for baseptr!=NULL,
>>>while the scalable use would presumably utilize MPI_Comm_split_type,
>>>MPI_Comm_group and MPI_Group_translate_ranks to determine the list of
>>>ranks corresponding to the node, hence the ones that might permit
>>>direct access.
>>
>> This brings up another question...  0 is already a valid size.  What do we
>> do with FORTRAN for your proposed case?
>
>I don't see what size has to do with this, but Pavan also pointed out
>that Fortran is a problem.  Thus, my second suggestion would become a
>requirement for usage, i.e. the user is only permitted to use
>win_shared_query on ranks in the communicator returned by
>MPI_Comm_split_type(type=SHARED).

If size=0 weren't valid, you could use that as a token to say there's
nothing useful at this process.  But size=0 is valid, so we need another
token.  And, just to be clear, Fortran sucks.
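
A small plain-C illustration of the token problem, with no MPI involved. The segment_query struct and both helper functions are hypothetical and exist only to show the ambiguity: a rank whose memory is reachable but which exposes zero bytes is indistinguishable from an unreachable rank if size is the only signal.

```c
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL result of querying one rank; this is not an MPI type. */
typedef struct {
    bool   accessible;  /* explicit token: reachable by load/store? */
    size_t size;        /* segment size in bytes; 0 is a legal value */
} segment_query;

/* Ambiguous: treats a legitimate zero-size segment as "nothing here". */
static bool can_access_by_size(const segment_query *q)
{
    return q->size != 0;
}

/* Unambiguous: relies on a token that is separate from the size. */
static bool can_access(const segment_query *q)
{
    return q->accessible;
}
```

For an on-node rank that exposes zero bytes, can_access_by_size() reports false even though the rank is reachable, while can_access() reports true.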

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories





