[Mpi3-rma] Memory barriers in MPI_WIN_LOCK_ALL mode

Tue Oct 30 08:26:00 CDT 2012

On Tue, Oct 30, 2012 at 08:48:47AM -0500, Pavan Balaji wrote:
>
> On 10/30/2012 07:28 AM, Torsten Hoefler wrote:
>>> If the MPI_GET on P1 does a local load operation internally, it is not
>>> memory consistent without a memory barrier.  Does this mean that I need
>>> to always do a memory barrier on all local GET operations to get the
>>> right value?
>> Well, if the put came from shared memory, then the flush would most
>> likely do a memory barrier (depending on the consistency model of the
>> architecture; e.g., under TSO (x86) it would need an mfence). If the put
>> came from remote memory, then there should not be a problem because the
>> flush must block until the data is visible to the CPU (assuming the
>> unified memory model; in the separate model you would need an
>> additional MPI_WIN_SYNC).
>
> Actually, the flush only needs to block until it's visible to another MPI
> RMA operation, not to the CPU (as in, not to load/store).
In the unified memory model, it has to guarantee visibility to
load/store operations as well.

> I don't think the target process can guarantee that the PUT is
> visible to a load/store without an additional memory barrier.
The flush of the source process has to ensure that.

> Since the MPI standard doesn't specify that the user needs to
> call a WIN_SYNC in this case, I'm asking if the MPI implementation
> needs to do a memory barrier internally.
Yes, MPI_WIN_SYNC is only needed in the separate model, where it would
do a memory barrier. The MPI library has to guarantee that there is no
inconsistency between the public and private copies in the unified model
(or it cannot claim that it supports unified).
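
As a rough illustration of the barrier a flush would have to issue for a
put that went through shared memory, here is a minimal C11 sketch. It is
not MPI code and not how any particular MPI library implements this:
shm_window, shm_put_and_flush and shm_local_get are hypothetical names,
and the "window" is just a struct that two threads or processes are
assumed to share. The origin stores the data, fences, and only then sets
a flag; the target checks the flag and fences before its local load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a window mapped by both origin and target. */
typedef struct {
    int payload;        /* data written by the "put" (plain store) */
    atomic_bool ready;  /* completion flag set once the "flush" is done */
} shm_window;

/* Origin side: do the put as a plain store, then issue a full fence
 * before signalling completion, so the payload store is ordered before
 * the flag store. On x86 a seq_cst fence is typically compiled to an
 * mfence or a locked instruction. */
void shm_put_and_flush(shm_window *w, int value)
{
    assert(w != NULL);
    w->payload = value;
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&w->ready, true, memory_order_relaxed);
}

/* Target side: a local "get" that only trusts the payload after it has
 * seen the flag and issued an acquire fence. Returns false and leaves
 * *out untouched if the put has not completed yet. */
bool shm_local_get(shm_window *w, int *out)
{
    assert(w != NULL && out != NULL);
    if (!atomic_load_explicit(&w->ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = w->payload;
    return true;
}
```

With a zero-initialized shm_window w, shm_local_get(&w, &v) returns false
until shm_put_and_flush(&w, 42) has run; afterwards it returns true and
stores 42 in v. The fence on the origin side is the part that corresponds
to "the flush of the source process has to ensure that".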

>>> Note that this is likely only a theoretical exercise, since most (all?)
>>> compilers will do a memory barrier anyway if they see a function call
>>> (MPI_GET in this case).  But is MPI assuming that that's going to be the
>>> case for efficient execution?
>> Really? Which compiler puts a memory barrier before function calls? This
>> sounds rather inefficient to me (in the days of fastcall).
>
> Sorry, I meant that the compiler will not reorder operations across
> function calls, not that it will insert a memory barrier.
Ah, right!

Best,
  Torsten

-- 
### qreharg rug ebs fv crryF ------------- http://www.unixer.de/ -----
Torsten Hoefler           | Assistant Professor
Dept. of Computer Science | ETH Zürich
Universitätsstrasse 6     | Zurich-8092, Switzerland
CAB E 64.1                | Phone: +41 76 309 79 29