[mpiwg-rma] [EXTERNAL] Re: Synchronization on shared memory windows

Tue Feb 4 12:03:00 CST 2014

On 2/4/14 10:39 AM, "Rolf Rabenseifner" <rabenseifner at hlrs.de> wrote:

>Brian and all,
>
>No wording in the MPI-3.0 tells anything about the compiler problems.
>They are addressed by MPI_F_SYNC_REG (or work around with
>MPI_Get_address) 
>to be sure that the store instruction is issued before I do any
>synchronization calls.

Jeff already pointed out the text that's important here.  It's also
important to note that we don't explicitly say how to program shared
memory directly, because it is too platform, compiler, and language
dependent for us to standardize.

>If I must do an additional MPI_WIN_SYNC, when must it be done?

It's hard to say, but generally any time you want 1) ordering of reads and
writes or 2) to force the compiler/language to re-issue a load.

>My example is simple:
>X is part of a shared memory window and should mean the same
>memory location in both processes
>
>Process A         Process B
>
>x=13
>MPI_F_SYNC_REG(X)
>MPI_Barrier       MPI_Barrier
>                  MPI_F_SYNC_REG(X)
>                  print X
>
>Where exactly do I need in which process an additional MPI_WIN_SYNC?

If you wanted to be absolutely sure that print X works, you would need an
MPI_WIN_SYNC after MPI_F_SYNC_REG on process A (to make sure the store
isn't moved to after the barrier) and another MPI_WIN_SYNC before
MPI_F_SYNC_REG on process B (to make sure the load isn't hoisted to before
the barrier).  Note that the reordering can occur at the compiler or
processor level, and nothing in MPI_Barrier requires a processor memory
barrier, which is why it's not safe to assume it will do the right thing.

Now, on x86, your code will likely work as written because the processor's
memory model is strongly ordered and the SYNC_REG / Barrier will keep the
compiler from moving the store or load.

>Which wording in the MPI-3.0 does tell this need?

None.  This is in the realm of undefined behavior for better performance
(because it's so processor/language/compiler specific).  Note that we do
define a mechanism you can use (put/get/synchronization), at a cost in
performance.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories