[mpiwg-rma] [MPI Forum] #437: MPI 3.0 Errata: MPI Active Target synchronization requirements for shared memory, 2 alternatives

Rolf Rabenseifner rabenseifner at hlrs.de
Tue Aug 5 12:21:39 CDT 2014


(Small update below in case 4: it is split into 4AC and 4B.)

Dear Bill,

will you be at the Japan Forum meeting?

In our telecon - provided I understood correctly - you said that
you will try to make a first proposal to solve the shared memory
problems. Please take this email only as possible input for
your proposal, i.e., it is not intended as a basis for discussion.

For me, it is important to finish this discussion more than two
weeks before the Japan meeting, because it must be part 
of the MPI-3.0 errata and MPI-3.1.

The two-week deadline is already Sep. 1, 2014, i.e., 4 weeks from now!

-----
Important sections that may need corrections:

Sect.11.2.3

p409:16-17 mentions the first time
"The locally allocated memory can be the
target of load/store accesses by remote processes;"

Here we may mention:
The rules for RMA operations do not apply to these
remote load/store accesses; additional rules apply, 
see, e.g., Section 11.5.4A [the new one below] and 
11.5.5 [on assertions] on pages ... and ... .  

p410:15-21: May need corrections

p436:37-48: Especially the first sentence - #436

How is this section related to shared memory?
If there are additional rules for shared memory that
do not apply to all unified memory windows,
then an additional section on "shared & unified" memory
needs to be added.

I recommend defining it in the form
"a load/store access in one process ... a load/store access in 
another process"
and not in terms of "local" and "remote" load/store.
For me, after MPI_Win_allocate_shared,
there are not really memory portions that are local to 
the processes.
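
Only as an illustration (not as proposed errata text; the names and
the one-int size are arbitrary), this is why I see it this way:
after MPI_Win_allocate_shared, any process can obtain with
MPI_Win_shared_query a pointer to the portion contributed by any
other process and access it with plain loads and stores, e.g.:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm nodecomm;
    MPI_Win  win;
    int     *mybase, *base0;     /* my portion / rank 0's portion */
    MPI_Aint size0;
    int      dispunit0, rank;

    MPI_Init(&argc, &argv);
    /* communicator of the processes that can share memory */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &rank);

    /* every process contributes one int to the shared window */
    MPI_Win_allocate_shared((MPI_Aint)sizeof(int), sizeof(int),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);

    /* pointer to the portion allocated by rank 0 */
    MPI_Win_shared_query(win, 0, &size0, &dispunit0, &base0);

    /* base0[0] can be the target of load/store accesses by every
       process of the window, not only by rank 0; the rules for
       synchronizing such accesses are the subject of this email */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}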

Regarding the sentence at p436:45-46:
"The order in which data is written is not
specified unless further synchronization is used."

If a process issues several loads and stores to the 
same location, they are executed in the given sequence
without further need for synchronization.

rank 0   rank 1
x=1      print x
x=2      print x

If two processes access the same location, is it possible
that the two prints show something other than 1-1, 1-2, or 2-2?
Do we want to keep this undefined?


I would propose to add a rationale like

---
Rationale. If two processes on a ccNUMA node access the same 
memory location with an intermediate process-to-process
synchronization, then the outcome is still undefined,
because the first memory operation need not be 
finished before the second memory operation starts.
A well-defined execution may require the following sequence:
1) the memory access on the first process; 
2) a local memory fence to guarantee that the memory
   operation is finished;
3) the process-to-process synchronization (e.g., send/recv)
   to inform the second process; 
4) a local memory fence on the second process to guarantee
   that subsequent memory operations are issued
   to the memory;
5) the memory operation on the second process.
To guarantee the outcome of memory accesses in combination
with the synchronization as described in Section 11.5.4A on
page ... (NEW SECTION, SEE BELOW), the synchronization operation
will issue such local memory fences if needed.
End of rationale.
---
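
Only as a sketch (assuming at least two processes on one node and
the 4AC pattern proposed below; variable names are arbitrary), the
five steps of this rationale could look as follows in C, with
MPI_Win_sync acting as the local memory fence inside an epoch
opened with MPI_Win_lock_all:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm nodecomm;
    MPI_Win  win;
    int     *mybase, *A, rank, dummy = 0;
    MPI_Aint asize;
    int      adisp;

    MPI_Init(&argc, &argv);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &rank);

    MPI_Win_allocate_shared((MPI_Aint)sizeof(int), sizeof(int),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);
    MPI_Win_shared_query(win, 0, &asize, &adisp, &A); /* A in rank 0's portion */

    MPI_Win_lock_all(0, win);                          /* passive target epoch */

    if (rank == 0) {
        A[0] = 1;                                      /* 1) memory access in P0      */
        MPI_Win_sync(win);                             /* 2) local memory fence       */
        MPI_Send(&dummy, 1, MPI_INT, 1, 0, nodecomm);  /* 3) process-to-process sync  */
    } else if (rank == 1) {
        MPI_Recv(&dummy, 1, MPI_INT, 0, 0, nodecomm,
                 MPI_STATUS_IGNORE);                   /* 3) process-to-process sync  */
        MPI_Win_sync(win);                             /* 4) local memory fence in P1 */
        printf("load(A) = %d\n", A[0]);                /* 5) memory access in P1,
                                                             loads val_1 = 1          */
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}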

This rationale makes it understandable what we discussed
in so many emails about the need for MPI_WIN_SYNC.


  
The major part is the new synchronization rules:
I propose to put them together in one new section after Section 11.5.4.

11.5.4A Shared Memory Synchronization

In the case of an MPI shared memory window, additional rules
apply for synchronizing load and store accesses from several
processes to the same location. A location is at least one byte.

In the following patterns, locations in a shared memory window
are noted with variable names A, B, C; loads from such windows
are noted with load(...), and stores are noted by assignments
to these variables.

Patterns with active target communication and MPI_Win_sync:

     process P0          process P1
 
     A=val_1
     Sync-to-P1      --> Sync-from-P0
                         load(A)

     load(B)
     Sync-to-P1      --> Sync-from-P0
                         B=val_2

     C=val_3
     Sync-to-P1      --> Sync-from-P0
                         C=val_4
                         load (C)

with
     Sync-to-P1      --> Sync-from-P0
can be
1.   MPI_Win_fence   --> MPI_Win_fence   1)
2.   MPI_Win_post    --> MPI_Win_start   2)
3.   MPI_Win_complete--> MPI_Win_wait    3)
4AC. Only for the cases with variables 
       A (i.e., write-read) and 
       C (i.e., write-write):
     MPI_Win_sync
     any process sync
       from P0 to P1 --> any process sync
                           from P0       4)
                         MPI_Win_sync
4B.  Only for the case with variable
       B (i.e., read-write rule):
     any process sync
       from P0 to P1 --> any process sync
                           from P0       4)
                       
Footnotes:
 1) MPI_Win_fence synchronizes in both directions and between
    every process in the process group of the window.
 2) The arrow means that P1 is in the origin group passed 
    to MPI_Win_post in P0, and that P0 is in the target 
    group passed to MPI_Win_start.
    Additional calls to MPI_Win_complete 
    (in P1 after MPI_Win_start) and MPI_Win_wait 
    (in P0 after MPI_Win_post) are needed. The location of 
    these calls does not influence the guaranteed outcome rules.
 3) The arrow means that P1 is in the target group passed 
    to MPI_Win_start that corresponds to MPI_Win_complete in P0, 
    and P0 is in the origin group passed 
    to MPI_Win_post that corresponds to MPI_Win_wait in P1.
    Additional calls to MPI_Win_start 
    (in P0 before MPI_Win_complete) and MPI_Win_post 
    (in P1 before MPI_Win_wait) are needed. The location of 
    these calls does not influence the guaranteed outcome rules.
 4) The synchronization may be done with methods from
    MPI (e.g., send-->recv) or with other methods.
    The requirements for using MPI_Win_sync (e.g., within
    a passive target epoch, which may be provided with
    MPI_Win_lock_all) are not shown in the patterns above.
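
Only as a sketch for case 1 (assuming at least two processes on one
node; names are arbitrary), the write-read pattern with variable A
could be written as:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm nodecomm;
    MPI_Win  win;
    int     *mybase, *A, rank;
    MPI_Aint asize;
    int      adisp;

    MPI_Init(&argc, &argv);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &rank);

    MPI_Win_allocate_shared((MPI_Aint)sizeof(int), sizeof(int),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);
    MPI_Win_shared_query(win, 0, &asize, &adisp, &A); /* A in rank 0's portion */

    MPI_Win_fence(0, win);                 /* start of the epoch            */
    if (rank == 0)
        A[0] = 1;                          /* A=val_1 in P0                 */
    MPI_Win_fence(0, win);                 /* Sync-to-P1 --> Sync-from-P0   */
    if (rank == 1)
        printf("load(A) = %d\n", A[0]);    /* guaranteed to load val_1 = 1  */
    MPI_Win_fence(0, win);                 /* end of the epoch              */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}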

Patterns with lock/unlock synchronization:

Within passive target communication, two locks L1 and L2 may be
scheduled either L1 before L2 or L2 before L1.
In the following patterns, the arrow means that the
lock in P0 was scheduled before the lock in P1.

For the following load/store/sync patterns
 
     process P0          process P1
 
     MPI_Win_lock
       exclusive 
     A=val_1
     MPI_Win_unlock  --> MPI_Win_lock
                           shared or
                           exclusive 
                         load(A)
                         MPI_Win_unlock

     MPI_Win_lock
       shared or
       exclusive 
     load(B)
     MPI_Win_unlock  --> MPI_Win_lock
                           exclusive 
                         B=val_2
                         MPI_Win_unlock

     MPI_Win_lock
       exclusive
     C=val_3
     MPI_Win_unlock  --> MPI_Win_lock
                           exclusive
                         C=val_4
                         load (C)
                         MPI_Win_unlock

Note that each rank of a window is associated with a separate lock.
In a shared memory window, these locks are not tied to specific
memory portions of the shared window, i.e., each lock can be used
to protect any portion of the shared memory window.

 
In the patterns above, it is guaranteed 
 - that the load(A) in P1 loads val_1 (this is the write-read rule),
 - that the load(B) in P0 is not affected by the store 
   of val_2 in P1 (read-write rule), and
 - that the load(C) in P1 loads val_4 (write-write rule).
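
Only as a sketch of the first lock/unlock pattern (write-read rule);
the send/recv is added here solely to enforce that the lock in P0 is
granted before the lock in P1, because the patterns above assume this
ordering; names are arbitrary:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm nodecomm;
    MPI_Win  win;
    int     *mybase, *A, rank, dummy = 0;
    MPI_Aint asize;
    int      adisp;

    MPI_Init(&argc, &argv);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &rank);

    MPI_Win_allocate_shared((MPI_Aint)sizeof(int), sizeof(int),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);
    MPI_Win_shared_query(win, 0, &asize, &adisp, &A); /* A in rank 0's portion */

    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);  /* lock of rank 0        */
        A[0] = 1;                                     /* A=val_1               */
        MPI_Win_unlock(0, win);
        MPI_Send(&dummy, 1, MPI_INT, 1, 0, nodecomm); /* forces the lock order */
    } else if (rank == 1) {
        MPI_Recv(&dummy, 1, MPI_INT, 0, 0, nodecomm, MPI_STATUS_IGNORE);
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);     /* same lock (rank 0)    */
        printf("load(A) = %d\n", A[0]);               /* write-read rule:
                                                         loads val_1 = 1       */
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}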

----
I do not understand which rules are guaranteed for 
- MPI_Win_flush (and MPI_Win_flush_all)
- MPI_Win_flush_local (and MPI_Win_flush_local_all)
I hope that somebody understands these calls and can produce the
related patterns. 
----

About Sect. 11.5.5 Assertions - MPI_MODE_NOSTORE:

I expect that the proposal in #429 is not helpful:
it extends the hint about "not updated by stores since 
last synchronization" from local stores to all stores by
the whole process group. The reason for this hint is to avoid 
a "cache synchronization".

This cache synchronization is local, and therefore the
remote stores do not count.

The new proposal is simple:

p452:1 and p452:9: "stores" --> "local stores"

Additionally, I would add the following sentence:
"In the case of a shared memory window, such local stores 
can be issued to any portion of the shared memory window." 

Reason: Nobody should think that "local" means "only to the
window portion that was defined in the local MPI_WIN_ALLOCATE_SHARED".
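
Only as a sketch of the proposed "local stores" reading (reusing the
fence synchronization from above; names are arbitrary): whether
MPI_MODE_NOSTORE may be passed then depends only on the stores of the
asserting process itself, even for a shared memory window:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm nodecomm;
    MPI_Win  win;
    int     *mybase, *A, rank;
    MPI_Aint asize;
    int      adisp;

    MPI_Init(&argc, &argv);
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    MPI_Comm_rank(nodecomm, &rank);

    MPI_Win_allocate_shared((MPI_Aint)sizeof(int), sizeof(int),
                            MPI_INFO_NULL, nodecomm, &mybase, &win);
    MPI_Win_shared_query(win, 0, &asize, &adisp, &A);

    MPI_Win_fence(0, win);
    if (rank == 0)
        A[0] = 1;              /* only rank 0 stores into the shared window */

    /* With "local stores", only rank 0 must omit MPI_MODE_NOSTORE here;
       all other ranks may assert it although rank 0 stored into the
       (shared) window since the last synchronization.                   */
    MPI_Win_fence(rank == 0 ? 0 : MPI_MODE_NOSTORE, win);

    if (rank == 1)
        printf("load(A) = %d\n", A[0]);

    MPI_Win_fence(MPI_MODE_NOSTORE, win);  /* no process stored since the last fence */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}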


Best regards
Rolf

-- 
Dr. Rolf Rabenseifner . . . . . . . . . .. email rabenseifner at hlrs.de
High Performance Computing Center (HLRS) . phone ++49(0)711/685-65530
University of Stuttgart . . . . . . . . .. fax ++49(0)711 / 685-65832
Head of Dpmt Parallel Computing . . . www.hlrs.de/people/rabenseifner
Nobelstr. 19, D-70550 Stuttgart, Germany . . . . (Office: Room 1.307)


