<div class="gmail_quote">On Sun, Sep 23, 2012 at 2:34 PM, N.M. Maclaren <span dir="ltr">&lt;<a href="mailto:nmm1@cam.ac.uk" target="_blank">nmm1@cam.ac.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div id=":7os">MPI does not specify that.  Both Fortran and C have mechanisms that can<br>

be used for inter-process synchronisation that do not involve calling MPI,<br>

and therefore will not call an MPI fence.  Writing to a file and reading<br>

the data is one classic one, and is heavily used.  I have seen data take<br>

5 seconds to get from one thread to another, which is ample time for I/O,<br>

and I have seen that logic cause this trouble with shared memory used by<br>

other forms of RDMA and synchronisation using file I/O.  And, yes, the<br>

RDMA did use a write fence.</div></blockquote><div><br></div><div>Obviously the read fence is the relevant issue here. Your example is now the following (cf. MPI-3 Example 11.9):</div><div><br></div><div><div>origin:</div>

<div>MPI_Win_create</div><div>MPI_Win_lock</div><div>MPI_Put</div><div>MPI_Win_unlock</div><div>notify(side_channel) // e.g., global variable in shared memory, memory-mapped serial line, file system</div><div><br></div><div>

target:</div><div>double buffer[10] = {0};</div><div>MPI_Win_create(buffer,....)</div><div>wait(side_channel) // e.g., spin</div><div>x = buffer[0]</div></div><div><br></div><div>It would have saved us a great deal of time if you had written this 30 messages ago, but in any case, we can make some observations.</div>

<div><br></div><div><div>1. If wait(side_channel) is a macro or inline function that the compiler can guarantee does not itself touch buffer, the compiler could reorder it with the read from buffer[]. This is the lack of sequence point that you were concerned with.</div>

<div><br></div><div>2. Even with a sequence point, some hardware (including POWER and SPARC in RMO mode) reorders independent loads, so buffer[0] could be loaded before side_channel even though the instructions appear in the correct order.</div>

<div><br></div><div>3. Suppose there was a data dependency in the sense of</div><div><br></div><div>double *ptr = wait(side_channel);</div><div>x = ptr[0];</div><div><br></div><div>This is still not guaranteed to be correct on Alpha, which reorders DEPENDENT loads. For more details, see DATA DEPENDENCY BARRIERS in <a href="http://www.kernel.org/doc/Documentation/memory-barriers.txt">http://www.kernel.org/doc/Documentation/memory-barriers.txt</a> and Table 5 and Figure 10 of <a href="http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.14a.pdf">http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.14a.pdf</a>.</div>

</div><div><br></div><div>4. Obvious fixes include (a) don't communicate through the side channel and (b) protect the access as in</div><div><br></div><div><div>MPI_Win_lock</div><div>x = buffer[0]</div><div>MPI_Win_unlock</div>

</div><div><br></div><div>Note that many applications will use this anyway, because it's onerous for the application to ensure that all passive-mode RMA operations have completed (in the sense of MPI_Win_unlock on the target returning).</div>

<div><br></div><div>5. MPI-3 is more explicit about the memory model, providing MPI_WIN_UNIFIED and MPI_WIN_SEPARATE. Under the latter, direct access (without MPI_Win_lock/unlock or other synchronization such as MPI_Win_sync) is invalid. Read MPI-3 page 454. I believe your complaint can be summed up by the sentence "In the RMA unified memory model, an update by a put or accumulate call to a public window copy eventually becomes visible in the private copy in process memory without additional RMA calls." In this sentence, "eventually" roughly means "once a read memory fence is issued by the target, perhaps as a side effect of some unrelated call". Since "eventually" could be a long time, some side-channel notification could allow access before the result is visible to the target process. Fortunately, "eventually" is an ambiguous term. ;-)</div>

<div><br></div><div>Rereading page 456, I think it could be more explicit about the possible need for user memory fences, especially since they can be necessary on some hardware regardless of compiler optimization level. Although the guidelines are somewhat loosely worded, the examples clarify them. Note especially Example 11.9, which covers exactly the read-ordering issue discussed here, and Example 11.7, which deals with the converse.</div>
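<div><br></div><div>To make the ordering requirement in points 1 to 3 concrete, here is a minimal sketch using C11 atomics. It is an illustration under stated assumptions, not something the MPI standard prescribes: the names origin_put_and_notify and target_try_read are hypothetical, a plain store stands in for MPI_Put followed by MPI_Win_unlock, and both sides are assumed to share an address space (for example, two threads). Whether an acquire load alone is enough for data delivered by RDMA depends on the platform.</div>

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: `buffer` plays the role of the window memory,
 * `side_channel` plays the role of the out-of-band notification. */
static double buffer[10];
static atomic_bool side_channel; /* static storage, so it starts out false */

/* Origin side: write the data, then notify.
 * The release store keeps the write to buffer[] from being reordered
 * after the flag update, by the compiler or by the CPU. */
static void origin_put_and_notify(double value)
{
    buffer[0] = value; /* assumption: stands in for MPI_Put + MPI_Win_unlock */
    atomic_store_explicit(&side_channel, true, memory_order_release);
}

/* Target side: poll the flag once.
 * The acquire load keeps the read of buffer[] from being hoisted above
 * the flag check, which is the read fence discussed above.
 * Returns true and stores the value in *x once the notification is seen. */
static bool target_try_read(double *x)
{
    assert(x != NULL);
    if (!atomic_load_explicit(&side_channel, memory_order_acquire))
        return false;
    *x = buffer[0];
    return true;
}
```

<div>The target would call target_try_read in its spin loop in place of wait(side_channel) followed by a bare read of buffer[0]. Replacing the acquire load with a plain or volatile read of the flag brings back exactly the reordering problems in points 1 and 2.</div>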

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":7os"><div class="im">

<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

As a specific example, Fortran compilers can and do move arrays over<br>

procedure calls that do not appear to use them; C ones do not, but are<br>

in theory allowed to.<br>

</blockquote>

<br>

Passive-mode RMA is only compliant for memory allocated using<br>

MPI_Alloc_mem(). Since MPI_Alloc_mem() cannot be used portably by Fortran,<br>

passive-mode RMA is not portable for callers from vanilla Fortran. <br>

</blockquote>

<br></div>

That has been wrong since Fortran 2003, which provides C interoperability,<br>

including the ability to use buffers allocated in C. </div></blockquote><div><br></div><div>I was referring to the dialects of Fortran supported by the MPI standard prior to this week.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div id=":7os">

That is necessary but not sufficient, both in theory and practice.<br>

But, yes, active one-sided is semantically comparable to non-blocking.<br>

<br>

I am not going to be dragged into describing the signal handling fiasco,<br>

but I have seen what you claim to be unused used in two compilers.<br></div></blockquote><div><br></div><div>When I find compiler bugs, I report them. Can you point to the ticket where this issue was reported? Surely _someone_ was annoyed that the compiler was incapable of producing correct code for any multithreaded kernel, libpthread, database, or web browser...</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":7os">

Indeed, one of them triggered me into trying (and failing) to get SOME<br>

kind of semantics defined for volatile in WG14.</div></blockquote></div><div><br></div><div>Even with the current "specification", existing compilers are riddled with bugs related to volatile.</div><div><br></div>

<a href="http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf">http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf</a><br><a href="http://blog.regehr.org/archives/503">http://blog.regehr.org/archives/503</a><div>

<br></div><div>Worse, it's useless for what most people try to use it for.</div><div><br></div><div><a href="http://kernel.org/doc/Documentation/volatile-considered-harmful.txt">http://kernel.org/doc/Documentation/volatile-considered-harmful.txt</a></div>