[Mpi-forum] Discussion points from the MPI-<next> discussion today

Sun Sep 23 16:30:54 CDT 2012

On Sun, Sep 23, 2012 at 2:34 PM, N.M. Maclaren <nmm1 at cam.ac.uk> wrote:

> MPI does not specify that.  Both Fortran and C have mechanisms that can
> be used for inter-process synchronisation that do not involve calling MPI,
> and therefore will not call an MPI fence.  Writing to a file and reading
> the data is one classic one, and is heavily used.  I have seen data take
> 5 seconds to get from one thread to another, which is ample time for I/O,
> and I have seen that logic cause this trouble with shared memory used by
> other forms of RDMA and synchronisation using file I/O.  And, yes, the
> RDMA did use a write fence.
>

Obviously the read fence is the relevant issue here. Your example is now
the following (cf. MPI-3 Example 11.9):

origin:
MPI_Win_create
MPI_Win_lock
MPI_Put
MPI_Win_unlock
notify(side_channel) // e.g., global variable in shared memory,
                     // memory-mapped serial line, file system

target:
double buffer[10] = {0};
MPI_Win_create(buffer,....)
wait(side_channel) // e.g., spin
x = buffer[0]

It would have saved us a great deal of time if you had written this 30
messages ago, but in any case, we can make some observations.

1. If wait(side_channel) is a macro or inline function that the compiler
can guarantee does not itself touch buffer, the compiler could reorder it
with the read from buffer[]. This is the lack of a sequence point that you
were concerned about.

2. Even with a sequence point, some hardware (including POWER and SPARC)
reorders independent loads, thus buffer[0] could be loaded before
side_channel despite the instructions having the correct order.

3. Suppose there was a data dependency in the sense of

double *ptr = wait(side_channel);
x = ptr[0];

This is still not guaranteed to be correct on Alpha, which reorders
DEPENDENT loads. For more details, see DATA DEPENDENCY BARRIERS in
http://www.kernel.org/doc/Documentation/memory-barriers.txt and Table 5 and
Figure 10 of
http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.14a.pdf.

4. Obvious fixes include (a) don't communicate through the side channel and
(b) protect the access as in

MPI_Win_lock
x = buffer[0]
MPI_Win_unlock

Note that many applications will use this anyway because it's onerous for
the application to ensure that all passive mode RMA operations have
completed (in the sense of MPI_Win_unlock on the target returning).

5. MPI-3 is more explicit about the memory model, providing MPI_WIN_UNIFIED
and MPI_WIN_SEPARATE. In the latter, the direct access (without
MPI_Win_lock/unlock or other synchronization such as MPI_Win_sync) is
invalid. Read MPI-3 page 454. I believe your complaint can be summed up by
the sentence "In the RMA unified memory model, an update by a put or
accumulate call to a public window copy eventually becomes visible in the
private copy in process memory without additional RMA calls." In this
sentence, "eventually" roughly means "once a read memory fence has been
issued by the target, perhaps as a side effect of some unrelated call". Since
"eventually" could be a long time, some side-channel notification could
allow access before the result was visible to the target process.
Fortunately, "eventually" is an ambiguous term. ;-)

Rereading page 456, it could be more explicit about the possible
requirement for user memory fences, especially since it could be necessary
on some hardware independent of compiler optimization levels. Although the
guidelines are somewhat loosely worded, the examples clarify. Note
especially Example 11.9 which covers exactly the read ordering issue
discussed here and Example 11.7 which deals with the converse.
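
To make points 1 and 2 concrete, here is a minimal sketch of the read fence
in plain C11, using two threads in one process instead of MPI. Everything in
it is an illustrative assumption: the names origin, target_read and
run_handoff are hypothetical, the store to buffer[0] merely stands in for the
MPI_Put, and the atomic flag stands in for the side channel. It shows only
the ordering that the target needs (an acquire on the notification before the
read of buffer); it does not claim that this alone is sufficient for a real
RMA window, where the MPI implementation is responsible for the origin side.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins: buffer plays the role of the window memory,
   side_channel the role of the out-of-band notification. */
static double buffer[10];
static atomic_int side_channel;

/* Origin side: write the data, then notify with release ordering so the
   write to buffer cannot be reordered after the notification. */
static void *origin(void *arg)
{
    (void)arg;
    buffer[0] = 42.0; /* stands in for MPI_Put + MPI_Win_unlock */
    atomic_store_explicit(&side_channel, 1, memory_order_release);
    return NULL;
}

/* Target side: spin on the side channel with acquire ordering. The acquire
   load is the read fence: neither the compiler nor the hardware may move
   the load of buffer[0] ahead of it. */
static double target_read(void)
{
    while (atomic_load_explicit(&side_channel, memory_order_acquire) == 0)
        ; /* spin */
    return buffer[0];
}

/* Run one handoff and return the value the target observed. */
double run_handoff(void)
{
    pthread_t t;
    buffer[0] = 0.0;
    atomic_store(&side_channel, 0);
    pthread_create(&t, NULL, origin, NULL);
    double x = target_read();
    pthread_join(t, NULL);
    return x;
}
```

Call run_handoff() from a main() and build with a C11 compiler, adding
-pthread where the platform requires it. If the acquire load is weakened to
memory_order_relaxed, the language no longer forbids the reordering described
in points 1 and 2, even though it may never be observed on strongly ordered
hardware.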

>
>>> As a specific example, Fortran compilers can and do move arrays over
>>> procedure calls that do not appear to use them; C ones do not, but are
>>> in theory allowed to.
>>>
>>
>> Passive-mode RMA is only compliant for memory allocated using
>> MPI_Alloc_mem(). Since MPI_Alloc_mem() cannot be used portably by Fortran,
>> passive-mode RMA is not portable for callers from vanilla Fortran.
>>
>
> That has been wrong since Fortran 2003, which provides C interoperability,
> including the ability to use buffers allocated in C.
>

I was referring to the dialects of Fortran supported by the MPI standard
prior to this week.

> That is necessary but not sufficient, both in theory and practice.
> But, yes, active one-sided is semantically comparable to non-blocking.
>
> I am not going to be dragged into describing the signal handling fiasco,
> but I have seen what you claim to be unused used in two compilers.
>

When I find compiler bugs, I report them. Can you point to the ticket where
this issue was reported? Surely _someone_ was annoyed that the compiler was
incapable of producing correct code for any multithreaded kernel,
libpthread, database, or web browser...

> Indeed, one of them triggered me into trying (and failing) to get SOME
> kind of semantics defined for volatile in WG14.
>

Even with the current "specification", existing compilers are riddled with
bugs related to volatile.

http://www.cs.utah.edu/~regehr/papers/emsoft08-preprint.pdf
http://blog.regehr.org/archives/503

Worse, it's useless for what most people try to use it for.

http://kernel.org/doc/Documentation/volatile-considered-harmful.txt