Is "The outcome of conflicting accesses to the same memory locations is
undefined" necessary because of the need to support cache incoherent

Most application codes that use GA/ARMCI require multiple gets
(read-only) and multiple accumulates within a single epoch.  Heck,
merely allowing unlimited read-only access would solve most of my

Is it possible for the standard to allow processes to expose read-only
windows which cannot be modified within an epoch, but for which
unlimited local and remote gets are permitted?
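
To make the question concrete, here is a minimal sketch of the semantics I have in mind. It is a toy model in plain C, not an MPI interface: the ro_window type and the ro_* functions are hypothetical names invented for illustration. While the read-only epoch is open, any number of overlapping gets succeed and every put is refused, so no access can conflict.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model, not an MPI API: a window that can be
 * marked read-only for the duration of an epoch. */
typedef struct {
    unsigned char *base;      /* exposed memory */
    size_t         size;      /* number of bytes exposed */
    int            read_only; /* nonzero while a read-only epoch is open */
} ro_window;

void ro_epoch_begin(ro_window *w) { w->read_only = 1; }
void ro_epoch_end(ro_window *w)   { w->read_only = 0; }

/* Get: always permitted. Returns 0 on success, -1 if out of range. */
int ro_get(const ro_window *w, size_t off, void *dst, size_t len)
{
    if (off > w->size || len > w->size - off)
        return -1;
    memcpy(dst, w->base + off, len);
    return 0;
}

/* Put: refused with -2 while the window is read-only,
 * -1 if out of range, 0 on success. */
int ro_put(ro_window *w, size_t off, const void *src, size_t len)
{
    if (w->read_only)
        return -2;
    if (off > w->size || len > w->size - off)
        return -1;
    memcpy(w->base + off, src, len);
    return 0;
}
```

Since nothing can modify the window inside the epoch, the outcome of every get is well defined no matter how the gets overlap.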


On Sun, Nov 22, 2009 at 5:33 PM, Torsten Hoefler <htor at cs.indiana.edu> wrote:
> Hello all,
> I think that the current RMA spec is not all that bad and that it's
> possible to extend it slightly in order to meet our requirements and
> maintain MPI's orthogonality and look&feel. In this sense, I follow
> Jesper's and Hubert's proposals of small additions to the current spec
> instead of a complete revamp. However, in any case, we need a much
> better user-documentation because while I'm convinced that the standard
> captures RMA semantics very well, I think that the complexity of the
> topic is hard to grasp for an average user (who often thinks that
> everything should be coherent etc.). That's one difference to ARMCI,
> which just doesn't look as complex because it handles only a fraction of
> the cases/platforms that MPI supports.
> The main issues that I address in the draft are:
> 1) we need local (point-to-point) windows
> 2) we want to perform all RMA ops on arbitrary memory
> 3) simultaneous accesses to the same memory should not be illegal but
>   rather undefined in their outcome (this is subtle but important)
> 4) we want test&set, test&accumulate, and compare&swap (and friends)
> 5) we want user-defined RMA operations
> 6) we want to take advantage of architectures that offer stronger
>   memory models and progression (coherent&consistent)
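
For readers less familiar with the operations named in item 4, the following is a minimal sketch of their semantics on local memory, using C11 <stdatomic.h>. The function names are hypothetical and are not proposed MPI calls. Reading test&accumulate as fetch-and-add is an assumption on my part.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Local-memory analogues only. All names are hypothetical. */

/* test&set: set the flag and return the state it had before. */
bool test_and_set(atomic_flag *f)
{
    return atomic_flag_test_and_set(f);
}

/* Assumed meaning of test&accumulate: add operand to *target
 * and return the value *target held before the addition. */
long fetch_and_accumulate(atomic_long *target, long operand)
{
    return atomic_fetch_add(target, operand);
}

/* compare&swap: store desired only if *target equals expected.
 * Returns the value observed in *target before the operation. */
long compare_and_swap(atomic_long *target, long expected, long desired)
{
    /* On failure the call overwrites expected with the current value;
     * on success expected already equals the old value. */
    atomic_compare_exchange_strong(target, &expected, desired);
    return expected;
}
```

Each call is a single indivisible read-modify-write, which is what distinguishes these operations from a get followed by a put.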
> All those issues are orthogonal and can be discussed separately. I
> propose changes in a draft chapter at
> http://www.unixer.de/sec/one-side-2.pdf . This is of course the first
> version and probably inconsistent, however, I think it solves all
> problems discussed above. The semantics of the color-coding are:
> - green: removed text from MPI-2.2
> - red: added text
> - blue: comments/discussion (not part of the chapter)
> The issues mentioned above are addressed on the following pages:
> 1) 4-5
> 2) 26
> 3) 3,6,33-34
> 4) 14-15
> 5) 13-15
> 6) 37-38
> Right now, all proposed changes are source-compatible with MPI-2.2 (at a
> small price of elegance though).
> I am sure there were reasons for all the restrictions on window/memory
> access in MPI-2.0. However, I think that accepting undefined outcomes
> should allow for efficient implementations. Error checking should be
> left to tools on top of MPI (cf. other memory consistency models (who
> knows omp_flush? ;) and threading).
> From a performance perspective, there are two open things to discuss:
> 1) function arguments get expensive when they spill on the stack
>  - RMA ops have horribly long argument lists
>  - the number of args could be reduced if we assume symmetric
>   count/datatype on both ends or if we send raw (MPI_BYTE) data (which
>   I think is not preferable but might be unavoidable to match SHMEM or
>   ARMCI performance)
> 2) supporting full MPI semantics comes at some cost, however, one
>   could special-case for performance reasons (e.g., make contiguous
>   MPI_BYTE transfers a special case). This costs one or two branches in
>   each of the RMA calls (imho not too bad). We could even endorse this
>   parameter combination as some kind of fast-mode in the standard.
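
The special-casing in item 2 can be sketched as follows. This is a toy in plain C under stated assumptions: toy_type is a hypothetical stand-in for an MPI datatype, toy_put is not an MPI call, and the copy is local rather than remote. It only shows where the extra branch sits.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical datatype descriptor standing in for an MPI datatype. */
typedef struct {
    size_t elem_size; /* bytes per element */
    size_t stride;    /* bytes between element starts at the target */
} toy_type;

/* Contiguous bytes: the candidate "fast mode" combination. */
static const toy_type TOY_BYTE = { 1, 1 };

/* Copy count packed elements from origin into target, laid out
 * according to t. Returns 1 if the fast path was taken, else 0. */
int toy_put(void *target, const void *origin, size_t count, toy_type t)
{
    /* Fast path: contiguous byte data becomes a single block copy. */
    if (t.elem_size == 1 && t.stride == 1) {
        memcpy(target, origin, count);
        return 1;
    }

    /* General path: place each element at its strided position. */
    const unsigned char *src = origin;
    unsigned char *dst = target;
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * t.stride, src + i * t.elem_size, t.elem_size);
    return 0;
}
```

The cost to callers on the general path is the one comparison at the top of the function.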
> From a scalability standpoint, we might consider the following two
> additions:
> 1) collective OP registration
>  - ops can be identified by numbers or memory addresses (function
>   pointers). However, if each process allocates them separately and
>   sends them around, \Omega(P) memory could be required in the worst
>   case. Collective allocation would reduce this to O(1)
> 2) collective window/memory allocation
>  - potential \Omega(P) memory to store offsets; much discussed before,
>   the solution is simple: allocate memory collectively (gives the
>   library the chance to try to find "good addresses" (e.g., same local
>   addresses or aligned/strided global addresses)).
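
The memory argument in item 2 can be illustrated with a small sketch. Both structs and both functions are hypothetical and show only the address arithmetic, assuming that collective allocation manages to place the segments at a fixed stride.

```c
#include <stddef.h>
#include <stdint.h>

/* Separate allocation: each process stores one base per peer,
 * so the table grows with the number of processes P. */
typedef struct {
    const uintptr_t *bases; /* nprocs entries */
    size_t           nprocs;
} table_window;

uintptr_t table_addr(const table_window *w, size_t rank, size_t disp)
{
    return w->bases[rank] + disp;
}

/* Collective allocation with strided placement (an assumption):
 * two words describe every peer, independent of P. */
typedef struct {
    uintptr_t base0;  /* base address on rank 0 */
    uintptr_t stride; /* distance between consecutive ranks' bases */
} strided_window;

uintptr_t strided_addr(const strided_window *w, size_t rank, size_t disp)
{
    return w->base0 + rank * w->stride + disp;
}
```

With identical local addresses on every rank, stride is simply 0 and the descriptor is the same on all processes.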
> I prepared a quick and incomplete implementation of the local window
> proposal (which uses ARMCI for RMA accesses and falls back to MPI for
> collective windows). The code is at http://www.unixer.de/sec/rma.cpp .
> Please let me know what you think. I would not mind dropping this
> completely if we decide that it's not a viable way to go (I only
> invested two days so far).
> Sorry for the long mail & All the Best,
>  Torsten
