[Mpi3-rma] Alternative RMA discussion

Tue Nov 24 17:58:11 CST 2009

Is "The outcome of conflicting accesses to the same memory locations is
undefined" necessary because of the need to support cache-incoherent
architectures?

Most application codes that use GA/ARMCI require multiple gets
(read-only) and multiple accumulates within a single epoch.  Heck,
merely allowing unlimited read-only access would solve most of my
problems.

Is it possible for the standard to allow processes to expose read-only
windows which cannot be modified within an epoch, but for which
unlimited local and remote gets are permitted?
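
To make the access pattern concrete, here is a minimal sketch in plain C
(no MPI calls) of which overlapping accesses within one epoch would count
as conflicting. It follows a simplified reading of the MPI-2.2 rules:
overlapping gets are fine, overlapping accumulates are fine if they use
the same op and datatype (assumed here, not checked), and anything
overlapping a put is a conflict. All names (access_kind, rma_access,
epoch_has_conflict) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one RMA access in an epoch (not an MPI type). */
typedef enum { ACC_GET, ACC_PUT, ACC_ACCUMULATE } access_kind;

typedef struct {
    access_kind kind;
    size_t offset;  /* first byte touched in the target window */
    size_t length;  /* number of bytes touched */
} rma_access;

/* True if the two byte ranges share at least one byte. */
static bool overlaps(const rma_access *a, const rma_access *b)
{
    if (a->length == 0 || b->length == 0)
        return false;
    return a->offset < b->offset + b->length &&
           b->offset < a->offset + a->length;
}

/* Simplified pairwise rule: get/get and accumulate/accumulate may
 * overlap (same op and datatype assumed); any other overlapping
 * pair is treated as conflicting. */
static bool pair_conflicts(const rma_access *a, const rma_access *b)
{
    if (!overlaps(a, b))
        return false;
    if (a->kind == ACC_GET && b->kind == ACC_GET)
        return false;
    if (a->kind == ACC_ACCUMULATE && b->kind == ACC_ACCUMULATE)
        return false;
    return true;
}

/* True if any two accesses recorded for the epoch conflict. */
bool epoch_has_conflict(const rma_access *acc, size_t n)
{
    assert(acc != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            if (pair_conflicts(&acc[i], &acc[j]))
                return true;
    return false;
}
```

Under this model a read-only window is simply one whose epoch log
contains only ACC_GET entries, so epoch_has_conflict can never return
true for it no matter how many gets overlap.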

Jeff

On Sun, Nov 22, 2009 at 5:33 PM, Torsten Hoefler <htor at cs.indiana.edu> wrote:
> Hello all,
>
> I think that the current RMA spec is not all that bad and that it's
> possible to extend it slightly in order to meet our requirements and
> maintain MPI's orthogonality and look&feel. In this sense, I follow
> Jesper's and Hubert's proposals of small additions to the current spec
> instead of a complete revamp. However, in any case, we need a much
> better user-documentation because while I'm convinced that the standard
> captures RMA semantics very well, I think that the complexity of the
> topic is hard to grasp for an average user (who often thinks that
> everything should be coherent etc.). That's one difference to ARMCI,
> which just doesn't look as complex because it handles only a fraction of
> the cases/platforms that MPI supports.
>
> The main issues that I address in the draft are:
>
> 1) we need local (point-to-point) windows
>
> 2) we want to perform all RMA ops on arbitrary memory
>
> 3) simultaneous accesses to the same memory should not be illegal but
>   rather undefined in their outcome (this is subtle but important)
>
> 4) we want test&set, test&accumulate, and compare&swap (and friends)
>
> 5) we want user-defined RMA operations
>
> 6) we want to take advantage of architectures that offer a stronger
>   memory model and progression (coherent&consistent)
>
> All those issues are orthogonal and can be discussed separately. I
> propose changes in a draft chapter at
> http://www.unixer.de/sec/one-side-2.pdf . This is of course the first
> version and probably inconsistent, however, I think it solves all
> problems discussed above. The semantics of the color-coding are:
>
> - green: removed text from MPI-2.2
> - red: added text
> - blue: comments/discussion (not part of the chapter)
>
> The issues mentioned above are addressed on the following pages:
>
> 1) 4-5
> 2) 26
> 3) 3,6,33-34
> 4) 14-15
> 5) 13-15
> 6) 37-38
>
> Right now, all proposed changes are source-compatible with MPI-2.2 (at a
> small price of elegance though).
>
> I am sure there were reasons for all the restrictions on window/memory
> access in MPI-2.0. However, I think that accepting undefined outcomes
> should allow for efficient implementations. Error checking should be
> left to tools on top of MPI (cf. other memory consistency models (who
> knows omp_flush? ;) and threading).
>
> From a performance-perspective, there are two open things to discuss:
>
> 1) function arguments get expensive when they spill on the stack
>  - RMA ops have horribly long argument lists
>  - the number of args could be reduced if we assume symmetric
>   count/datatype on both ends or if we send raw (MPI_BYTE) data (which
>   I think is not preferable but might be unavoidable to match SHMEM or
>   ARMCI performance)
> 2) supporting full MPI semantics comes at some cost, however, one
>   could special-case for performance reasons (e.g., make contiguous
>   MPI_BYTE transfers a special case). This costs one or two branches in
>   each of the RMA calls (imho not too bad). We could even endorse this
>   parameter combination as some kind of fast-mode in the standard.
>
> From a scalability standpoint, we might consider the following two
> additions:
>
> 1) collective OP registration
>  - ops can be identified by numbers or memory addresses (function
>   pointers). However, if each process allocates them separately and
>   sends them around, \Omega(P) memory could be required in the worst
>   case. Collective allocation would reduce this to O(1)
> 2) collective window/memory allocation
>  - potential \Omega(P) memory to store offsets; much discussed before,
>   the solution is simple: allocate memory collectively (gives the
>   library the chance to try to find "good addresses" (e.g., same local
>   addresses or aligned/strided global addresses)).
>
> I prepared a quick and incomplete implementation of the local window
> proposal (which uses ARMCI for RMA accesses and falls back to MPI for
> collective windows). The code is at http://www.unixer.de/sec/rma.cpp .
>
> Please let me know what you think. I would not mind dropping this
> completely if we decide that it's not a viable way to go (I only
> invested two days so far).
>
> Sorry for the long mail & All the Best,
>  Torsten
>
> --
>  bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
> Torsten Hoefler       | Postdoctoral Fellow
> Open Systems Lab      | Indiana University
> 150 S. Woodlawn Ave.  | Bloomington, IN, 474045, USA
> Lindley Hall Room 135 | +01 (812) 856-0501
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>

-- 
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
http://home.uchicago.edu/~jhammond/