[Mpi3-rma] Alternative RMA discussion

Torsten Hoefler htor at cs.indiana.edu
Sun Nov 22 17:33:11 CST 2009

Hello all,

I think that the current RMA spec is not all that bad and that it's
possible to extend it slightly in order to meet our requirements and
maintain MPI's orthogonality and look&feel. In this sense, I follow
Jesper's and Hubert's proposals of small additions to the current spec
instead of a complete revamp. However, in any case, we need much
better user documentation because, while I'm convinced that the standard
captures RMA semantics very well, I think that the complexity of the
topic is hard to grasp for an average user (who often thinks that
everything should be coherent etc.). That's one difference to ARMCI,
which just doesn't look as complex because it handles only a fraction of
the cases/platforms that MPI supports. 

The main issues that I address in the draft are:

1) we need local (point-to-point) windows 

2) we want to perform all RMA ops on arbitrary memory

3) simultaneous accesses to the same memory should not be illegal but
   rather undefined in their outcome (this is subtle but important)

4) we want test&set, test&accumulate, and compare&swap (and friends)

5) we want user-defined RMA operations

6) we want to take advantage of architectures that offer stronger
   memory models and progression (coherent&consistent)
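
As a rough illustration of the semantics behind item 4, here is a minimal
sketch using C11 atomics on local memory. It is not the interface from the
draft chapter: the helper names compare_and_swap and fetch_and_add are
hypothetical, and an RMA version would act on memory in a remote window
rather than on a local variable.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: if *target == expected, store desired.
 * Returns the value that was in *target before the call. */
long compare_and_swap(_Atomic long *target, long expected, long desired)
{
    assert(target != 0);
    /* On failure, the current value of *target is written into 'expected',
     * so 'expected' holds the old value in both the success and failure case. */
    atomic_compare_exchange_strong(target, &expected, desired);
    return expected;
}

/* Hypothetical helper: atomically add 'addend' to *target.
 * Returns the value that was in *target before the addition. */
long fetch_and_add(_Atomic long *target, long addend)
{
    assert(target != 0);
    return atomic_fetch_add(target, addend);
}
```

Both helpers return the previous value, which is what lets a caller tell
whether its compare&swap actually took effect.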

All those issues are orthogonal and can be discussed separately. I
propose changes in a draft chapter at
http://www.unixer.de/sec/one-side-2.pdf . This is of course the first
version and probably inconsistent; however, I think it solves all
problems discussed above. The semantics of the color-coding are:

- green: removed text from MPI-2.2
- red: added text
- blue: comments/discussion (not part of the chapter)

The issues mentioned above are addressed on the following pages:

1) 4-5
2) 26
3) 3,6,33-34
4) 14-15
5) 13-15
6) 37-38

Right now, all proposed changes are source-compatible with MPI-2.2 (at a
small price of elegance though).

I am sure there were reasons for all the restrictions on window/memory
access in MPI-2.0. However, I think that accepting undefined outcomes
should allow for efficient implementations. Error checking should be
left to tools on top of MPI (cf. other memory consistency models (who
knows omp_flush? ;) and threading).

From a performance perspective, there are two open things to discuss:

1) function arguments get expensive when they spill on the stack
 - RMA ops have horribly long argument lists
 - the number of args could be reduced if we assume symmetric
   count/datatype on both ends or if we send raw (MPI_BYTE) data (which
   I think is not preferable but might be unavoidable to match SHMEM or
   ARMCI performance)
2) supporting full MPI semantics comes at some cost, however, one 
   could special-case for performance reasons (e.g., make contiguous
   MPI_BYTE transfers a special case). This costs one or two branches in
   each of the RMA calls (imho not too bad). We could even endorse this
   parameter combination as some kind of fast-mode in the standard. 
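
The special-casing in point 2 can be sketched roughly as follows. This is
plain C with a local copy standing in for the actual transfer; layout_t and
put_bytes are hypothetical names, not MPI types or calls. A layout with
stride == blocklen plays the role of a contiguous MPI_BYTE buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a datatype: blocks of 'blocklen' bytes whose
 * starts are 'stride' bytes apart. stride == blocklen means contiguous. */
typedef struct {
    size_t blocklen;
    size_t stride;
} layout_t;

/* Copies 'count' blocks from src to dst.
 * Returns 1 if the contiguous fast path was taken, 0 for the general path. */
int put_bytes(void *dst, const void *src, size_t count,
              layout_t src_layout, layout_t dst_layout)
{
    assert(src_layout.blocklen == dst_layout.blocklen);
    assert(src_layout.stride >= src_layout.blocklen);
    assert(dst_layout.stride >= dst_layout.blocklen);

    /* The "one or two branches": both sides contiguous -> single bulk copy. */
    if (src_layout.stride == src_layout.blocklen &&
        dst_layout.stride == dst_layout.blocklen) {
        memcpy(dst, src, count * src_layout.blocklen);
        return 1;
    }

    /* General path: walk the blocks one by one. */
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    for (size_t i = 0; i < count; i++) {
        memcpy(d + i * dst_layout.stride,
               s + i * src_layout.stride,
               src_layout.blocklen);
    }
    return 0;
}
```

The fast path costs only the two comparisons at the top; everything else in
the general path is skipped for the common contiguous case.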

From a scalability standpoint, we might consider the following two
points:

1) collective OP registration
 - ops can be identified by numbers or memory addresses (function
   pointers). However, if each process allocates them separately and
   sends them around, \Omega(P) memory could be required in the worst
   case. Collective allocation would reduce this to O(1).
2) collective window/memory allocation
 - potential \Omega(P) memory to store offsets; much discussed before,
   the solution is simple: allocate memory collectively (gives the
   library the chance to try to find "good addresses" (e.g., same local
   addresses or aligned/strided global addresses)).
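
To make the memory argument in point 2 concrete, here is a small sketch that
contrasts the two bookkeeping schemes. All names (table_window_t,
strided_window_t and the helper functions) are hypothetical and only model
the address arithmetic; no MPI calls are involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Independent allocation: every process must remember one base address
 * per peer, i.e. a table with nprocs entries. */
typedef struct {
    int nprocs;
    const uintptr_t *bases; /* nprocs entries */
} table_window_t;

/* Collective allocation with strided global addresses (an assumption of
 * this sketch): rank r's segment starts at base + r * stride. */
typedef struct {
    uintptr_t base;
    size_t stride;
} strided_window_t;

uintptr_t table_target_addr(const table_window_t *w, int rank, size_t disp)
{
    assert(rank >= 0 && rank < w->nprocs);
    return w->bases[rank] + disp;
}

uintptr_t strided_target_addr(const strided_window_t *w, int rank, size_t disp)
{
    assert(rank >= 0);
    assert(disp < w->stride);
    return w->base + (uintptr_t)rank * w->stride + disp;
}

/* Per-process metadata: grows linearly with the number of processes... */
size_t table_metadata_bytes(int nprocs)
{
    return (size_t)nprocs * sizeof(uintptr_t);
}

/* ...versus a constant, independent of the number of processes. */
size_t strided_metadata_bytes(void)
{
    return sizeof(strided_window_t);
}
```

With the strided scheme a target address is computed from two stored values,
so the per-process state does not depend on P.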

I prepared a quick and incomplete implementation of the local window
proposal (which uses ARMCI for RMA accesses and falls back to MPI for
collective windows). The code is at http://www.unixer.de/sec/rma.cpp .

Please let me know what you think. I would not mind dropping this
completely if we decide that it's not a viable way to go (I only
invested two days so far).

Sorry for the long mail & All the Best,

 bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
Torsten Hoefler       | Postdoctoral Fellow
Open Systems Lab      | Indiana University    
150 S. Woodlawn Ave.  | Bloomington, IN, 474045, USA
Lindley Hall Room 135 | +01 (812) 856-0501
