[Mpi3-rma] Alternative RMA discussion

Sun Dec 13 00:28:56 CST 2009

Torsten,
        Thanks for taking the initiative in looking into whether the current spec can be modified to meet the desired requirements.
Some comments:

* If MPI_RMA_QUERY indicates that the system is cache coherent, it reduces the need for the user to call synchronization functions,
particularly if we leave it up to the application to know when the target memory is ready to be accessed. Since one of the
criticisms of the current interface is too many synchronization functions, it may be worth looking into whether the synchronization
requirements can be relaxed in some way in the cache coherent case.

* In Dan Bonachea's paper on why MPI-2 RMA is not useful for implementing PGAS languages
(www.eecs.berkeley.edu/~bonachea/upc/bonachea-duell-mpi.pdf), he says that he can only use passive-target RMA because the target
cannot be expected to participate in the RMA. One of his complaints is that lock-unlock must be called separately for each target
process, which serializes accesses to multiple targets. This gets back to the issue of synchronization requirements.

* In the MPI_Win_create_local case, how would you communicate the MPI_Win object? I think it would need a new MPI_WIN datatype.
Also, does MPI_Win_create_local need a communicator argument?

* On pg 9, ln 40 it should be MPI_Get instead of MPI_Put. Similarly on pg 13, ln 20.

Rajeev  

> -----Original Message-----
> From: mpi3-rma-bounces at lists.mpi-forum.org 
> [mailto:mpi3-rma-bounces at lists.mpi-forum.org] On Behalf Of 
> Torsten Hoefler
> Sent: Sunday, November 22, 2009 5:33 PM
> To: mpi3-rma at lists.mpi-forum.org
> Subject: [Mpi3-rma] Alternative RMA discussion
> 
> Hello all,
> 
> I think that the current RMA spec is not all that bad and 
> that it's possible to extend it slightly in order to meet our 
> requirements and maintain MPI's orthogonality and look&feel. 
> In this sense, I follow Jesper's and Hubert's proposals of 
> small additions to the current spec instead of a complete 
> revamp. However, in any case, we need a much better 
> user-documentation because while I'm convinced that the 
> standard captures RMA semantics very well, I think that the 
> complexity of the topic is hard to grasp for an average user 
> (who often thinks that everything should be coherent etc.). 
> That's one difference to ARMCI, which just doesn't look as 
> complex because it handles only a fraction of the 
> cases/platforms that MPI supports. 
> 
> The main issues that I address in the draft are:
> 
> 1) we need local (point-to-point) windows 
> 
> 2) we want to perform all RMA ops on arbitrary memory
> 
> 3) simultaneous accesses to the same memory should not be illegal but
>    rather undefined in their outcome (this is subtle but important)
> 
> 4) we want test&set, test&accumulate, and compare&swap (and friends)
> 
> 5) we want user-defined RMA operations
> 
> 6) we want to take advantage of architectures that offer stronger
>    memory models and progression (coherent&consistent)
> 
> All those issues are orthogonal and can be discussed 
> separately. I propose changes in a draft chapter at 
> http://www.unixer.de/sec/one-side-2.pdf . This is of course 
> the first version and probably inconsistent, however, I think 
> it solves all problems discussed above. The semantics of the 
> color-coding are:
> 
> - green: removed text from MPI-2.2
> - red: added text
> - blue: comments/discussion (not part of the chapter)
> 
> The issues mentioned above are addressed on the following pages:
> 
> 1) 4-5
> 2) 26
> 3) 3,6,33-34
> 4) 14-15
> 5) 13-15
> 6) 37-38
> 
> Right now, all proposed changes are source-compatible with 
> MPI-2.2 (at a small price in elegance though).
> 
> I am sure there were reasons for all the restrictions on 
> window/memory access in MPI-2.0. However, I think that 
> accepting undefined outcomes should allow for efficient 
> implementations. Error checking should be left to tools on 
> top of MPI (cf. other memory consistency models (who knows 
> omp_flush? ;) and threading).
> 
> From a performance perspective, there are two open things to discuss:
> 
> 1) function arguments get expensive when they spill on the stack
>  - RMA ops have horribly long argument lists
>  - the number of args could be reduced if we assume symmetric
>    count/datatype on both ends or if we send raw (MPI_BYTE) data
>    (which I think is not preferable but might be unavoidable to
>    match SHMEM or ARMCI performance)
> 2) supporting full MPI semantics comes at some cost, however, one
>    could special-case for performance reasons (e.g., make contiguous
>    MPI_BYTE transfers a special case). This costs one or two branches
>    in each of the RMA calls (imho not too bad). We could even endorse
>    this parameter combination as some kind of fast-mode in the
>    standard.
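
The fast-mode idea in item 2 could be sketched roughly as follows. This is
plain C rather than MPI: sketch_type, SKETCH_BYTE, sketch_put and the PATH_*
values are hypothetical names invented for illustration, and the "transfer"
is a local memcpy standing in for whatever the network layer would do. It is
only meant to show where the one or two extra branches would sit in front of
the general path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an MPI datatype: just enough information to
 * tell contiguous bytes apart from a strided layout. Not an MPI type. */
typedef struct {
    size_t elem_size; /* bytes per element */
    size_t stride;    /* bytes between element starts; equals elem_size
                         when the layout is contiguous */
} sketch_type;

const sketch_type SKETCH_BYTE = {1, 1};

typedef enum { PATH_FAST, PATH_GENERAL } sketch_path;

/* Copy `count` elements from origin to target and report which path ran.
 * Both buffers are local here; a real implementation would hand the data
 * to the network instead of calling memcpy. */
sketch_path sketch_put(void *target, const sketch_type *ttype,
                       const void *origin, const sketch_type *otype,
                       size_t count)
{
    /* The extra branches: are both sides contiguous bytes? */
    if (otype->elem_size == 1 && otype->stride == 1 &&
        ttype->elem_size == 1 && ttype->stride == 1) {
        memcpy(target, origin, count); /* fast mode: one flat copy */
        return PATH_FAST;
    }

    /* General path: element-by-element copy honoring both layouts. */
    assert(otype->elem_size == ttype->elem_size);
    const unsigned char *src = origin;
    unsigned char *dst = target;
    for (size_t i = 0; i < count; i++) {
        memcpy(dst + i * ttype->stride,
               src + i * otype->stride,
               otype->elem_size);
    }
    return PATH_GENERAL;
}
```

Calling sketch_put with SKETCH_BYTE on both sides takes the flat copy; any
other layout falls through to the loop. The cost of the check is paid on
every call, which is the trade-off described above.
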
> 
> From a scalability standpoint, we might consider the following two
> additions:
> 
> 1) collective OP registration
>  - ops can be identified by numbers or memory addresses (function
>    pointers). However, if each process allocates them separately and
>    sends them around, \Omega(P) memory could be required in the worst
>    case. Collective allocation would reduce this to O(1)
> 2) collective window/memory allocation
>  - potential \Omega(P) memory to store offsets; much discussed before,
>    the solution is simple: allocate memory collectively (gives the
>    library the chance to try to find "good addresses", e.g., same
>    local addresses or aligned/strided global addresses).
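
To illustrate the memory argument for collective window allocation, here is
a small sketch in plain C. All names (table_window, symmetric_window and the
helper functions) are hypothetical and nothing here is taken from the draft.
It assumes the library managed to place the window at the same local address
on every rank, which is only one of the "good address" options mentioned
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Independent allocation: every rank may end up with a different base
 * address, so each process keeps one entry per rank (P entries). */
typedef struct {
    int nranks;
    const uintptr_t *base; /* base[r] = window base address on rank r */
} table_window;

/* Collective allocation that found the same address everywhere:
 * a single base describes all ranks. */
typedef struct {
    uintptr_t base;
} symmetric_window;

/* Remote address lookup through the per-rank table. */
uintptr_t table_addr(const table_window *w, int rank, size_t disp)
{
    assert(rank >= 0 && rank < w->nranks);
    return w->base[rank] + disp;
}

/* Remote address computation with a symmetric base; rank is not needed
 * for the address itself, only for routing the transfer. */
uintptr_t symmetric_addr(const symmetric_window *w, int rank, size_t disp)
{
    (void)rank;
    return w->base + disp;
}

/* Per-process bookkeeping in bytes for each scheme. */
size_t table_bytes(int nranks)
{
    return (size_t)nranks * sizeof(uintptr_t); /* grows with P */
}

size_t symmetric_bytes(void)
{
    return sizeof(uintptr_t); /* independent of P */
}
```

With the table, bookkeeping grows linearly with the number of processes; with
a symmetric base it stays constant. Aligned or strided global addresses would
need a slightly richer formula but the same constant storage.
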
> 
> I prepared a quick and incomplete implementation of the local 
> window proposal (which uses ARMCI for RMA accesses and falls 
> back to MPI for collective windows). The code is at 
> http://www.unixer.de/sec/rma.cpp .
> 
> Please let me know what you think. I would not mind dropping 
> this completely if we decide that it's not a viable way to go 
> (I only invested two days so far).
> 
> Sorry for the long mail & All the Best,
>   Torsten