MPI Forum, March 2011.  RMA chapter first reading.

General Notes:
~~~~~~~~~~~~~

* We've modified lock/unlock to allow an origin process to lock the same window at multiple targets.  We need to make sure there is text for this, since it differs from the MPI-2 semantics.  Also, we need to make sure the old text forbidding this is gone.  Are we allowing both shared and exclusive locks?  (I assume yes.)

* In terms of Accumulate operation same_op compatibility, CAS is defined to be its own operation.  We need text to make this clear.  It would be an easy mistake to interpret CAS as being an MPI_REPLACE operation.

* We need a request-based operation example.  I've volunteered to put something together.
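
Re: the CAS note above, here is a minimal sketch in plain C (a local model, not MPI calls) of why CAS should not be read as an MPI_REPLACE operation.  It assumes CAS has the usual compare-and-swap semantics.  The names model_fetch_and_replace and model_compare_and_swap are hypothetical and exist only for this illustration; `target` stands in for one location in a window.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a replace-style update: the store to the
 * target is unconditional.  Returns the previous target value. */
int model_fetch_and_replace(int *target, int origin)
{
    assert(target != NULL);
    int old = *target;
    *target = origin;
    return old;
}

/* Hypothetical model of CAS: the store happens only when the target
 * currently equals `compare`.  The previous target value is returned
 * in either case, so the caller can tell whether the swap took place. */
int model_compare_and_swap(int *target, int compare, int origin)
{
    assert(target != NULL);
    int old = *target;
    if (old == compare)
        *target = origin;
    return old;
}
```

Starting from a target value of 5, model_compare_and_swap(&t, 0, 9) returns 5 and leaves t at 5, whereas model_fetch_and_replace(&t, 9) returns 5 and sets t to 9.  The conditional store is what separates the two, which may be a useful way to motivate treating CAS as its own operation in the same_op text.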


Specific Notes:
~~~~~~~~~~~~~~

* Pg. 1-2, Lines 42+, 1 -- For the bulleted list, the category should be given before the list of operations.  E.g. "Remote write: MPI_PUT, MPI_RPUT."

* Pg. 27, Fig. 11.1 -- There was a request to make this figure simpler or add another simpler figure.  IMO, the figure is complicated but highlights an important issue.  Re: caption, is this picture really limited to the MPI_WIN_SEPARATE model?  Also, it seems like the "process memory" box also represents cache and registers.

* Pg. 28, Lines 39-41 -- I have a note to update this text.  I don't remember what the issue was; is it something to do with UNIFIED vs. SEPARATE?

* Pg. 45, Lines 13-15 -- This text has two scenarios entangled with each other.  Replace with "The behavior of MPI RMA operations may be /undefined/ in some situations.  For example, the result of several origin processes performing concurrent MPI_PUT operations to the same target location is undefined.  In addition, the result of a single origin process performing multiple MPI_PUT operations to the same target location within the same access epoch is also undefined."

* Pg. 45, Line 34 -- Replace "are erroneous" with "are detected and reported to the user."

* Pg. 46, Line 46 -- Replace "local memory" with "a window"

* Pg. 46, Lines 39+ -- #3: This is assuming the MPI implementation will never touch bytes that the user has not asked to be updated.  Is this reasonable?  Consider an optimized memcpy for intra-node transfer that uses SIMD copy that spills over and then corrects with non-SIMD instructions.  Is this an invalid implementation?

* Pg. 48, Ex. 11.7 -- Insert an MPI_Sync() or store_fence() before the barrier on Process B.

* Pg. 51, Line 32 -- Can an MPI_Sync() before the load X in Process B also correct this?

* Pg. 51-53, Ex. 11.13-11.15 -- Move to examples section

* Pg. 52, Line 11 -- Replace the while loop with a do ... while (z != 0) loop.

* Pg. 53, Lines 28-30 -- Replace "On the other hand, ..." with "Concurrent accumulate operations with different origin and target pairs are not ordered."

* Pg. 53, Line 46 -- The discussion of "program" order should mention that this applies to operations issued within the same epoch.
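
Re: the Pg. 46, Lines 39+ note above, here is a minimal sketch in plain C of the kind of copy that question has in mind.  It is not taken from any MPI implementation.  CHUNK, spill_copy, store_byte, and the `between` hook are hypothetical names; the hook stands in for a concurrent update to a neighboring byte that lands between the over-wide store and the correction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 8 /* hypothetical vector width, in bytes */

typedef void (*hook_fn)(void *ctx);

/* Hypothetical stand-in for another process storing to a neighbor byte. */
void store_byte(void *ctx) { *(unsigned char *)ctx = 0x77; }

/* Copies n bytes from src to dst, but always stores whole CHUNK-sized
 * blocks.  When n is not a multiple of CHUNK, the last store spills past
 * dst + n; the spilled bytes are saved beforehand and restored afterwards.
 * dst must have room for n rounded up to a multiple of CHUNK.
 * `between` (may be NULL) runs after the spill and before the restore. */
void spill_copy(unsigned char *dst, const unsigned char *src, size_t n,
                hook_fn between, void *ctx)
{
    assert(dst != NULL && src != NULL);

    size_t full = n / CHUNK * CHUNK; /* bytes covered by whole chunks */
    size_t tail = n - full;          /* requested bytes in the last chunk */

    memcpy(dst, src, full);

    if (tail > 0) {
        size_t spill = CHUNK - tail; /* bytes written beyond the request */
        unsigned char saved[CHUNK];
        unsigned char block[CHUNK] = {0};

        memcpy(saved, dst + n, spill);    /* remember the neighbor bytes */
        memcpy(block, src + full, tail);  /* read only the requested bytes */
        memcpy(dst + full, block, CHUNK); /* over-wide store clobbers neighbors */

        if (between != NULL)
            between(ctx);                 /* simulated concurrent update */

        memcpy(dst + n, saved, spill);    /* correction restores stale values */
    }
}
```

With a 16-byte destination filled with 0xAA and a 5-byte source, spill_copy(dst, src, 5, store_byte, &dst[5]) ends with dst[5] equal to 0xAA: the simulated neighbor update of 0x77 is lost, even though the requested 5 bytes are copied correctly.  Whether the text should rule out this style of implementation is the open question.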