[Mpi3-rma] Updated proposal 1
Torsten Hoefler
htor at illinois.edu
Tue Feb 1 22:04:06 CST 2011
Hello All,
I updated the proposal as discussed in our last telecon (including
Rajeev's comments -- thanks!). Significant changes:
- moved all info keys to win_create
- added info key about operation
- changed ordering info key to be more flexible
- fixed examples
- added a note for Pavan to fix a comment
- changed the color of the proposal 2 merges (as they are still being
discussed and need a better letter than "n").
The complete diff is attached to this email and the documents on the
wiki are updated. Please review!
https://svn.mpi-forum.org/trac/mpi-forum-web/attachment/wiki/mpi3-rma-proposal1/
Thanks & All the Best,
Torsten
--
bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
Torsten Hoefler | Performance Modeling and Simulation Lead
Blue Waters Directorate | University of Illinois (UIUC)
1205 W Clark Street | Urbana, IL, 61801
NCSA Building | +01 (217) 244-7736
-------------- next part --------------
Index: one-side-2.tex
===================================================================
--- one-side-2.tex (revision 65)
+++ one-side-2.tex (working copy)
@@ -101,13 +115,13 @@
update)\hadd{, \mpifunc{MPI\_GET\_ACCUMULATE},
\mpifunc{MPI\_FETCH\_AND\_OP} (remote fetch and update),
\mpifunc{MPI\_COMPARE\_AND\_SWAP} (remote atomic swap
-operations), \mpifunc{MPI\_RPUT}, \mpifunc{MPI\_RGET},
- \mpifunc{MPI\_RACCUMULATE} and \mpifunc{MPI\_RGET\_ACCUMULATE}.}
+operations), \padd{\mpifunc{MPI\_RPUT}, \mpifunc{MPI\_RGET},
+ \mpifunc{MPI\_RACCUMULATE} and \mpifunc{MPI\_RGET\_ACCUMULATE}}.}
\gadd{When a reference is made to ``accumulate'' operations in the
following, it refers to the following operations:
\mpifunc{MPI\_ACCUMULATE}, \mpifunc{MPI\_GET\_ACCUMULATE},
\mpifunc{MPI\_FETCH\_AND\_OP}, \mpifunc{MPI\_COMPARE\_AND\_SWAP},
- \mpifunc{MPI\_RACCUMULATE} and \mpifunc{MPI\_RGET\_ACCUMULATE}.}
+ \padd{\mpifunc{MPI\_RACCUMULATE} and \mpifunc{MPI\_RGET\_ACCUMULATE}}.}
\hadd{\MPI/ supports two fundamentally different memory models. The
first model makes no assumption about memory consistency and is
@@ -199,6 +213,7 @@
%\subsection{\hadd{Collective Memory Window Creation}}
\subsection{Window Creation}
+\label{chap:one-side-2:win_create}
\begin{funcdef}{MPI\_WIN\_CREATE(base, size, disp\_unit, info, comm, win)}
\funcarg{\IN}{base}{initial address of window (choice)}
@@ -261,9 +276,29 @@
3-party communication, and \RMA/ can be implemented with no (less)
asynchronous
agent activity at this process.
+\haddbegin
+\item{\infokey{accumulate\_ordering}} --- if set to \constskip{none},
+then no ordering will be guaranteed for accumulate calls (see
+Section~\ref{sec:1sided-ordering}). The key can be set to a
+comma-separated list of required access orderings at the target. Allowed
+values in the comma-separated list are \constskip{rr}, \constskip{wr},
+\constskip{rw}, and \constskip{ww} for read-read, write-read,
+read-write, write-write ordering, respectively. For example, if only
+read-read and write-write ordering is required, then the value of the
+\infokey{accumulate\_ordering} key could be set to \constskip{rr,ww}.
+The order of values is not significant. If this info argument is not
+specified, then it defaults to \constskip{rr,rw,wr,ww}.
+\item{\infokey{accumulate\_ops}} --- if set to \constskip{same\_op},
+then the implementation will assume that all concurrent accumulate calls
+to the same target address will use the same operation. If set to
+\constskip{same\_op\_no\_op}, then the implementation will assume that
+all concurrent accumulate calls to the same target address will use the
+same operation or \const{MPI\_NO\_OP}. This can eliminate the need to
+protect access for certain operation types where the hardware can
+guarantee atomicity.
+\haddend
\end{description}
-\gcomment{Should \infokey{accumulate\_ordering} be here as well?}
\haddbegin
\begin{users}
@@ -945,13 +980,13 @@
\label{sec:onesided-putget}
\MPI/ supports \hreplace{three}{the following} \RMA/ communication calls: \mpifunc{MPI\_PUT}
-\hreplace{transfers}{and \mpifunc{MPI\_RPUT} transfer} data from the
+\preplace{transfers}{and \mpifunc{MPI\_RPUT} transfer} data from the
caller memory (origin) to the target memory;
-\mpifunc{MPI\_GET} \hreplace{transfers}{and \mpifunc{MPI\_RGET} transfer} data from the target memory to the caller
+\mpifunc{MPI\_GET} \preplace{transfers}{and \mpifunc{MPI\_RGET} transfer} data from the target memory to the caller
memory;
-\hreplace{and}{} \mpifunc{MPI\_ACCUMULATE} \hreplace{updates}{and \mpifunc{MPI\_RACCUMULATE} update} locations in the target memory,
+\hreplace{and}{} \mpifunc{MPI\_ACCUMULATE} \preplace{updates}{and \mpifunc{MPI\_RACCUMULATE} update} locations in the target memory,
e.g.\hadd{,} by adding to these locations values sent from the caller
-memory\hreplace{.}{; \mpifunc{MPI\_GET\_ACCUMULATE}, \mpifunc{MPI\_RGET\_ACCUMULATE} and
+memory\preplace{.}{; \mpifunc{MPI\_GET\_ACCUMULATE}, \mpifunc{MPI\_RGET\_ACCUMULATE} and
\mpifunc{MPI\_FETCH\_AND\_OP} atomically return the data
before the accumulate operation; and
\mpifunc{MPI\_COMPARE\_AND\_SWAP} performs a remote compare and swap
@@ -962,7 +997,7 @@
a subsequent {\em synchronization} call is issued by the caller on
the involved window object. These synchronization calls are described in
Section~\ref{sec:1sided-sync}, page~\pageref{sec:1sided-sync}.
-\hadd{Transfers can also be completed with calls to flush routines, see
+\padd{Transfers can also be completed with calls to flush routines, see
Section~\ref{sec:1sided-flush} for details. For the
\mpifunc{MPI\_RPUT}, \mpifunc{MPI\_RGET},
\mpifunc{MPI\_RACCUMULATE}, and
@@ -1156,7 +1191,7 @@
\end{implementors}
-\haddbegin
+\paddbegin
\begin{funcdef}{MPI\_RPUT(origin\_addr, origin\_count,
origin\_datatype, target\_rank, target\_disp, target\_count,
@@ -1221,7 +1256,7 @@
operation might complete locally.
\end{users}
-\haddend
+\paddend
\subsection{Get}
@@ -1264,7 +1299,7 @@
in the origin buffer.
-\haddbegin
+\paddbegin
\begin{funcdef}{MPI\_RGET(origin\_addr, origin\_count,
origin\_datatype, target\_rank, target\_disp, target\_count, \\
@@ -1307,7 +1342,7 @@
\mpifunc{MPI\_RGET} operation indicates that the data is available
in the origin buffer.
-\haddend
+\paddend
\subsection{Examples}
\label{sec:1sided-example}
@@ -1536,7 +1571,7 @@
%tricky for the user to decide which OP is available remotely.}
\const{MPI\_REPLACE} can be used only in \mpifunc{MPI\_ACCUMULATE},
-\hreplace{}{ \mpifunc{MPI\_RACCUMULATE},
+\preplace{}{ \mpifunc{MPI\_RACCUMULATE},
\mpifunc{MPI\_GET\_ACCUMULATE}, and
\mpifunc{MPI\_RGET\_ACCUMULATE}} \gadd{, but }not in collective
reduction operations\gdelete{,} such as \mpifunc{MPI\_REDUCE}.
@@ -1549,7 +1584,7 @@
\end{users}
-\haddbegin
+\paddbegin
\begin{funcdef}{MPI\_RACCUMULATE(origin\_addr, origin\_count, origin\_datatype, target\_rank, target\_disp, target\_count,
target\_datatype, op, win, req)}
@@ -1584,9 +1619,9 @@
Similar to \mpifunc{MPI\_ACCUMULATE}, except that it returns a request
handle that can be waited or tested on.
-\gcomment{Reword to not end on ``on''.}
+\gcomment{Reword to not end on ``on'' -- Pavan does this.}
-\haddend
+\paddend
\begin{example}{\rm
@@ -1718,7 +1753,8 @@
%
A new predefined operation, \const{MPI\_NO\_OP}, is defined.
It corresponds to the associative function $f(a,b) = a$; i.e., the current
-value in the target memory is returned in the result buffer at the origin.
+value in the target memory is returned in the result buffer at the
+origin, and no operation is performed on the target buffer.
%
\const{MPI\_NO\_OP} can be used only in \mpifunc{MPI\_GET\_ACCUMULATE}
and \mpifunc{MPI\_FETCH\_AND\_OP}, not in \mpifunc{MPI\_ACCUMULATE} or
@@ -1734,6 +1770,7 @@
have different constraints on concurrent updates.
\end{users}
+\paddbegin
\begin{funcdef}{MPI\_RGET\_ACCUMULATE(origin\_addr, origin\_count,
origin\_datatype, result\_addr, result\_count, results\_datatype,
@@ -1773,8 +1810,8 @@
Similar to \mpifunc{MPI\_GET\_ACCUMULATE}, except that it returns a
request handle that can be waited or tested on.
+\paddend
-
\subsubsection{Fetch and Op Function}
\label{sec:1sided-fetchandop}
@@ -3719,12 +3756,8 @@
operation once an update to that location has started, until the
update becomes visible in the public window copy. There is one
exception to this rule, in the case where the same variable is updated
-by two concurrent accumulates that use the same operation, with the same
-predefined datatype, on the same window. \hcomment{Brian and Torsten think
-that this is limiting the usefulness of the programming model
-significantly, can we relax this (remove the ``same op'' restriction?
-However, we have to be careful to still allow hardware optimizations of
-subsets of operations only. Can this be done?}
+by two concurrent accumulates \hreplace{that use the same operation, }{}with the same
+predefined datatype, on the same window.
\item
A put or accumulate must not access a target window once a local update
or a put or accumulate update to another (overlapping) target window
@@ -3880,11 +3913,13 @@
Process A: Process B:
window location X
+MPI_Win_lock_all() MPI_Win_lock_all()
store X /* update to private&public copy of B */
MPI_Win_sync
MPI_Barrier MPI_Barrier
MPI_Get(X) /* ok, read from window */
MPI_Win_flush_local(B)
+MPI_Win_unlock_all() MPI_Win_unlock_all()
/* read value */
\end{verbatim}
@@ -3939,13 +3974,14 @@
\begin{verbatim}
Process A: Process B:
window location X
-
+MPI_Win_lock_all() MPI_Win_lock_all()
MPI_Put(X) /* update to window */
MPI_Win_flush(B)
MPI_Barrier MPI_Barrier
MPI_Win_sync
load X
+MPI_Win_unlock_all() MPI_Win_unlock_all()
\end{verbatim}
Note that the private copy of X has been updated after the barrier.
\end{example}
@@ -3977,6 +4013,7 @@
Process A: Process B:
window location X
X=2
+MPI_Win_lock_all() MPI_Win_lock_all()
MPI_Win_sync
MPI_Barrier MPI_Barrier
@@ -3988,7 +4025,7 @@
MPI_Win_flush(A) MPI_Win_flush(A)
done done
-MPI_Barrier MPI_Barrier
+MPI_Win_unlock_all() MPI_Win_unlock_all()
\end{verbatim}
\end{example}
@@ -4010,6 +4047,7 @@
window location X window location Y
window location T
+MPI_Win_lock_all() MPI_Win_lock_all()
X=1 Y=1
MPI_Win_sync MPI_Win_sync
MPI_Barrier MPI_Barrier
@@ -4025,6 +4063,7 @@
// critical region // critical region
MPI_Accumulate(X, MPI_REPLACE, 0) MPI_Accumulate(Y, MPI_REPLACE, 0)
MPI_Win_flush(A) MPI_Win_flush(A)
+MPI_Win_unlock_all() MPI_Win_unlock_all()
\end{verbatim}
\end{example}
@@ -4042,6 +4081,7 @@
Process A: Process B...:
atomic location A
A=0
+MPI_Win_lock_all() MPI_Win_lock_all()
MPI_Win_sync
MPI_Barrier MPI_Barrier
stack variable r=1 stack variable r=1
@@ -4052,6 +4092,7 @@
// critical region // critical region
r = MPI_Compare_and_swap(A, 1, 0) r = MPI_Compare_and_swap(A, 1, 0)
MPI_Win_flush(A) MPI_Win_flush(A)
+MPI_Win_unlock_all() MPI_Win_unlock_all()
\end{verbatim}
\end{example}
@@ -4212,36 +4253,36 @@
Accumulate calls enable element-wise atomic read and write to remote
memory locations. MPI specifies ordering between accumulate operations
from one process to the same (or overlapping) memory locations at
-another process. The default ordering is strict ordering which
-guarantees that overlapping updates from the same source to a remote
-location are committed in program order and that reads (e.g., with
-\mpifunc{MPI\_GET\_ACCUMULATE}) and writes (e.g., with
+another process on a per-datatype granularity. The default ordering is
+strict ordering which guarantees that overlapping updates from the same
+source to a remote location are committed in program order and that
+reads (e.g., with \mpifunc{MPI\_GET\_ACCUMULATE}) and writes (e.g., with
\mpifunc{MPI\_ACCUMULATE}) are executed and committed in program order.
-Ordering only applies to operations originating at the same target that
-access overlapping memory regions. MPI does not provide any guarantees
-for accesses or updates from different targets to overlapping memory
-regions.
+Ordering only applies to operations originating at the same origin that
+access overlapping target memory regions. MPI does not provide any
+guarantees for accesses or updates from different origins to overlapping
+target memory regions.
The default strict ordering may incur a significant performance penalty.
MPI specifies the info key \infokey{accumulate\_ordering} to allow relaxation
of the ordering semantics when specified to any window creation
-function. The key can have the values \infoval{complete}, \infoval{partial}, or
-\infoval{unordered}. The \infoval{complete} value is the default and thus identical
-to not having the \infokey{accumulate\_ordering} info key set at all.
+function. The possible values for this info key are discussed in
+Section~\ref{chap:one-side-2:win_create}.
-\begin{description}
-\item [\infoval{complete}] ordering guarantees that all accumulate operations to
-overlapping memory are observed and committed in program order.
-\item [\infoval{partial}] ordering guarantees ordering between read-read,
-write-write, and write-read but not between read-write operations.
-\item [\infoval{unordered}] ordering does not guarantee any ordering between
-operations.
-\end{description}
+%The key can have the values \infoval{complete}, \infoval{partial}, or
+%\infoval{unordered}. The \infoval{complete} value is the default and thus identical
+%to not having the \infokey{accumulate\_ordering} info key set at all.
-\hcomment{fix ordering in the other places too, mention often!!}
+%\begin{description}
+%\item [\infoval{complete}] ordering guarantees that all accumulate operations to
+%overlapping memory are observed and committed in program order.
+%\item [\infoval{partial}] ordering guarantees ordering between read-read,
+%write-write, and write-read but not between read-write operations.
+%\item [\infoval{unordered}] ordering does not guarantee any ordering between
+%operations.
+%\end{description}
-\hcomment{write-read ordering is not easy to provide for reliable
- networks; InfiniBand, for example, doesn't provide it.}
+%\hcomment{fix ordering in the other places too, mention often!!}
\haddend