[Mpi3-rma] Updated proposal 1
Torsten Hoefler
htor at illinois.edu
Tue Feb 1 22:04:06 CST 2011
Hello All,
I updated the proposal as discussed in our last telecon (including
Rajeev's comments -- thanks!). Significant changes:
- moved all info keys to win_create
- added info key about operation
- changed ordering info key to be more flexible
- fixed examples
- added a note for Pavan to fix a comment
- changed the color of the proposal 2 merges (as they are still being
discussed and need a better letter than "n").
The complete diff is attached to this email and the documents on the
wiki are updated. Please review!
https://svn.mpi-forum.org/trac/mpi-forum-web/attachment/wiki/mpi3-rma-proposal1/
Thanks & All the Best,
Torsten
--
bash$ :(){ :|:&};: --------------------- http://www.unixer.de/ -----
Torsten Hoefler | Performance Modeling and Simulation Lead
Blue Waters Directorate | University of Illinois (UIUC)
1205 W Clark Street | Urbana, IL, 61801
NCSA Building | +01 (217) 244-7736
-------------- next part --------------
Index: one-side-2.tex
===================================================================
--- one-side-2.tex (revision 65)
+++ one-side-2.tex (working copy)
@@ -101,13 +115,13 @@
update)\hadd{, \mpifunc{MPI\_GET\_ACCUMULATE},
\mpifunc{MPI\_FETCH\_AND\_OP} (remote fetch and update),
\mpifunc{MPI\_COMPARE\_AND\_SWAP} (remote atomic swap
-operations), \mpifunc{MPI\_RPUT}, \mpifunc{MPI\_RGET},
- \mpifunc{MPI\_RACCUMULATE} and \mpifunc{MPI\_RGET\_ACCUMULATE}.}
+operations), \padd{\mpifunc{MPI\_RPUT}, \mpifunc{MPI\_RGET},
+ \mpifunc{MPI\_RACCUMULATE} and \mpifunc{MPI\_RGET\_ACCUMULATE}}.}
\gadd{When a reference is made to ``accumulate'' operations in the
following, it refers to the following operations:
\mpifunc{MPI\_ACCUMULATE}, \mpifunc{MPI\_GET\_ACCUMULATE},
\mpifunc{MPI\_FETCH\_AND\_OP}, \mpifunc{MPI\_COMPARE\_AND\_SWAP},
- \mpifunc{MPI\_RACCUMULATE} and \mpifunc{MPI\_RGET\_ACCUMULATE}.}
+ \padd{\mpifunc{MPI\_RACCUMULATE} and \mpifunc{MPI\_RGET\_ACCUMULATE}}.}
\hadd{\MPI/ supports two fundamentally different memory models. The
first model makes no assumption about memory consistency and is
@@ -199,6 +213,7 @@
%\subsection{\hadd{Collective Memory Window Creation}}
\subsection{Window Creation}
+\label{chap:one-side-2:win_create}
\begin{funcdef}{MPI\_WIN\_CREATE(base, size, disp\_unit, info, comm, win)}
\funcarg{\IN}{base}{initial address of window (choice)}
@@ -261,9 +276,29 @@
3-party communication, and \RMA/ can be implemented with no (less)
asynchronous
agent activity at this process.
+\haddbegin
+\item{\infokey{accumulate\_ordering}} --- if set to \constskip{none},
+then no ordering will be guaranteed for accumulate calls (see
+Section~\ref{sec:1sided-ordering}). The key can be set to a
+comma-separated list of required access orderings at the target. Allowed
+values in the comma-separated list are \constskip{rr}, \constskip{wr},
+\constskip{rw}, and \constskip{ww} for read-read, write-read,
+read-write, write-write ordering, respectively. For example, if only
+read-read and write-write ordering is required, then the value of the
+\infokey{accumulate\_ordering} key could be set to \constskip{rr,ww}.
+The order of values is not significant. If this info argument is not
+specified, then it defaults to \constskip{rr,rw,wr,ww}.
+\item{\infokey{accumulate\_ops}} --- if set to \constskip{same\_op},
+then the implementation will assume that all concurrent accumulate calls
+to the same target address will use the same operation. If set to
+\constskip{same\_op\_no\_op}, then the implementation will assume that
+all concurrent accumulate calls to the same target address will use the
+same operation or \const{MPI\_NO\_OP}. This can eliminate the need to
+protect access for certain operation types where the hardware can
+guarantee atomicity.
+\haddend
\end{description}
-\gcomment{Should \infokey{accumulate\_ordering} be here as well?}
\haddbegin
\begin{users}
@@ -945,13 +980,13 @@
\label{sec:onesided-putget}
\MPI/ supports \hreplace{three}{the following} \RMA/ communication calls: \mpifunc{MPI\_PUT}
-\hreplace{transfers}{and \mpifunc{MPI\_RPUT} transfer} data from the
+\preplace{transfers}{and \mpifunc{MPI\_RPUT} transfer} data from the
caller memory (origin) to the target memory;
-\mpifunc{MPI\_GET} \hreplace{transfers}{and \mpifunc{MPI\_RGET} transfer} data from the target memory to the caller
+\mpifunc{MPI\_GET} \preplace{transfers}{and \mpifunc{MPI\_RGET} transfer} data from the target memory to the caller
memory;
-\hreplace{and}{} \mpifunc{MPI\_ACCUMULATE} \hreplace{updates}{and \mpifunc{MPI\_RACCUMULATE} update} locations in the target memory,
+\hreplace{and}{} \mpifunc{MPI\_ACCUMULATE} \preplace{updates}{and \mpifunc{MPI\_RACCUMULATE} update} locations in the target memory,
e.g.\hadd{,} by adding to these locations values sent from the caller
-memory\hreplace{.}{; \mpifunc{MPI\_GET\_ACCUMULATE}, \mpifunc{MPI\_RGET\_ACCUMULATE} and
+memory\preplace{.}{; \mpifunc{MPI\_GET\_ACCUMULATE}, \mpifunc{MPI\_RGET\_ACCUMULATE} and
\mpifunc{MPI\_FETCH\_AND\_OP} atomically return the data
before the accumulate operation; and
\mpifunc{MPI\_COMPARE\_AND\_SWAP} performs a remote compare and swap
@@ -962,7 +997,7 @@
a subsequent {\em synchronization} call is issued by the caller on
the involved window object. These synchronization calls are described in
Section~\ref{sec:1sided-sync}, page~\pageref{sec:1sided-sync}.
-\hadd{Transfers can also be completed with calls to flush routines, see
+\padd{Transfers can also be completed with calls to flush routines, see
Section~\ref{sec:1sided-flush} for details. For the
\mpifunc{MPI\_RPUT}, \mpifunc{MPI\_RGET},
\mpifunc{MPI\_RACCUMULATE}, and
@@ -1156,7 +1191,7 @@
\end{implementors}
-\haddbegin
+\paddbegin
\begin{funcdef}{MPI\_RPUT(origin\_addr, origin\_count,
origin\_datatype, target\_rank, target\_disp, target\_count,
@@ -1221,7 +1256,7 @@
operation might complete locally.
\end{users}
-\haddend
+\paddend
\subsection{Get}
@@ -1264,7 +1299,7 @@
in the origin buffer.
-\haddbegin
+\paddbegin
\begin{funcdef}{MPI\_RGET(origin\_addr, origin\_count,
origin\_datatype, target\_rank, target\_disp, target\_count, \\
@@ -1307,7 +1342,7 @@
\mpifunc{MPI\_RGET} operation indicates that the data is available
in the origin buffer.
-\haddend
+\paddend
\subsection{Examples}
\label{sec:1sided-example}
@@ -1536,7 +1571,7 @@
%tricky for the user to decide which OP is available remotely.}
\const{MPI\_REPLACE} can be used only in \mpifunc{MPI\_ACCUMULATE},
-\hreplace{}{ \mpifunc{MPI\_RACCUMULATE},
+\preplace{}{ \mpifunc{MPI\_RACCUMULATE},
\mpifunc{MPI\_GET\_ACCUMULATE}, and
\mpifunc{MPI\_RGET\_ACCUMULATE}} \gadd{, but }not in collective
reduction operations\gdelete{,} such as \mpifunc{MPI\_REDUCE}.
@@ -1549,7 +1584,7 @@
\end{users}
-\haddbegin
+\paddbegin
\begin{funcdef}{MPI\_RACCUMULATE(origin\_addr, origin\_count, origin\_datatype, target\_rank, target\_disp, target\_count,
target\_datatype, op, win, req)}
@@ -1584,9 +1619,9 @@
Similar to \mpifunc{MPI\_ACCUMULATE}, except that it returns a request
handle that can be waited or tested on.
-\gcomment{Reword to not end on ``on''.}
+\gcomment{Reword to not end on ``on'' -- Pavan does this.}
-\haddend
+\paddend
\begin{example}{\rm
@@ -1718,7 +1753,8 @@
%
A new predefined operation, \const{MPI\_NO\_OP}, is defined.
It corresponds to the associative function $f(a,b) = a$; i.e., the current
-value in the target memory is returned in the result buffer at the origin.
+value in the target memory is returned in the result buffer at the
+origin, and no operation is performed on the target buffer.
%
\const{MPI\_NO\_OP} can be used only in \mpifunc{MPI\_GET\_ACCUMULATE}
and \mpifunc{MPI\_FETCH\_AND\_OP}, not in \mpifunc{MPI\_ACCUMULATE} or
@@ -1734,6 +1770,7 @@
have different constraints on concurrent updates.
\end{users}
+\paddbegin
\begin{funcdef}{MPI\_RGET\_ACCUMULATE(origin\_addr, origin\_count,
origin\_datatype, result\_addr, result\_count, results\_datatype,
@@ -1773,8 +1810,8 @@
Similar to \mpifunc{MPI\_GET\_ACCUMULATE}, except that it returns a
request handle that can be waited or tested on.
+\paddend
-
\subsubsection{Fetch and Op Function}
\label{sec:1sided-fetchandop}
@@ -3719,12 +3756,8 @@
operation once an update to that location has started, until the
update becomes visible in the public window copy. There is one
exception to this rule, in the case where the same variable is updated
-by two concurrent accumulates that use the same operation, with the same
-predefined datatype, on the same window. \hcomment{Brian and Torsten think
-that this is limiting the usefulness of the programming model
-significantly, can we relax this (remove the ``same op'' restriction?
-However, we have to be careful to still allow hardware optimizations of
-subsets of operations only. Can this be done?}
+by two concurrent accumulates \hreplace{that use the same operation, }{}with the same
+predefined datatype, on the same window.
\item
A put or accumulate must not access a target window once a local update
or a put or accumulate update to another (overlapping) target window
@@ -3880,11 +3913,13 @@
Process A: Process B:
window location X
+MPI_Win_lock_all() MPI_Win_lock_all()
store X /* update to private&public copy of B */
MPI_Win_sync
MPI_Barrier MPI_Barrier
MPI_Get(X) /* ok, read from window */
MPI_Win_flush_local(B)
+MPI_Win_unlock_all() MPI_Win_unlock_all()
/* read value */
\end{verbatim}
@@ -3939,13 +3974,14 @@
\begin{verbatim}
Process A: Process B:
window location X
-
+MPI_Win_lock_all() MPI_Win_lock_all()
MPI_Put(X) /* update to window */
MPI_Win_flush(B)
MPI_Barrier MPI_Barrier
MPI_Win_sync
load X
+MPI_Win_unlock_all() MPI_Win_unlock_all()
\end{verbatim}
Note that the private copy of X has been updated after the barrier.
\end{example}
@@ -3977,6 +4013,7 @@
Process A: Process B:
window location X
X=2
+MPI_Win_lock_all() MPI_Win_lock_all()
MPI_Win_sync
MPI_Barrier MPI_Barrier
@@ -3988,7 +4025,7 @@
MPI_Win_flush(A) MPI_Win_flush(A)
done done
-MPI_Barrier MPI_Barrier
+MPI_Win_unlock_all() MPI_Win_unlock_all()
\end{verbatim}
\end{example}
@@ -4010,6 +4047,7 @@
window location X window location Y
window location T
+MPI_Win_lock_all() MPI_Win_lock_all()
X=1 Y=1
MPI_Win_sync MPI_Win_sync
MPI_Barrier MPI_Barrier
@@ -4025,6 +4063,7 @@
// critical region // critical region
MPI_Accumulate(X, MPI_REPLACE, 0) MPI_Accumulate(Y, MPI_REPLACE, 0)
MPI_Win_flush(A) MPI_Win_flush(A)
+MPI_Win_unlock_all() MPI_Win_unlock_all()
\end{verbatim}
\end{example}
@@ -4042,6 +4081,7 @@
Process A: Process B...:
atomic location A
A=0
+MPI_Win_lock_all() MPI_Win_lock_all()
MPI_Win_sync
MPI_Barrier MPI_Barrier
stack variable r=1 stack variable r=1
@@ -4052,6 +4092,7 @@
// critical region // critical region
r = MPI_Compare_and_swap(A, 1, 0) r = MPI_Compare_and_swap(A, 1, 0)
MPI_Win_flush(A) MPI_Win_flush(A)
+MPI_Win_unlock_all() MPI_Win_unlock_all()
\end{verbatim}
\end{example}
@@ -4212,36 +4253,36 @@
Accumulate calls enable element-wise atomic read and write to remote
memory locations. MPI specifies ordering between accumulate operations
from one process to the same (or overlapping) memory locations at
-another process. The default ordering is strict ordering which
-guarantees that overlapping updates from the same source to a remote
-location are committed in program order and that reads (e.g., with
-\mpifunc{MPI\_GET\_ACCUMULATE}) and writes (e.g., with
+another process on a per-datatype granularity. The default ordering is
+strict ordering which guarantees that overlapping updates from the same
+source to a remote location are committed in program order and that
+reads (e.g., with \mpifunc{MPI\_GET\_ACCUMULATE}) and writes (e.g., with
\mpifunc{MPI\_ACCUMULATE}) are executed and committed in program order.
-Ordering only applies to operations originating at the same target that
-access overlapping memory regions. MPI does not provide any guarantees
-for accesses or updates from different targets to overlapping memory
-regions.
+Ordering only applies to operations originating at the same origin that
+access overlapping target memory regions. MPI does not provide any
+guarantees for accesses or updates from different origins to overlapping
+target memory regions.
The default strict ordering may incur a significant performance penalty.
MPI specifies the info key \infokey{accumulate\_ordering} to allow relaxation
of the ordering semantics when specified to any window creation
-function. The key can have the values \infoval{complete}, \infoval{partial}, or
-\infoval{unordered}. The \infoval{complete} value is the default and thus identical
-to not having the \infokey{accumulate\_ordering} info key set at all.
+function. The possible values for this info key are discussed in
+Section~\ref{chap:one-side-2:win_create}.
-\begin{description}
-\item [\infoval{complete}] ordering guarantees that all accumulate operations to
-overlapping memory are observed and committed in program order.
-\item [\infoval{partial}] ordering guarantees ordering between read-read,
-write-write, and write-read but not between read-write operations.
-\item [\infoval{unordered}] ordering does not guarantee any ordering between
-operations.
-\end{description}
+%The key can have the values \infoval{complete}, \infoval{partial}, or
+%\infoval{unordered}. The \infoval{complete} value is the default and thus identical
+%to not having the \infokey{accumulate\_ordering} info key set at all.
-\hcomment{fix ordering in the other places too, mention often!!}
+%\begin{description}
+%\item [\infoval{complete}] ordering guarantees that all accumulate operations to
+%overlapping memory are observed and committed in program order.
+%\item [\infoval{partial}] ordering guarantees ordering between read-read,
+%write-write, and write-read but not between read-write operations.
+%\item [\infoval{unordered}] ordering does not guarantee any ordering between
+%operations.
+%\end{description}
-\hcomment{write-read ordering is not easy to provide for reliable
- networks; InfiniBand, for example, doesn't provide it.}
+%\hcomment{fix ordering in the other places too, mention often!!}
\haddend