Submitted by: Christian Siebert
Date: 2008-11-25
Initial Version: 11-14-2008
Description: Proposal for several minor textual corrections/improvements to the draft for the nonblocking collectives chapter.

--- coll.tex 2008-11-25 06:10:29.000000000 +0100
+++ coll_patched.tex 2008-11-25 18:13:41.000000000 +0100
@@ -4421,8 +4421,8 @@
 leads to better performance (i.e, avoids context switching and scheduler
 overheads and thread management~\cite{hoefler-ib-threads}) is the use of
 nonblocking collective communication. The model is similar to
 point-to-point communications. A nonblocking
-start call is used to start a collective communication. A separate
-complete call is needed to complete the communication. As in the
+start call is used to initiate a collective communication
+which is eventually completed by a separate call. As in the
 nonblocking point-to-point case, the communication can progress
 independently of the computations at all participating processes.
 Nonblocking collective communication can also be used to mitigate
@@ -4432,16 +4432,18 @@
 As in the point-to-point case, all start calls are local and return
 immediately, irrespective of the status of other processes. Multiple
 nonblocking collective communications can be outstanding on a single
-communicator. If the call causes some system resource to be exhausted,
-then it will fail and return an error code. Quality implementations of
-MPI should ensure that this happens only in ``pathological'' cases. That
-is, an MPI implementation should be able to support a large number of
+communicator. %If the call causes some system resource to be exhausted,
+%then it will fail and return an error code. Quality implementations of
+%MPI should ensure that this happens only in ``pathological'' cases. That is,
+%% chsi: Although the above two sentences are consistent with MPI-2.1
+%% p 48, l 17, their content is almost zero (see p 264, l 23).
+An MPI implementation should be able to support a large number of
 pending nonblocking operations.

 A nonblocking collective call indicates that the system may start
-copying data out of the send buffer and into the receive buffer. The
-buffers should not be accessed after a nonblocking collective operation
-is called, until it completed.
+copying data out of the send buffer and into the receive buffer. All
+associated buffers should not be accessed between the initiation and the
+completion of a nonblocking collective operation.
 %
 Collective operations complete when the local part of the operation has
 been performed (i.e., the semantics are guaranteed) and all buffers can
@@ -4467,9 +4469,9 @@
 implementation and is consistent to blocking point-to-point operations.

 \begin{implementors}
-Nonblocking collective operations can be implemented with a local
+Nonblocking collective operations can be implemented with local
 execution schedules~\cite{hoefler-sc07} using normal point-to-point
-communication using a reserved tag-space.
+communication and a reserved tag-space.
 \end{implementors}

 % to stay close to the current MPI semantics for
@@ -4489,7 +4491,7 @@
 \begin{rationale}
 Matching blocking and nonblocking collectives is not allowed because the
-implementation might choose different communication algorithms for both.
+implementation might choose different communication algorithms.
 Blocking collectives only need to be optimized for latency while
 nonblocking collectives have to find an equilibrium between latency,
 CPU overhead and asynchronous progression.
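
Not part of the patch, for illustration only: a minimal C sketch of the initiation/completion pattern and the buffer-access rule discussed in the hunks above, written against the proposed MPI_Iallreduce interface. The helper compute_independent_data() is a placeholder name, not something defined in the draft.

/* Sketch: neither sendbuf nor recvbuf is touched between the start call
   and MPI_Wait; overlap happens in compute_independent_data(). */
#include <mpi.h>

void compute_independent_data(void);   /* placeholder for unrelated work */

void reduce_with_overlap(MPI_Comm comm)
{
    double sendbuf[100], recvbuf[100];
    MPI_Request req;

    /* ... fill sendbuf ... */
    MPI_Iallreduce(sendbuf, recvbuf, 100, MPI_DOUBLE, MPI_SUM, comm, &req);
    compute_independent_data();         /* buffers are not accessed here */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* only now may recvbuf be read */
}
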
@@ -4555,10 +4557,10 @@

 \begin{users}
-A nonblocking barrier might sound like an oxymoron, however, there are codes
-that may move independent computations between the \mpifunc{MPI\_IBARRIER} and
-the subsequent \mpifunc{MPI\_$\{$WAIT,TEST$\}$} call to overlap the barrier
-latency to shorten possible waiting times. The semantic properties are also
+A nonblocking barrier might sound like an oxymoron, however, moving
+independent computations between the \mpifunc{MPI\_IBARRIER} and
+the subsequent completion call can overlap the barrier latency and
+therefore shorten possible waiting times. The semantic properties are also
 useful when mixing collectives and point-to-point messages.
 \end{users}

@@ -4614,10 +4616,10 @@
 and
 \mpiiidotiMergeNEWforSINGLEendI% MPI-2.1 round-two - end of modification
 \mpiarg{root}.
-On completion, the
+After completion, the
 \mpiiidotiMergeFinalREVIEWbegin{52.b}% MPI-2.1 Correction due to Reviews at MPI-2.1 Forum meeting April 26-28, 2008
 % contents of \mpiarg{root}'s communication buffer has been copied to all processes.
-content of \mpiarg{root}'s buffer is copied to all other processes.
+content of \mpiarg{root}'s buffer has been copied to all other processes.
 \mpiiidotiMergeFinalREVIEWendI{52.b}% MPI-2.1 End of correction
 %General, derived datatypes are allowed for \mpiarg{datatype}.

@@ -4664,7 +4666,7 @@
 \exindex{MPI\_Bcast}

 Broadcast 100 {\tt int}s from process {\tt 0} to every process in the
-group and performs some computation on independent data.
+group and perform some computation on independent data.
 \begin{verbatim}
 MPI_Comm comm;

@@ -4723,12 +4725,12 @@
 \mpiiidotiMergeFromTWOdotZERObegin% MPI-2.1 - take lines: MPI-2.0, Chap. 7, p.154 l.20-22 , File 2.0/collective-2.tex, lines 644-647
 \begchangefiniii
-\mpicppemptybind{MPI::Comm::Gather(const void*~sendbuf, int~sendcount, const MPI::Datatype\&~sendtype, void*~recvbuf, int~recvcount, const~MPI::Datatype\&~recvtype, int~root) const~=~0}{MPI::Request}
+\mpicppemptybind{MPI::Comm::Igather(const void*~sendbuf, int~sendcount, const MPI::Datatype\&~sendtype, void*~recvbuf, int~recvcount, const~MPI::Datatype\&~recvtype, int~root) const~=~0}{MPI::Request}
 \endchangefiniii
 \mpiiidotiMergeFromTWOdotZEROend% MPI-2.1 - end of take lines

 \mpiiidotiMergeFromONEdotTHREEbegin% MPI-2.1 - take lines: MPI-1.1, Chap. 4, p.95 l.46 - p.96 l.25, File 1.3/coll.tex, lines 246-293
-This operations starts a nonblocking gather. The memory movements after
+This operation starts a nonblocking gather. The data placements after
 the operation completes are identical to the blocking call
 \mpifunc{MPI\_GATHER}.

@@ -4922,7 +4924,7 @@
 \mpiiidotiMergeFromTWOdotZEROend% MPI-2.1 - end of take lines
 \mpiiidotiMergeFromONEdotTHREEbegin% MPI-2.1 - take lines: MPI-1.1, Chap. 4, p.109 l.25 - p.109 l.36, File 1.3/coll.tex, lines 1084-1112
-The data movement after \func{MPI\_IALLGATHER} an operation completes is
+The data movement after an \func{MPI\_IALLGATHER} operation completes is
 identical to \func{MPI\_ALLGATHER}.

 \begin{funcdef}{MPI\_IALLGATHERV( sendbuf, sendcount, sendtype, recvbuf,

@@ -5012,7 +5014,7 @@
 \mpiiidotiMergeFromONEdotTHREEbegin% MPI-2.1 - take lines: MPI-1.1, Chap. 4, p.111 l.31 - p.111 l.47, File 1.3/coll.tex, lines 1204-1233
-The data movement after \func{MPI\_IALLTOALL} an operation completes is
+The data movement after an \func{MPI\_IALLTOALL} operation completes is
 identical to \func{MPI\_ALLTOALL}.

@@ -5139,8 +5141,8 @@
 %The following function
 \mpifunc{MPI\_IALLTOALLW} is the nonblocking variant of
 \mpifunc{MPI\_ALLTOALLW}.
 It starts a nonblocking all-to-all operation
-which delivers the same results as \mpifunc{MPI\_ALLTOALLW} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_ALLTOALLW} after
+completion.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IALLTOALL END %%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IREDUCE START %%%%%%%%%%%%%%%%%%%%%%%

@@ -5183,9 +5185,10 @@
 \mpifunc{MPI\_IREDUCE} is the nonblocking variant of
 \mpifunc{MPI\_REDUCE}. It starts a nonblocking reduction operation
-which delivers the same results as \mpifunc{MPI\_REDUCE} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_REDUCE} after
+completion.

+% chsi: Should we really keep this advice also for the nonblocking version?
 \begin{implementors}
 It is strongly recommended that \mpifunc{MPI\_IREDUCE} be
 implemented so that the same result be obtained

@@ -5255,8 +5258,8 @@
 \mpifunc{MPI\_IALLREDUCE} is the nonblocking variant of
 \mpifunc{MPI\_ALLREDUCE}. It starts a nonblocking reduction-to-all operation
-which delivers the same results as \mpifunc{MPI\_ALLREDUCE} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_ALLREDUCE} after
+completion.

@@ -5304,8 +5307,8 @@
 \mpifunc{MPI\_IREDUCE\_SCATTER} is the nonblocking variant of
 \mpifunc{MPI\_REDUCE\_SCATTER}. It starts a nonblocking reduce-scatter operation
-which delivers the same results as \mpifunc{MPI\_REDUCE\_SCATTER} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_REDUCE\_SCATTER} after
+completion.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IREDUCE_SCATTER END %%%%%%%%%%%%%%%%%%%%%%%

@@ -5340,8 +5343,8 @@
 \mpifunc{MPI\_ISCAN} is the nonblocking variant of
 \mpifunc{MPI\_SCAN}. It starts a nonblocking scan operation
-which delivers the same results as \mpifunc{MPI\_SCAN} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_SCAN} after
+completion.

@@ -5380,8 +5383,8 @@
 \mpifunc{MPI\_IEXSCAN} is the nonblocking variant of
 \mpifunc{MPI\_EXSCAN}. It starts a nonblocking exclusive scan operation
-which delivers the same results as \mpifunc{MPI\_EXSCAN} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_EXSCAN} after
+completion.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IEXSCAN END %%%%%%%%%%%%%%%%%%%%%%%

@@ -5733,11 +5736,11 @@
   case 0:
     MPI_Ibarrier(comm, &req);
     MPI_Wait(&req, MPI_STATUS_IGNORE);
-    MPI_Send(buf1, count, type, 1, 0, comm);
+    MPI_Send(buf, count, dtype, 1, tag, comm);
     break;
   case 1:
     MPI_Ibarrier(comm, &req);
-    MPI_Recv(buf1, count, datatype, 0, 0, comm)
+    MPI_Recv(buf, count, dtype, 0, tag, comm, MPI_STATUS_IGNORE);
     MPI_Wait(&req, MPI_STATUS_IGNORE);
     break;
 }

@@ -5775,23 +5778,23 @@
 enable multiple completions. The following program is valid.
 \begin{verbatim}
-MPI_Request req[2];
+MPI_Request reqs[2];
 switch(rank) {
   case 0:
-    MPI_Ibarrier(comm, &req[0]);
-    MPI_Send(buf1, count, type, 1, 0, comm);
-    MPI_Wait(&req[0], MPI_STATUS_IGNORE);
+    MPI_Ibarrier(comm, &reqs[0]);
+    MPI_Send(buf, count, dtype, 1, tag, comm);
+    MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
     break;
   case 1:
-    MPI_Irecv(buf1, count, datatype, 0, 0, comm, &req[1])
-    MPI_Ibarrier(comm, &req[1]);
-    MPI_Waitall(2, &req[1], MPI_STATUSES_IGNORE);
+    MPI_Irecv(buf, count, dtype, 0, tag, comm, &reqs[0]);
+    MPI_Ibarrier(comm, &reqs[1]);
+    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
     break;
 }
 \end{verbatim}
-The Wait call returns only after the barrier and the receive completed.
+The Waitall call returns only after the barrier and the receive completed.
 }

@@ -5808,20 +5811,20 @@
 single communicator and match in order.
 \begin{verbatim}
-MPI_Request req[3];
+MPI_Request reqs[3];
 compute(buf1);
-MPI_Ibcast(buf1, count, type, 0, comm, &req[0]);
+MPI_Ibcast(buf1, count, dtype, 0, comm, &reqs[0]);
 compute(buf2);
-MPI_Ibcast(buf2, count, type, 0, comm, &req[1]);
+MPI_Ibcast(buf2, count, dtype, 0, comm, &reqs[1]);
 compute(buf3);
-MPI_Ibcast(buf3, count, type, 0, comm, &req[2]);
-MPI_Waitall(3, &req[0], MPI_STATUSES_IGNORE);
+MPI_Ibcast(buf3, count, dtype, 0, comm, &reqs[2]);
+MPI_Waitall(3, reqs, MPI_STATUSES_IGNORE);
 \end{verbatim}

 \begin{users}
 Pipelining and double-buffering techniques can efficiently be used to
-overlap computation and communication in SPMD style programs.
+overlap computation and communication.
 \end{users}

 \begin{implementors}

@@ -5851,24 +5854,22 @@
 collective operations can easily be used to achieve this task.
 \begin{verbatim}
-MPI_Request req[2];
+MPI_Request reqs[2];
 switch(rank) {
   case 0:
-    MPI_Iallreduce(sbuf1, rbuf1, count, type, MPI_SUM, comm1, &req[0]);
-    MPI_Iallreduce(sbuf3, rbuf3, count, type, MPI_SUM, comm3, &req[1]);
-    MPI_Waitall(2, &req[0], MPI_STATUSES_IGNORE);
+    MPI_Iallreduce(sbuf1, rbuf1, count, dtype, MPI_SUM, comm1, &reqs[0]);
+    MPI_Iallreduce(sbuf3, rbuf3, count, dtype, MPI_SUM, comm3, &reqs[1]);
     break;
   case 1:
-    MPI_Iallreduce(sbuf1, rbuf1, count, type, MPI_SUM, comm1, &req[0]);
-    MPI_Iallreduce(sbuf2, rbuf2, count, type, MPI_SUM, comm2, &req[1]);
-    MPI_Waitall(2, &req[0], MPI_STATUSES_IGNORE);
+    MPI_Iallreduce(sbuf1, rbuf1, count, dtype, MPI_SUM, comm1, &reqs[0]);
+    MPI_Iallreduce(sbuf2, rbuf2, count, dtype, MPI_SUM, comm2, &reqs[1]);
     break;
   case 2:
-    MPI_Iallreduce(sbuf2, rbuf2, count, type, MPI_SUM, comm2, &req[0]);
-    MPI_Iallreduce(sbuf3, rbuf3, count, type, MPI_SUM, comm3, &req[1]);
-    MPI_Waitall(2, &req[0], MPI_STATUSES_IGNORE);
+    MPI_Iallreduce(sbuf2, rbuf2, count, dtype, MPI_SUM, comm2, &reqs[0]);
+    MPI_Iallreduce(sbuf3, rbuf3, count, dtype, MPI_SUM, comm3, &reqs[1]);
     break;
 }
+MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
 \end{verbatim}
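
Not part of the patch, for illustration only: a self-contained C sketch of the last example above, showing one way the three overlapping communicators might be set up with MPI_Comm_split. The group layout (comm1 = {0,1}, comm2 = {1,2}, comm3 = {0,2} in MPI_COMM_WORLD ranks) and the buffer sizes are assumptions, not taken from the draft text; the key point is that the two operations per rank use different communicators and are both completed by a single MPI_Waitall, so their posting order need not be coordinated across ranks.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, n = 0;
    double sbuf[2][4] = {{1,2,3,4},{5,6,7,8}}, rbuf[2][4];
    MPI_Comm comm[3];                         /* comm1, comm2, comm3 */
    MPI_Request reqs[2];
    int members[3][2] = {{0,1},{1,2},{0,2}};  /* assumed pair layout */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Build the three pairwise communicators; processes outside a pair
       pass MPI_UNDEFINED and receive MPI_COMM_NULL. */
    for (i = 0; i < 3; i++) {
        int color = (rank == members[i][0] || rank == members[i][1])
                    ? 0 : MPI_UNDEFINED;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &comm[i]);
    }

    /* Start one nonblocking allreduce on each communicator this rank
       belongs to, then complete them all with a single MPI_Waitall. */
    for (i = 0; i < 3; i++) {
        if (comm[i] != MPI_COMM_NULL) {
            MPI_Iallreduce(sbuf[n], rbuf[n], 4, MPI_DOUBLE, MPI_SUM,
                           comm[i], &reqs[n]);
            n++;
        }
    }
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

    for (i = 0; i < 3; i++)
        if (comm[i] != MPI_COMM_NULL)
            MPI_Comm_free(&comm[i]);
    MPI_Finalize();
    return 0;
}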