Submitted by: Christian Siebert
Date: 2008-11-25
Initial Version: 11-14-2008
Description: Proposal for several minor textual corrections/improvements to the draft for the nonblocking collectives chapter.

--- coll.tex 2008-11-25 06:10:29.000000000 +0100
+++ coll_patched.tex 2008-11-25 18:13:41.000000000 +0100
@@ -4421,8 +4421,8 @@
 leads to better performance (i.e, avoids context switching and scheduler
 overheads and thread management~\cite{hoefler-ib-threads}) is the use of
 nonblocking collective communication. The model is similar to
 point-to-point communications. A nonblocking
-start call is used to start a collective communication. A separate
-complete call is needed to complete the communication. As in the
+start call is used to initiate a collective communication
+which is eventually completed by a separate call. As in the
 nonblocking point-to-point case, the communication can progress
 independently of the computations at all participating processes.
 Nonblocking collective communication can also be used to mitigate
@@ -4432,16 +4432,18 @@
 As in the point-to-point case, all start calls are local and return
 immediately, irrespective of the status of other processes. Multiple
 nonblocking collective communications can be outstanding on a single
-communicator. If the call causes some system resource to be exhausted,
-then it will fail and return an error code. Quality implementations of
-MPI should ensure that this happens only in ``pathological'' cases. That
-is, an MPI implementation should be able to support a large number of
+communicator. %If the call causes some system resource to be exhausted,
+%then it will fail and return an error code. Quality implementations of
+%MPI should ensure that this happens only in ``pathological'' cases. That is,
+%% chsi: Although the above two sentences are consistent with MPI-2.1
+%% p 48, l 17, their content is almost zero (see p 264, l 23).
+An MPI implementation should be able to support a large number of
 pending nonblocking operations.

 A nonblocking collective call indicates that the system may start
-copying data out of the send buffer and into the receive buffer. The
-buffers should not be accessed after a nonblocking collective operation
-is called, until it completed.
+copying data out of the send buffer and into the receive buffer. All
+associated buffers should not be accessed between the initiation and the
+completion of a nonblocking collective operation.
 %
 Collective operations complete when the local part of the operation has
 been performed (i.e., the semantics are guaranteed) and all buffers can
@@ -4467,9 +4469,9 @@
 implementation and is consistent to blocking point-to-point operations.

 \begin{implementors}
-Nonblocking collective operations can be implemented with a local
+Nonblocking collective operations can be implemented with local
 execution schedules~\cite{hoefler-sc07} using normal point-to-point
-communication using a reserved tag-space.
+communication and a reserved tag-space.
 \end{implementors}

 % to stay close to the current MPI semantics for
@@ -4489,7 +4491,7 @@
 \begin{rationale}
 Matching blocking and nonblocking collectives is not allowed because the
-implementation might choose different communication algorithms for both.
+implementation might choose different communication algorithms.
 Blocking collectives only need to be optimized for latency while
 nonblocking collectives have to find an equilibrium between latency,
 CPU overhead and asynchronous progression.
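
Not part of the patch, for illustration only: a minimal C sketch of the initiation/completion pattern and the buffer-access rule discussed in the hunks above, written against the proposed MPI_Iallreduce interface. The helper compute_independent_data() is a placeholder name, not something defined in the draft.

/* Sketch: neither sendbuf nor recvbuf is touched between the start call
   and MPI_Wait; overlap happens in compute_independent_data(). */
#include <mpi.h>

void compute_independent_data(void);   /* placeholder for unrelated work */

void reduce_with_overlap(MPI_Comm comm)
{
    double sendbuf[100], recvbuf[100];
    MPI_Request req;

    /* ... fill sendbuf ... */
    MPI_Iallreduce(sendbuf, recvbuf, 100, MPI_DOUBLE, MPI_SUM, comm, &req);
    compute_independent_data();         /* buffers are not accessed here */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* only now may recvbuf be read */
}
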
@@ -4555,10 +4557,10 @@

 \begin{users}
-A nonblocking barrier might sound like an oxymoron, however, there are codes
-that may move independent computations between the \mpifunc{MPI\_IBARRIER} and
-the subsequent \mpifunc{MPI\_$\{$WAIT,TEST$\}$} call to overlap the barrier
-latency to shorten possible waiting times. The semantic properties are also
+A nonblocking barrier might sound like an oxymoron, however, moving
+independent computations between the \mpifunc{MPI\_IBARRIER} and
+the subsequent completion call can overlap the barrier latency and
+therefore shorten possible waiting times. The semantic properties are also
 useful when mixing collectives and point-to-point messages.
 \end{users}

@@ -4614,10 +4616,10 @@
 and
 \mpiiidotiMergeNEWforSINGLEendI% MPI-2.1 round-two - end of modification
 \mpiarg{root}.
-On completion, the
+After completion, the
 \mpiiidotiMergeFinalREVIEWbegin{52.b}% MPI-2.1 Correction due to Reviews at MPI-2.1 Forum meeting April 26-28, 2008
 % contents of \mpiarg{root}'s communication buffer has been copied to all processes.
-content of \mpiarg{root}'s buffer is copied to all other processes.
+content of \mpiarg{root}'s buffer has been copied to all other processes.
 \mpiiidotiMergeFinalREVIEWendI{52.b}% MPI-2.1 End of correction
 %General, derived datatypes are allowed for \mpiarg{datatype}.

@@ -4664,7 +4666,7 @@
 \exindex{MPI\_Bcast}

 Broadcast 100 {\tt int}s from process {\tt 0} to every process in the
-group and performs some computation on independent data.
+group and perform some computation on independent data.
 \begin{verbatim}
 MPI_Comm comm;

@@ -4723,12 +4725,12 @@
 \mpiiidotiMergeFromTWOdotZERObegin% MPI-2.1 - take lines: MPI-2.0, Chap. 7, p.154 l.20-22 , File 2.0/collective-2.tex, lines 644-647
 \begchangefiniii
-\mpicppemptybind{MPI::Comm::Gather(const void*~sendbuf, int~sendcount, const MPI::Datatype\&~sendtype, void*~recvbuf, int~recvcount, const~MPI::Datatype\&~recvtype, int~root) const~=~0}{MPI::Request}
+\mpicppemptybind{MPI::Comm::Igather(const void*~sendbuf, int~sendcount, const MPI::Datatype\&~sendtype, void*~recvbuf, int~recvcount, const~MPI::Datatype\&~recvtype, int~root) const~=~0}{MPI::Request}
 \endchangefiniii
 \mpiiidotiMergeFromTWOdotZEROend% MPI-2.1 - end of take lines

 \mpiiidotiMergeFromONEdotTHREEbegin% MPI-2.1 - take lines: MPI-1.1, Chap. 4, p.95 l.46 - p.96 l.25, File 1.3/coll.tex, lines 246-293
-This operations starts a nonblocking gather. The memory movements after
+This operation starts a nonblocking gather. The data placements after
 the operation completes are identical to the blocking call
 \mpifunc{MPI\_GATHER}.

@@ -4922,7 +4924,7 @@
 \mpiiidotiMergeFromTWOdotZEROend% MPI-2.1 - end of take lines
 \mpiiidotiMergeFromONEdotTHREEbegin% MPI-2.1 - take lines: MPI-1.1, Chap. 4, p.109 l.25 - p.109 l.36, File 1.3/coll.tex, lines 1084-1112
-The data movement after \func{MPI\_IALLGATHER} an operation completes is
+The data movement after an \func{MPI\_IALLGATHER} operation completes is
 identical to \func{MPI\_ALLGATHER}.

 \begin{funcdef}{MPI\_IALLGATHERV( sendbuf, sendcount, sendtype, recvbuf,

@@ -5012,7 +5014,7 @@
 \mpiiidotiMergeFromONEdotTHREEbegin% MPI-2.1 - take lines: MPI-1.1, Chap. 4, p.111 l.31 - p.111 l.47, File 1.3/coll.tex, lines 1204-1233
-The data movement after \func{MPI\_IALLTOALL} an operation completes is
+The data movement after an \func{MPI\_IALLTOALL} operation completes is
 identical to \func{MPI\_ALLTOALL}.

@@ -5139,8 +5141,8 @@
 %The following function
 \mpifunc{MPI\_IALLTOALLW} is the nonblocking variant of
 \mpifunc{MPI\_ALLTOALLW}.
 It starts a nonblocking all-to-all operation
-which delivers the same results as \mpifunc{MPI\_ALLTOALLW} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_ALLTOALLW} after
+completion.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IALLTOALL END %%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IREDUCE START %%%%%%%%%%%%%%%%%%%%%%%

@@ -5183,9 +5185,10 @@
 \mpifunc{MPI\_IREDUCE} is the nonblocking variant of
 \mpifunc{MPI\_REDUCE}. It starts a nonblocking reduction operation
-which delivers the same results as \mpifunc{MPI\_REDUCE} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_REDUCE} after
+completion.

+% chsi: Should we really keep this advice also for the nonblocking version?
 \begin{implementors}
 It is strongly recommended that \mpifunc{MPI\_IREDUCE} be
 implemented so that the same result be obtained

@@ -5255,8 +5258,8 @@
 \mpifunc{MPI\_IALLREDUCE} is the nonblocking variant of
 \mpifunc{MPI\_ALLREDUCE}. It starts a nonblocking reduction-to-all operation
-which delivers the same results as \mpifunc{MPI\_ALLREDUCE} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_ALLREDUCE} after
+completion.

@@ -5304,8 +5307,8 @@
 \mpifunc{MPI\_IREDUCE\_SCATTER} is the nonblocking variant of
 \mpifunc{MPI\_REDUCE\_SCATTER}. It starts a nonblocking reduce-scatter operation
-which delivers the same results as \mpifunc{MPI\_REDUCE\_SCATTER} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_REDUCE\_SCATTER} after
+completion.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IREDUCE_SCATTER END %%%%%%%%%%%%%%%%%%%%%%%

@@ -5340,8 +5343,8 @@
 \mpifunc{MPI\_ISCAN} is the nonblocking variant of
 \mpifunc{MPI\_SCAN}. It starts a nonblocking scan operation
-which delivers the same results as \mpifunc{MPI\_SCAN} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_SCAN} after
+completion.

@@ -5380,8 +5383,8 @@
 \mpifunc{MPI\_IEXSCAN} is the nonblocking variant of
 \mpifunc{MPI\_EXSCAN}. It starts a nonblocking exclusive scan operation
-which delivers the same results as \mpifunc{MPI\_EXSCAN} after it
-completed.
+which delivers the same results as \mpifunc{MPI\_EXSCAN} after
+completion.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% IEXSCAN END %%%%%%%%%%%%%%%%%%%%%%%

@@ -5733,11 +5736,11 @@
   case 0:
     MPI_Ibarrier(comm, &req);
     MPI_Wait(&req, MPI_STATUS_IGNORE);
-    MPI_Send(buf1, count, type, 1, 0, comm);
+    MPI_Send(buf, count, dtype, 1, tag, comm);
     break;
   case 1:
     MPI_Ibarrier(comm, &req);
-    MPI_Recv(buf1, count, datatype, 0, 0, comm)
+    MPI_Recv(buf, count, dtype, 0, tag, comm, MPI_STATUS_IGNORE);
     MPI_Wait(&req, MPI_STATUS_IGNORE);
     break;
 }

@@ -5775,23 +5778,23 @@
 enable multiple completions. The following program is valid.
 \begin{verbatim}
-MPI_Request req[2];
+MPI_Request reqs[2];
 switch(rank) {
   case 0:
-    MPI_Ibarrier(comm, &req[0]);
-    MPI_Send(buf1, count, type, 1, 0, comm);
-    MPI_Wait(&req[0], MPI_STATUS_IGNORE);
+    MPI_Ibarrier(comm, &reqs[0]);
+    MPI_Send(buf, count, dtype, 1, tag, comm);
+    MPI_Wait(&reqs[0], MPI_STATUS_IGNORE);
     break;
   case 1:
-    MPI_Irecv(buf1, count, datatype, 0, 0, comm, &req[1])
-    MPI_Ibarrier(comm, &req[1]);
-    MPI_Waitall(2, &req[1], MPI_STATUSES_IGNORE);
+    MPI_Irecv(buf, count, dtype, 0, tag, comm, &reqs[0]);
+    MPI_Ibarrier(comm, &reqs[1]);
+    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
     break;
 }
 \end{verbatim}
-The Wait call returns only after the barrier and the receive completed.
+The Waitall call returns only after the barrier and the receive completed.
 }

@@ -5808,20 +5811,20 @@
 single communicator and match in order.
 \begin{verbatim}
-MPI_Request req[3];
+MPI_Request reqs[3];
 compute(buf1);
-MPI_Ibcast(buf1, count, type, 0, comm, &req[0]);
+MPI_Ibcast(buf1, count, dtype, 0, comm, &reqs[0]);
 compute(buf2);
-MPI_Ibcast(buf2, count, type, 0, comm, &req[1]);
+MPI_Ibcast(buf2, count, dtype, 0, comm, &reqs[1]);
 compute(buf3);
-MPI_Ibcast(buf3, count, type, 0, comm, &req[2]);
-MPI_Waitall(3, &req[0], MPI_STATUSES_IGNORE);
+MPI_Ibcast(buf3, count, dtype, 0, comm, &reqs[2]);
+MPI_Waitall(3, reqs, MPI_STATUSES_IGNORE);
 \end{verbatim}

 \begin{users}
 Pipelining and double-buffering techniques can efficiently be used to
-overlap computation and communication in SPMD style programs.
+overlap computation and communication.
 \end{users}

 \begin{implementors}

@@ -5851,24 +5854,22 @@
 collective operations can easily be used to achieve this task.
 \begin{verbatim}
-MPI_Request req[2];
+MPI_Request reqs[2];
 switch(rank) {
   case 0:
-    MPI_Iallreduce(sbuf1, rbuf1, count, type, MPI_SUM, comm1, &req[0]);
-    MPI_Iallreduce(sbuf3, rbuf3, count, type, MPI_SUM, comm3, &req[1]);
-    MPI_Waitall(2, &req[0], MPI_STATUSES_IGNORE);
+    MPI_Iallreduce(sbuf1, rbuf1, count, dtype, MPI_SUM, comm1, &reqs[0]);
+    MPI_Iallreduce(sbuf3, rbuf3, count, dtype, MPI_SUM, comm3, &reqs[1]);
     break;
   case 1:
-    MPI_Iallreduce(sbuf1, rbuf1, count, type, MPI_SUM, comm1, &req[0]);
-    MPI_Iallreduce(sbuf2, rbuf2, count, type, MPI_SUM, comm2, &req[1]);
-    MPI_Waitall(2, &req[0], MPI_STATUSES_IGNORE);
+    MPI_Iallreduce(sbuf1, rbuf1, count, dtype, MPI_SUM, comm1, &reqs[0]);
+    MPI_Iallreduce(sbuf2, rbuf2, count, dtype, MPI_SUM, comm2, &reqs[1]);
     break;
   case 2:
-    MPI_Iallreduce(sbuf2, rbuf2, count, type, MPI_SUM, comm2, &req[0]);
-    MPI_Iallreduce(sbuf3, rbuf3, count, type, MPI_SUM, comm3, &req[1]);
-    MPI_Waitall(2, &req[0], MPI_STATUSES_IGNORE);
+    MPI_Iallreduce(sbuf2, rbuf2, count, dtype, MPI_SUM, comm2, &reqs[0]);
+    MPI_Iallreduce(sbuf3, rbuf3, count, dtype, MPI_SUM, comm3, &reqs[1]);
     break;
 }
+MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
 \end{verbatim}
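
Not part of the patch, for illustration only: a self-contained C sketch of the last example above, showing one way the three overlapping communicators might be set up with MPI_Comm_split. The group layout (comm1 = {0,1}, comm2 = {1,2}, comm3 = {0,2} in MPI_COMM_WORLD ranks) and the buffer sizes are assumptions, not taken from the draft text; the key point is that the two operations per rank use different communicators and are both completed by a single MPI_Waitall, so their posting order need not be coordinated across ranks.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, n = 0;
    double sbuf[2][4] = {{1,2,3,4},{5,6,7,8}}, rbuf[2][4];
    MPI_Comm comm[3];                         /* comm1, comm2, comm3 */
    MPI_Request reqs[2];
    int members[3][2] = {{0,1},{1,2},{0,2}};  /* assumed pair layout */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Build the three pairwise communicators; processes outside a pair
       pass MPI_UNDEFINED and receive MPI_COMM_NULL. */
    for (i = 0; i < 3; i++) {
        int color = (rank == members[i][0] || rank == members[i][1])
                    ? 0 : MPI_UNDEFINED;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &comm[i]);
    }

    /* Start one nonblocking allreduce on each communicator this rank
       belongs to, then complete them all with a single MPI_Waitall. */
    for (i = 0; i < 3; i++) {
        if (comm[i] != MPI_COMM_NULL) {
            MPI_Iallreduce(sbuf[n], rbuf[n], 4, MPI_DOUBLE, MPI_SUM,
                           comm[i], &reqs[n]);
            n++;
        }
    }
    MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

    for (i = 0; i < 3; i++)
        if (comm[i] != MPI_COMM_NULL)
            MPI_Comm_free(&comm[i]);
    MPI_Finalize();
    return 0;
}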