[mpi3-coll] New nonblocking collective intro text
moody20 at llnl.gov
Wed Jan 28 14:18:03 CST 2009
Since this change was to smooth the flow, it's easiest to present the
new text in complete form (so it can be read in continuous flow). The
content is practically unchanged -- only the presentation is rearranged.
I've tried to incorporate everyone's suggestions from yesterday's
telecon, but it's possible that I may have missed some. Could you please
review this new text to check that I have included your suggestions?
(*) I modified Jesper's first line on page 49, line 26 from: "As
described in Section ?? (Section 3.7), the performance of many
applications can be improved by overlapping communication and
computation, and many systems enable this." to: "As described in Section
?? (Section 3.7), performance on many systems can be improved by
overlapping communication and computation." This better matches the
point-to-point text (in 2.1 document, see p 47, line 42).
(*) I added an advice to users section regarding synchronization
side-effects (derived from collectives intro in 2.1 doc on p 131, lines
(*) In the current document, on page 50, line 36 there is a broken
sentence, which was apparently introduced after the call yesterday. I
left this change out, since I wasn't sure of the exact intent. My
proposed text is based on the previous version.
If I left out other changes, please let me know. Also, I can try to post
more detailed changes later, but it'll take more effort and I wanted to
get this out there to give folks a head start. Most of the changes are
obvious when you read this text side-by-side with the current text.
Could a few folks please review this carefully and compare it to the
My proposal is to replace lines from p 49, line 26 to p 51, line 15 with
As described in Section ?? (Section 3.7), performance on many systems can be improved by overlapping communication and computation. Nonblocking collectives combine the potential benefits of nonblocking point-to-point operations to exploit overlap and to avoid synchronization with the optimized implementation and message scheduling provided by collective operations [1,4]. One way of doing this would be to perform a blocking collective operation in a separate thread. An alternative mechanism that often leads to better performance (i.e., avoids context switching, scheduler overheads, and thread management) is to use nonblocking collective communication.
The nonblocking collective communication model is similar to the model used in nonblocking point-to-point communication. A nonblocking start call initiates the collective operation, but does not complete it. A separate complete call is needed to complete the operation. Once initiated, the operation may progress independently of any computation or other communication at participating processes. In this manner, nonblocking collectives can mitigate synchronizing effects of collective operations by running them in the "background." In addition to enabling communication-computation overlap, nonblocking collectives can perform collective operations on overlapping communicators that would lead to deadlock with blocking operations. The semantic advantages of nonblocking collectives can also be useful in combination with point-to-point communication.
As in the nonblocking point-to-point case, all start calls are local and return immediately irrespective of the status of other processes. The start call initiates the operation, which indicates that the system may start to copy data out of the send buffer and into the receive buffer. Once initiated, all associated send buffers should not be modified and all associated receive buffers should not be accessed until the collective operation completes. The start call returns a request handle which must be passed to a complete call to complete the operation.
All completion calls (e.g., MPI_WAIT) described in Section ?? (Section 3.7.3) are supported for nonblocking collective operations. Similarly to the blocking case, collective operations are considered to be complete when the local part of the operation is finished, i.e., the semantics of the operation are guaranteed and all buffers can be safely accessed and modified. Completion does not imply that other processes have completed or even started the operation unless otherwise specified in or implied by the description of the operation.
Advice to users:
Some implementations may have the effect of synchronizing processes during the completion of a nonblocking collective. A correct, portable program cannot rely on such synchronization side-effects; however, one must program so as to allow them. (End of advice to users.)
Upon returning from a completion call in which a nonblocking collective completes, the MPI_ERROR field in the associated status object is set appropriately to indicate any errors. The values of the MPI_SOURCE and MPI_TAG fields are undefined. It is valid to mix different request types (i.e., any combination of collective requests, I/O requests, generalized requests, or point-to-point requests) in functions that enable multiple completions (e.g., MPI_WAITALL). It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL with a request for a nonblocking collective operation.
Rationale. Freeing an active nonblocking collective request could cause problems similar to those discussed for point-to-point requests (see Section ?? (3.7.3)). Cancelling a request is not supported because the semantics of this operation are not well-defined. (End of rationale.)
Multiple nonblocking collective operations can be outstanding on a single communicator. If the nonblocking collective causes some system resource to be exhausted, then it will fail and generate an MPI exception. Quality implementations of MPI should ensure that this happens only in pathological cases. That is, an MPI implementation should be able to support a large number of pending nonblocking collective operations.
Unlike point-to-point operations, nonblocking collective operations do not match with blocking collectives, and collective operations do not have a tag argument. All processes must call collective operations (blocking and nonblocking) in the same order per communicator. In particular, once a process calls a collective operation, all other processes in the communicator must eventually call the same collective operation, and no other collective operation in between. This is consistent with the ordering rules for blocking collective operations in threaded environments.
Rationale. Matching blocking and nonblocking collectives is not allowed because an implementation might use different communication algorithms for the two cases. Blocking collectives may be optimized for minimal time to completion, while nonblocking collectives may balance time to completion with CPU overhead and asynchronous progression.
The use of tags for collective operations can prevent certain hardware optimizations. (End of rationale.)
Advice to users:
If program semantics require matching blocking and nonblocking collectives, then a nonblocking collective operation can be initiated and immediately completed with a blocking wait to emulate blocking behavior. (End of advice to users.)
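To make the ordering and matching rule concrete, here is a minimal sketch in plain C (no MPI calls) that models each process's sequence of collective calls on one communicator and reports the first position where two processes disagree. The names coll_op, coll_call, and first_mismatch are hypothetical and exist only for this illustration. Following the text above, a blocking call is treated as not matching its nonblocking counterpart, and a process that never issues a call the other one made is also reported as a mismatch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a collective call on one communicator. */
typedef enum { COLL_BARRIER, COLL_BCAST, COLL_ALLREDUCE } coll_op;

typedef struct {
    coll_op op;       /* which collective was called */
    int nonblocking;  /* 1 for the nonblocking variant, 0 for blocking */
} coll_call;

/*
 * Compare the call sequences of two processes on the same communicator.
 * Returns -1 if the sequences are consistent, otherwise the index of the
 * first call that does not match (different operation, or blocking paired
 * with nonblocking, or a call missing on one side).
 */
long first_mismatch(const coll_call *a, size_t na,
                    const coll_call *b, size_t nb)
{
    assert(a != NULL || na == 0);
    assert(b != NULL || nb == 0);

    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (a[i].op != b[i].op || a[i].nonblocking != b[i].nonblocking)
            return (long)i;
    }
    /* Equal prefixes: consistent only if both made the same number of calls. */
    return na == nb ? -1 : (long)n;
}
```

In this model, a process that calls a nonblocking broadcast where another calls the blocking broadcast is flagged at that position, which is the case the advice to users above works around by pairing the nonblocking start with an immediate wait on every process.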
In terms of data movements, each nonblocking collective operation has the same effect as its blocking counterpart for intracommunicators and intercommunicators after completion. The use of the “in place” option is allowed exactly as described for the corresponding blocking collective operations. Likewise, upon completion, the nonblocking collective reduction operations have the same effect as their blocking counterparts, and the same restrictions and recommendations on reduction orders apply.
Progression rules for nonblocking collectives are similar to progression of nonblocking point-to-point operations; refer to Section ?? (Section 3.7.4).
Advice to implementors. Nonblocking collective operations can be implemented with local execution schedules using nonblocking point-to-point communication and a reserved tag-space. (End of advice to implementors.)
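As one illustration of what a local execution schedule might look like, the sketch below builds, for a given rank, the list of point-to-point steps for a broadcast rooted at rank 0 over a binomial tree. The choice of a binomial tree, the fixed root, and the names step, step_kind, and build_bcast_schedule are assumptions made for this example and are not part of the proposed text. An implementation along these lines would then issue each step as a nonblocking point-to-point operation using a tag from its reserved tag-space and advance the schedule as steps complete.

```c
#include <assert.h>
#include <stddef.h>

/* One step of a local schedule: receive from, or send to, a peer rank. */
typedef enum { STEP_RECV, STEP_SEND } step_kind;

typedef struct {
    step_kind kind;
    int peer;
} step;

/*
 * Build the local schedule of `rank` for a binomial-tree broadcast rooted
 * at rank 0 among `size` processes. Steps are written to `out` in the order
 * they should be executed: at most one receive (from the parent), followed
 * by sends to the children. Returns the number of steps, or -1 on invalid
 * arguments or if `out` is too small.
 */
int build_bcast_schedule(int rank, int size, step *out, int max_steps)
{
    if (size <= 0 || rank < 0 || rank >= size || out == NULL || max_steps < 0)
        return -1;

    int n = 0;
    int mask = 1;

    /* The lowest set bit of rank identifies the parent; rank 0 has none. */
    while (mask < size) {
        if (rank & mask) {
            if (n >= max_steps)
                return -1;
            out[n].kind = STEP_RECV;
            out[n].peer = rank - mask;
            n++;
            break;
        }
        mask <<= 1;
    }

    /* Children are rank + mask for every smaller power of two in range. */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size) {
            if (n >= max_steps)
                return -1;
            out[n].kind = STEP_SEND;
            out[n].peer = rank + mask;
            n++;
        }
        mask >>= 1;
    }

    assert(n <= max_steps);
    return n;
}
```

With 8 processes, rank 0 gets three send steps (to ranks 4, 2, and 1), while rank 6 gets a receive from rank 4 followed by a send to rank 7.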