[mpi3-coll] New nonblocking collective intro text
Adam Moody
moody20 at llnl.gov
Thu Jan 29 13:39:12 CST 2009
Here is the updated text after incorporating the fixes Bronis caught.
Thanks, Bronis.
-Adam
As described in Section ?? (Section 3.7), performance on many systems
can be improved by overlapping communication and computation.
Nonblocking collectives combine the potential benefits of nonblocking
point-to-point operations, to exploit overlap and to avoid
synchronization, with the optimized implementation and message scheduling
provided by collective operations [1,4]. One way of doing this would be
to perform a blocking collective operation in a separate thread. An
alternative mechanism that often leads to better performance (e.g.,
avoids context switching, scheduler overheads, and thread management) is
to use nonblocking collective communication [2].
The nonblocking collective communication model is similar to the model
used in nonblocking point-to-point communication. A nonblocking start
call initiates the collective operation, but does not complete it. A
separate completion call is needed to complete the operation. Once
initiated, the operation may progress independently of any computation
or other communication at participating processes. In this manner,
nonblocking collectives can mitigate synchronizing effects of collective
operations by running them in the "background." In addition to enabling
communication-computation overlap, nonblocking collectives can perform
collective operations on overlapping communicators that would lead to
deadlock with blocking operations. The semantic advantages of
nonblocking collectives can also be useful in combination with
point-to-point communication.
As in the nonblocking point-to-point case, all start calls are local and
return immediately irrespective of the status of other processes. The
start call initiates the operation, which indicates that the system may
start to copy data out of the send buffer and into the receive buffer.
Once initiated, all associated send buffers should not be modified and
all associated receive buffers should not be accessed until the
collective operation completes. The start call returns a request handle,
which must be passed to a completion call to complete the operation.
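As a sketch of this model (not part of the proposal text, and assuming an
MPI-3 style interface), the following program starts an MPI_IBCAST,
performs independent computation that does not touch the communication
buffer, and then completes the operation with MPI_WAIT:

```c
/* Sketch: overlapping a nonblocking broadcast with computation. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024] = {0.0};
    if (rank == 0)
        for (int i = 0; i < 1024; i++) buf[i] = (double)i;

    MPI_Request req;
    /* Start call: initiates the broadcast; buf must not be modified
     * (on the root) or accessed (on other ranks) until completion. */
    MPI_Ibcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* Computation that does not touch buf may proceed here. */
    double local = 0.0;
    for (int i = 0; i < 1000000; i++) local += 1.0 / (i + 1.0);

    /* Completion call: after MPI_Wait returns, buf may be used freely. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d: buf[1] = %f, local = %f\n", rank, buf[1], local);
    MPI_Finalize();
    return 0;
}
```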
All completion calls (e.g., MPI_WAIT) described in Section ?? (Section
3.7.3) are supported for nonblocking collective operations. Similarly to
the blocking case, collective operations are considered to be complete
when the local part of the operation is finished, i.e., the semantics of
the operation are guaranteed and all buffers can be safely accessed and
modified. Completion does not imply that other processes have completed
or even started the operation unless otherwise specified in or implied
by the description of the operation. Completion of a particular
nonblocking collective operation also does not imply completion of any
other posted nonblocking collective (or send-receive) operations,
whether they are posted before or after the completed operation.
Advice to users. Some implementations may have the effect of
synchronizing processes during the completion of a nonblocking
collective. A correct, portable program cannot rely on such
synchronization side-effects; however, it must be programmed so as to
allow for them. (End of advice to users.)
Upon returning from a completion call in which a nonblocking collective
completes, the MPI_ERROR field in the associated status object is set
appropriately to indicate any errors. The values of the MPI_SOURCE and
MPI_TAG fields are undefined. It is valid to mix different request types
(i.e., any combination of collective requests, I/O requests, generalized
requests, or point-to-point requests) in functions that enable multiple
completions (e.g., MPI_WAITALL). It is erroneous to call
MPI_REQUEST_FREE or MPI_CANCEL with a request for a nonblocking
collective operation.
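To illustrate the mixing of request types (a sketch, not part of the
proposal text), the following fragment completes one collective request
and two point-to-point requests in a single MPI_WAITALL:

```c
/* Sketch: one collective request and two point-to-point requests
 * completed together by MPI_Waitall. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int bval = (rank == 0) ? 42 : 0;
    int token = rank, recv_token;
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    MPI_Request reqs[3];
    MPI_Ibcast(&bval, 1, MPI_INT, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_token, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[2]);

    MPI_Status stats[3];
    /* Mixing collective and point-to-point requests is valid; note that
     * for the collective's status, MPI_SOURCE and MPI_TAG are undefined. */
    MPI_Waitall(3, reqs, stats);

    MPI_Finalize();
    return 0;
}
```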
Rationale. Freeing an active nonblocking collective request could cause
problems similar to those discussed for point-to-point requests (see
Section ?? (3.7.3)). Cancelling a request is not supported because the semantics
of this operation are not well-defined. (End of rationale.)
Multiple nonblocking collective operations can be outstanding on a
single communicator. If the nonblocking collective causes some system
resource to be exhausted, then it will fail and generate an MPI
exception. Quality implementations of MPI should ensure that this
happens only in pathological cases. That is, an MPI implementation
should be able to support a large number of pending nonblocking
collective operations.
Unlike point-to-point operations, nonblocking collective operations do
not match with blocking collectives, and collective operations do not
have a tag argument. All processes must call collective operations
(blocking and nonblocking) in the same order per communicator. In
particular, once a process calls a collective operation, all other
processes in the communicator must eventually call the same collective
operation, and no other collective operation in between. This is
consistent with the ordering rules for blocking collective operations in
threaded environments.
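A sketch of the ordering rule on overlapping communicators (not part of
the proposal text; a duplicated communicator is used here as a simple
overlapping communicator): each process may interleave operations on
different communicators differently, as long as the per-communicator
order matches. Calling the blocking MPI_ALLREDUCE in these per-process
orders would deadlock; the nonblocking versions cannot.

```c
/* Sketch: per-communicator ordering with two overlapping communicators. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm comm2;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm2); /* second, overlapping communicator */

    int v1 = 1, v2 = 2, s1, s2;
    MPI_Request reqs[2];

    /* Each communicator sees exactly one collective, so the order per
     * communicator matches everywhere; only the cross-communicator
     * interleaving differs between even and odd ranks. */
    if (rank % 2 == 0) {
        MPI_Iallreduce(&v1, &s1, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &reqs[0]);
        MPI_Iallreduce(&v2, &s2, 1, MPI_INT, MPI_SUM, comm2, &reqs[1]);
    } else {
        MPI_Iallreduce(&v2, &s2, 1, MPI_INT, MPI_SUM, comm2, &reqs[1]);
        MPI_Iallreduce(&v1, &s1, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &reqs[0]);
    }
    /* Both operations progress in the "background"; neither blocks the other. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Comm_free(&comm2);
    MPI_Finalize();
    return 0;
}
```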
Rationale. Matching blocking and nonblocking collectives is not allowed
because an implementation might use different communication algorithms
for the two cases. Blocking collectives may be optimized for minimal
time to completion, while nonblocking collectives may balance time to
completion with CPU overhead and asynchronous progression.
The use of tags for collective operations can prevent certain hardware
optimizations. (End of rationale.)
Advice to users. If program semantics require matching blocking and
nonblocking collectives, then a nonblocking collective operation can be
initiated and immediately completed with a blocking wait to emulate
blocking behavior. (End of advice to users.)
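The emulation described in the advice above can be sketched as follows
(not part of the proposal text):

```c
/* Sketch: emulating a blocking allreduce by initiating the nonblocking
 * operation and completing it immediately. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int in = 1, out;
    MPI_Request req;

    /* Start and immediately complete: after MPI_Wait returns, the data
     * movement is the same as that of MPI_Allreduce. */
    MPI_Iallreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("sum = %d\n", out);
    MPI_Finalize();
    return 0;
}
```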
In terms of data movements, each nonblocking collective operation has
the same effect as its blocking counterpart for intracommunicators and
intercommunicators after completion. The use of the “in place” option is
allowed exactly as described for the corresponding blocking collective
operations. Likewise, upon completion, nonblocking collective reduction
operations have the same effect as their blocking counterparts, and the
same restrictions and recommendations on reduction orders apply.
Progression rules for nonblocking collectives are similar to those for
nonblocking point-to-point operations; refer to Section ?? (Section
3.7.4).
Advice to implementors. Nonblocking collective operations can be
implemented with local execution schedules [3] using nonblocking
point-to-point communication and a reserved tag-space. (End of advice to
implementors.)