[Mpi-comments] Collective operations and synchronization
Jeremiah Willcock
jewillco at osl.iu.edu
Sun Nov 25 15:38:54 CST 2012
These questions/comments relate to the final MPI 3.0 specification at
<URL:http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf>. All of
these comments relate to collectives on intracommunicators; collective
semantics on intercommunicators are very different, but similar issues are
likely to occur in that case as well.
It seems to be only weakly specified which synchronization behavior can be
inferred from the various collective operations. For example, is MPI_Reduce
guaranteed not to complete on the root until it has been entered on every
other process in the communicator? Although it would be nearly impossible to do
otherwise in a general-purpose implementation, I could imagine
compiler-based optimizations in which the compiler determines that certain
processes will contribute fixed values and thus does not send messages
from those processes. Line 8 of page 40 appears to prevent this type of
optimization (removing messages completely without coalescing them into
other messages) for point-to-point communication. Also, seemingly related
collectives have different synchronization behavior stated:
1. MPI_Gather is required to have the synchronization described above by
lines 11-19 of page 150, while MPI_Reduce is not required to have it.
2. MPI_Scatter is required to wait on every non-root process until the
root enters it (line 45 of page 159-line 3 of page 160), while the
specification of MPI_Bcast does not require this. Note that line 10 of
page 218 does not seem to apply to this case, since that text appears to
be about whether MPI_Bcast waits to complete on the root until the other
processes have reached it (the converse of what MPI_Scatter requires).
3. MPI_Allgather (lines 40-45 of page 165) and MPI_Alltoall (lines 42-48
of page 168) are required to act as barriers, while MPI_Allreduce is not.
MPI_Reduce_scatter_block has non-normative text (lines 11-15 of page 191)
stating that it is "equivalent" to MPI_Reduce + MPI_Scatter, which means
that the root must reach it before other processes complete it but does
not require a full barrier unless MPI_Reduce has stronger synchronization
behavior.
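To make the MPI_Reduce_scatter_block point concrete, here is a minimal
sketch (my own, not from the standard) of the "MPI_Reduce + MPI_Scatter"
expansion that the non-normative text suggests; the buffer names, the use
of root 0, MPI_DOUBLE, and MPI_SUM are illustrative assumptions. The
question is whether the real collective must match the synchronization
implied by this expansion:

```c
/* Sketch: the "MPI_Reduce + MPI_Scatter" expansion suggested by the
 * non-normative text for MPI_Reduce_scatter_block.  The parameter choices
 * (root 0, MPI_DOUBLE, MPI_SUM) are illustrative assumptions. */
#include <stdlib.h>
#include <mpi.h>

void reduce_scatter_block_as_if(const double *sendbuf, double *recvbuf,
                                int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Step 1: reduce all size*count contributed elements onto root 0. */
    double *tmp = (rank == 0)
                      ? malloc((size_t)size * count * sizeof *tmp)
                      : NULL;
    MPI_Reduce(sendbuf, tmp, size * count, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* Step 2: scatter one block of `count` elements back to each process.
     * Non-root processes cannot complete this until root 0 has entered it,
     * but nothing here forces a full barrier among the non-root
     * processes themselves. */
    MPI_Scatter(tmp, count, MPI_DOUBLE, recvbuf, count, MPI_DOUBLE, 0, comm);

    free(tmp);  /* free(NULL) is a no-op on non-root ranks */
}
```

Under this expansion, the synchronization of MPI_Reduce_scatter_block
would be bounded by whatever MPI_Reduce and MPI_Scatter individually
guarantee, which is exactly what the list above shows to be uneven.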
Thus, seemingly similar collectives appear to have different constraints
on synchronization. This matters for certain distributed algorithms that
use an operation such as MPI_Allreduce (or its non-blocking equivalent)
as both a reduction and a full barrier, a use that seems "obviously"
correct but does not appear to be guaranteed by the strict wording of the
standard.
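For concreteness, here is a hedged sketch of the kind of algorithm I mean:
an iterative loop that uses MPI_Allreduce both to combine a local
convergence flag and, implicitly, as a barrier separating iterations. The
function compute_local_step and the convergence threshold are hypothetical
placeholders:

```c
/* Sketch: using MPI_Allreduce as both a reduction and an implicit full
 * barrier between iterations.  compute_local_step() and the 1e-9
 * threshold are hypothetical placeholders, not part of any real API. */
#include <mpi.h>

extern double compute_local_step(int iter);  /* hypothetical local work */

void iterate_until_converged(MPI_Comm comm)
{
    int global_done = 0;
    for (int iter = 0; !global_done; ++iter) {
        double residual = compute_local_step(iter);
        int local_done = (residual < 1e-9);

        /* All processes agree on termination.  The algorithm also assumes
         * that no process starts iteration iter+1 before every process has
         * finished iteration iter -- i.e. that this allreduce acts as a
         * full barrier, which the standard's wording for MPI_Allreduce
         * does not clearly require. */
        MPI_Allreduce(&local_done, &global_done, 1, MPI_INT, MPI_LAND, comm);
    }
}
```

If MPI_Allreduce is not required to act as a barrier, the comment in the
loop body is an unstated assumption rather than a guarantee.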
Am I understanding the wording correctly? Are the descriptions above the
intended behavior for those collectives that do have stated
synchronization requirements? Should the others be strengthened, perhaps
by specifying "as if" versions of their "obvious" implementations?
-- Jeremiah Willcock