[mpi3-coll] New nonblocking collective intro text

Wed Jan 28 17:09:06 CST 2009

Adam:

Looks good generally. I suggest a few changes below.

Bronis

On Wed, 28 Jan 2009, Adam Moody wrote:

> Since this change was to smooth the flow, it's easiest to present the
> new text in complete form (so it can be read in continuous flow). The
> content is practically unchanged -- only the presentation is rearranged.
> I've tried to incorporate everyone's suggestions from yesterday's
> telecon, but it's possible that I may have missed some. Could you please
> review this new text to check that I have included your suggestions?
>
> In particular:
>
> (*) I modified Jesper's first line on page 49, line 26 from: "As
> described in Section ?? (Section 3.7), the performance of many
> applications can be improved by overlapping communication and
> computation, and many systems enable this." to: "As described in Section
> ?? (Section 3.7), performance on many systems can be improved by
> overlapping communication and computation." This better matches the
> point-to-point text (in 2.1 document, see p 47, line 42).
>
> (*) I added an advice to users section regarding synchronization
> side-effects (derived from collectives intro in 2.1 doc on p 131, lines
> 35-43).
>
> (*) In the current document, on page 50, line 36 there is a broken
> sentence, which was apparently introduced after the call yesterday. I
> left this change out, since I wasn't sure of the exact intent. My
> proposed text is based on the previous version.
>
> If I left out other changes, please let me know. Also, I can try to post
> more detailed changes later, but it'll take more effort and I wanted to
> get this out there to give folks a head start. Most of the changes are
> obvious when you read this text side-by-side with the current text.
> Could a few folks please review this carefully and compare it to the
> current text?
> Thanks,
> -Adam
>
>
> ---------------------------------
> My proposal is to replace lines from p 49, line 26 to p 51, line 15 with
> the following:
>
> ---------------------------------
>
>
> As described in Section ?? (Section 3.7), performance on many systems
> can be improved by overlapping communication and computation.
> Nonblocking collectives combine the potential benefits of nonblocking
> point-to-point operations to exploit overlap and to avoid
> synchronization with the optimized implementation and message scheduling
> provided by collective operations [1,4].  One way of doing this would be
> to perform a blocking collective operation in a separate thread. An
> alternative mechanism that often leads to better performance (i.e.,
> avoids context switching, scheduler overheads, and thread management
> [2]) is to use nonblocking collective communication.
>
> The nonblocking collective communication model is similar to the model
> used in nonblocking point-to-point communication.  A nonblocking start
> call initiates the collective operation, but does not complete it.  A
> separate complete call is needed to complete the operation.  Once

The above should use "completion"

separate completion call is needed to complete the operation.  Once

> initiated, the operation may progress independently of any computation
> or other communication at participating processes.  In this manner,
> nonblocking collectives can mitigate synchronizing effects of collective
> operations by running them in the "background."  In addition to enabling
> communication-computation overlap, nonblocking collectives can perform
> collective operations on overlapping communicators that would lead to
> deadlock with blocking operations. The semantic advantages of
> nonblocking collectives can also be useful in combination with
> point-to-point communication.
>
> As in the nonblocking point-to-point case, all start calls are local
> and return immediately irrespective of the status of other processes.
> The start call initiates the operation, which indicates that the system
> may start to copy data out of the send buffer and into the receive
> buffer.  Once initiated, all associated send buffers should not be
> modified and all associated receive buffers should not be accessed until
> the collective operation completes.  The start call returns a request
> handle which must be passed to a complete call to complete the

Hmm. I don't know whether to treat the clause as restrictive.
I think it is not, so "which" is OK, but you need a comma:

handle, which must be passed to a completion call to complete the

> operation.
>
> All completion calls (e.g., MPI_WAIT) described in Section ?? (Section
> 3.7.3) are supported for nonblocking collective operations.  Similarly
> to the blocking case, collective operations are considered to be
> complete when the local part of the operation is finished, i.e., the
> semantics of the operation are guaranteed and all buffers can be safely
> accessed and modified.  Completion does not imply that other processes
> have completed or even started the operation unless otherwise specified
> in or implied by the description of the operation.

A new sentence that fits with the above has been lost (I
think it was Jesper's addition). Here is the current text
of it in the rev 2 PDF:

Completion of a particular nonblocking collective operations does
not imply completion of any other posted nonblocking collective (or
send-receive) operations, whether they are posted before or after the
completed operation.

Now that I put it here I notice a small error (the unneeded 's'
on the first "operations"). I think an "also" would be good in
the new context. So, put the following at the end of the above
paragraph:

Completion of a particular nonblocking collective operation also does
not imply completion of any other posted nonblocking collective (or
send-receive) operations, whether they are posted before or after the
completed operation.

> Advice to users:
> Some implementations may have the effect of synchronizing processes
> during the completion of a nonblocking collective.  A correct, portable
> program can not rely on such synchronization side-effects, however, one

"cannot" is one word:

program cannot rely on such synchronization side-effects, however, one

> must program so as to allow them.  (End of advice to users.)
>
> Upon returning from a completion call in which a nonblocking collective
> completes, the MPI_ERROR field in the associated status object is set
> appropriately to indicate any errors.  The values of the MPI_SOURCE and
> MPI_TAG fields are undefined.  It is valid to mix different request
> types (i.e., any combination of collective requests, I/O requests,
> generalized requests, or point-to-point requests) in functions that
> enable multiple completions (e.g., MPI_WAITALL).  It is erroneous to
> call MPI_REQUEST_FREE or MPI_CANCEL with a request for a nonblocking
> collective operation.
>
> Rationale.  Freeing an active nonblocking collective request could
> cause problems similar to those discussed for point-to-point requests (see
> Section ?? (3.7.3)).  Cancelling a request is not supported because the
> semantics of this operation are not well-defined.  (End of rationale.)
>
> Multiple nonblocking collective operations can be outstanding on a
> single communicator.  If the nonblocking collective causes some system
> resource to be exhausted, then it will fail and generate an MPI
> exception.  Quality implementations of MPI should ensure that this
> happens only in pathological cases.  That is, an MPI implementation
> should be able to support a large number of pending nonblocking
> collective operations.
>
> Unlike point-to-point operations, nonblocking collective operations do
> not match with blocking collectives, and collective operations do not
> have a tag argument.  All processes must call collective operations
> (blocking and nonblocking) in the same order per communicator.  In
> particular, once a process calls a collective operation, all other
> processes in the communicator must eventually call the same collective
> operation, and no other collective operation in between.  This is
> consistent with the ordering rules for blocking collective operations in
> threaded environments.
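
To make the ordering rule above concrete, here is a minimal sketch in C of
a checker that compares the sequence of collective calls issued by each
process on one communicator. It is not part of the proposed text and uses
no MPI calls. The names coll_kind, coll_call, and first_mismatch are
hypothetical, and the sketch assumes every process issues the same number
of calls. A blocking call and a nonblocking call of the same kind are
treated as a mismatch, as the paragraph requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for this sketch; they are not MPI identifiers. */
typedef enum { OP_BARRIER, OP_BCAST, OP_ALLREDUCE } coll_kind;

typedef struct {
    coll_kind kind;   /* which collective was called */
    bool nonblocking; /* true for the nonblocking variant */
} coll_call;

/*
 * calls[r * ncalls + i] is the i-th collective issued by process r on a
 * single communicator. Returns the index of the first call position at
 * which some process disagrees with process 0, or -1 if all sequences
 * match in both kind and blocking/nonblocking flavor.
 */
int first_mismatch(const coll_call *calls, size_t nprocs, size_t ncalls)
{
    assert(calls != NULL);
    for (size_t i = 0; i < ncalls; i++) {
        const coll_call ref = calls[i]; /* process 0 is the reference */
        for (size_t r = 1; r < nprocs; r++) {
            const coll_call c = calls[r * ncalls + i];
            if (c.kind != ref.kind || c.nonblocking != ref.nonblocking)
                return (int)i;
        }
    }
    return -1;
}
```

For example, two processes that each issue a nonblocking broadcast followed
by a blocking barrier pass the check, while a nonblocking broadcast on one
process paired with a blocking broadcast on the other is reported at
position 0.
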
>
> Rationale:
> Matching blocking and nonblocking collectives is not allowed because an
> implementation might use different communication algorithms
> for the two cases.  Blocking collectives may be optimized for minimal
> time to completion, while nonblocking collectives may balance time to
> completion with CPU overhead and asynchronous progression.
>
> The use of tags for collective operations can prevent certain hardware
> optimizations.  (End of rationale.)
>
> Advice to users:
> If program semantics require matching blocking and nonblocking
> collectives, then a nonblocking collective operation can be initiated
> and immediately completed with a blocking wait to emulate blocking
> behavior.  (End of advice to users.)
>
> In terms of data movements, each nonblocking collective operation has
> the same effect as its blocking counterpart for intracommunicators and
> intercommunicators after completion.  The use of the "in place" option
> is allowed exactly as described for the corresponding blocking
> collective operations.  Likewise, upon completion, the nonblocking
> collective reduction operations have the same effect as their blocking
> counterparts, and the same restrictions and recommendations on reduction
> orders apply.
>
> Progression rules for nonblocking collectives are similar to
> progression of nonblocking point-to-point operations; refer to Section
> ?? (Section 3.7.4).
>
> Advice to implementors. Nonblocking collective operations can be
> implemented with local execution schedules [3] using nonblocking
> point-to-point communication and a reserved tag-space. (End of advice to
> implementors.)
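
As an illustration of what a "local execution schedule" could look like,
here is a minimal sketch in C that builds the per-process list of
point-to-point steps for a broadcast. The binomial-tree algorithm is an
assumption made for this example; the proposed text and reference [3] do
not prescribe it. The names sched_step, MAX_STEPS, and
build_bcast_schedule are hypothetical. An implementation would presumably
execute such a list by posting nonblocking receives and sends with a
reserved tag and advancing through it as they complete.

```c
#include <assert.h>

typedef enum { STEP_RECV, STEP_SEND } step_kind;

typedef struct {
    step_kind kind; /* receive from or send to the peer */
    int peer;       /* rank of the other process */
} sched_step;

/* One receive plus at most 31 sends fits any positive int size. */
#define MAX_STEPS 32

/*
 * Fills steps[] (capacity MAX_STEPS) with the binomial-tree broadcast
 * schedule for one process and returns the number of steps. A non-root
 * process first receives from its parent, then sends to its children.
 */
int build_bcast_schedule(int rank, int size, int root, sched_step *steps)
{
    assert(size > 0 && rank >= 0 && rank < size);
    assert(root >= 0 && root < size && steps != 0);

    const unsigned n = (unsigned)size;
    /* Rank relative to the root, so the root is always relative rank 0. */
    const unsigned rel = ((unsigned)rank + n - (unsigned)root) % n;
    unsigned mask = 1;
    int count = 0;

    /* The lowest set bit of rel identifies the parent. */
    while (mask < n) {
        if (rel & mask) {
            steps[count].kind = STEP_RECV;
            steps[count].peer = (int)((rel - mask + (unsigned)root) % n);
            count++;
            break;
        }
        mask <<= 1;
    }

    /* Children sit at rel + mask for every smaller power of two. */
    mask >>= 1;
    while (mask > 0) {
        if (rel + mask < n) {
            steps[count].kind = STEP_SEND;
            steps[count].peer = (int)((rel + mask + (unsigned)root) % n);
            count++;
        }
        mask >>= 1;
    }
    return count;
}
```

With 8 processes and root 0, rank 0 gets three send steps (to ranks 4, 2,
and 1), and rank 4 gets one receive from rank 0 followed by sends to ranks
6 and 5.
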
>