[mpiwg-hybridpm] Hybrid/Accelerator WG Meeting
james.dinan at gmail.com
Fri Mar 19 13:56:13 CDT 2021
CUDA streams are a virtual resource and they are mapped to a fixed number
of hardware channels allocated to the CUDA context. Hardware channels and
streams both have FIFO execution semantics. The cuStreamWaitValue64
operation is executed by the GPU control processor (this occurs outside of
the SM, thus not involving a thread block) and has the effect of blocking
the hardware channel. So, operations (e.g. mem ops, copies, kernel
launches, etc) that happen to land in the same hardware channel behind the
cuStreamWaitValue64 are also blocked.
For example, a user could do something like MPI_Send_on_stream(...,
stream0) and MPI_Recv_on_stream(..., stream1). CUDA can map stream0 and
stream1 to the same hardware channel with any interleaving of operations
from the streams (as long as FIFO ordering is preserved for each stream
individually). Assuming that the cuStreamWaitValue64 for the receive
operation is inserted into the hardware channel first, it will block the
cuStreamWriteValue64 operation corresponding to the send, leading to
A workaround that could be used today is for users to insert CUDA events to
force the send to be performed prior to the recv. However, this will
unnecessarily serialize the streams in (likely) cases when they map to
separate hardware channels. The situation is obviously not ideal, and we
are looking at ways to fix this.
On Thu, Mar 11, 2021 at 10:57 AM Dan Holmes <danholmes at chi.scot> wrote:
> Hi Jim,
> The CPU callback operation you describe seems to be only one-way (GPU
> notifying/triggering CPU), but a reverse mechanism would be needed to
> complete the pattern, as discussed on the call?
> If the cuStreamWaitValue64 operation works like the “wait" in Stephen’s
> slides, i.e. does not actually sit and wait (blocking a whole thread
> block), but makes the FIFO stream not runnable until the wait condition is
> satisfied, then that looks promising for a return path. Your hint "memops
> are processed from within the GPU control processor that manages stream
> execution” suggests this is true. Doesn’t that provide all that is needed
> to get this working?
> The GPU has a way to signal to the CPU that the send data buffer is ready
> to be sent.
> The CPU has a way to signal to the GPU that the send data buffer can be
> The CPU has a way to signal to the GPU that the recv data buffer is ready
> to be consumed.
> The GPU has a way to signal to the CPU that the recv data buffer can be
> The CPU needs a helper thread to monitor the memory locations targeted by
> the memops. This could use general requests in MPI and rely on the MPI
> progress “thread”/mechanism or be a separate/dedicated CPU thread.
> The GPU can pack/unpack data; the CPU can use GPUDirect.
> Nothing further seems to be necessary for a complete/functional
> implementation. Future changes — to MPI and/or to GPUs/CUDA — are only
> needed/desired to reduce/eliminate performance bottlenecks in this pattern.
> Am I getting that right?
> Dr Daniel Holmes PhD
> Executive Director
> Chief Technology Officer
> CHI Ltd
> danholmes at chi.scot
> On 11 Mar 2021, at 15:07, Jim Dinan via mpiwg-hybridpm <
> mpiwg-hybridpm at lists.mpi-forum.org> wrote:
> Unfortunately, CPU callbacks are not a perfect solution on their own. CUDA
> does not allow CUDA calls from within CPU callbacks, so for example you
> would not be able to launch data packing kernels or peer-to-peer copy
> operations from within the callback. However, you can use CPU callbacks to
> signal a thread in the MPI runtime to process the operation. Another option
> in this design space is to use CUDA memops (e.g. cuStreamWriteValue64 or
> cuStreamWaitValue64) to coordinate between CUDA streams and MPI
> communication helper threads. Because memops are processed from within the
> GPU control processor that manages stream execution, I would expect these
> to have lower overheads than CPU callbacks (although I haven't measured
> On Wed, Mar 10, 2021 at 10:08 PM Junchao Zhang <junchao.zhang at gmail.com>
>> Thanks for the slides. In Stephen's presentation today, it seems
>> with existing techniques, i.e, CPU MPI callback nodes in CUDA graphs, one
>> can solve the MPI GPU problem. Is my understanding correct?
>> --Junchao Zhang
>> On Wed, Mar 10, 2021 at 8:34 PM Jim Dinan via mpiwg-hybridpm <
>> mpiwg-hybridpm at lists.mpi-forum.org> wrote:
>>> Hi All,
>>> I've posted Stephen's slides:
>>> On Mon, Mar 8, 2021 at 11:21 AM Jim Dinan <james.dinan at gmail.com> wrote:
>>>> Hi All,
>>>> We have an invited speaker this week at the HACC WG:
>>>> Topic: CUDA Deep Dive For the MPI Forum HACC WG
>>>> When: Wednesday, March 10 10-11:00am ET
>>>> Connection Info: https://github.com/mpiwg-hybrid/hybrid-issues/wiki
>>>> Speaker: Stephen Jones, NVIDIA
>>>> Stephen Jones is one of the architects of CUDA, working on defining the
>>>> language, the platform, and the hardware that it runs on, to span the needs
>>>> of parallel programming from high performance computing to artificial
>>>> intelligence. Prior to his present position, he lead the Simulation &
>>>> Analytics group at SpaceX, working on large-scale simulation of rocket
>>>> engines. He has worked in diverse other industries, including networking,
>>>> CAD/CAM, and scientific computing. He has been a part of CUDA since 2008.
>>>> PS - Apologies for cross posting on the main list. If you would like to
>>>> continue receiving emails relating to the Hybrid & Accelerator WG, please
>>>> sign up for the mailing list here:
>>> mpiwg-hybridpm mailing list
>>> mpiwg-hybridpm at lists.mpi-forum.org
> mpiwg-hybridpm mailing list
> mpiwg-hybridpm at lists.mpi-forum.org
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the mpiwg-hybridpm