[mpiwg-hybridpm] Questions about MPI CUDA stream integration
james.dinan at gmail.com
Thu Dec 17 10:51:46 CST 2020
Please see below:
On Thu, Dec 17, 2020 at 5:34 AM Joseph Schuchart via mpiwg-hybridpm <
mpiwg-hybridpm at lists.mpi-forum.org> wrote:
> 1) I wonder what the benefit is of integrating stream support into MPI
> libraries over accelerator vendors providing their specific APIs on top
> of what MPI already offers? My understanding from the CUDA graph API is
> that you can add a host node that is a callback executed on the CPU.
> That is what I imagine the MPI library would use and it is what a
> third-party library could do as well, right? Otherwise, what is missing
> from the MPI API?
Host callbacks are not able to make CUDA calls. Today, CUDA-Aware MPI
functions perform CUDA pointer queries to find out where a buffer is,
launch data pack/unpack kernels, perform CUDA memory copies, etc.
I've thought quite a bit about the question of which direction makes more
sense to enable interoperability. One explanation for why I think
interoperability from MPI => CUDA makes more sense is that MPI is *already*
using CUDA/HIP/oneAPI internally to interact with the data on the
accelerator (which is why you can't make MPI calls in host callbacks). In
the CUDA/HIP models, a number of things MPI does to move data, such as
launching kernels and performing memcpys, are performed on streams. Today,
MPI libraries use their own internal streams, which are disconnected from
the streams the user is using to launch kernels. This forces stream
synchronization and causes us to lose the pipelining and overhead hiding
benefits from streams.
> 2) The CUDA stream and graph APIs seem very similar to task dependencies
> in OpenMP, with the same complications when combined with MPI. I think
> Martin hinted at this last night: MPI adds dependencies between nodes in
> one or more graphs that are not exposed to the CUDA scheduler, which
> opens the door for deadlocks. I think we should strive to do better. In
> OpenMP, detached tasks (and the events used to complete them) provide a
> user-controlled completion mechanism. This may be a model that could be
> picked up by the accelerator APIs. Joachim and I have shown that this
> model is easily coupled with callback-based completion notification in
> MPI :) So maybe the burden does not have to be all on MPI here...
The possibility of deadlocking streams is really an artifact of the
queueing model that MPI is being used with, and I think of it as being
orthogonal to this proposal. As I mentioned, graphs are a way to avoid
deadlock entirely. All communication in NCCL is stream-based. One way that
NCCL avoids deadlocking is through communication batching, which this
proposal provides through the start-all and wait-all on stream operations.
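To make the batching argument concrete, here is a toy model in plain C. It is only a sketch under stated assumptions: two ranks, every operation targets the other rank, and every send is synchronous. The names (toy_stream, drains_in_order, drains_batched) are made up for illustration and are not MPI, NCCL, or CUDA APIs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: two ranks, every operation targets the other rank,
 * and every send is synchronous (it completes only once matched). */
typedef enum { OP_SEND, OP_RECV } op_kind;

typedef struct {
    const op_kind *ops;   /* operations in enqueue order */
    size_t count;
} toy_stream;

/* Strict in-order execution: only the operation at the head of each
 * stream is active, so progress requires the two heads to form a
 * send/recv pair. Returns true if both streams run to completion. */
bool drains_in_order(toy_stream a, toy_stream b)
{
    size_t i;
    if (a.count != b.count)
        return false;          /* an unmatched operation never finishes */
    for (i = 0; i < a.count; i++) {
        if (a.ops[i] == b.ops[i])
            return false;      /* send/send or recv/recv: both heads block */
    }
    return true;
}

static size_t count_kind(toy_stream s, op_kind kind)
{
    size_t i, n = 0;
    for (i = 0; i < s.count; i++) {
        if (s.ops[i] == kind)
            n++;
    }
    return n;
}

/* Batched execution: every operation in the batch is started before any
 * of them is waited on, so a send only needs some receive in the peer's
 * batch, not a receive at the peer's head. */
bool drains_batched(toy_stream a, toy_stream b)
{
    return count_kind(a, OP_SEND) == count_kind(b, OP_RECV) &&
           count_kind(a, OP_RECV) == count_kind(b, OP_SEND);
}
```

In this model, if both ranks enqueue { OP_SEND, OP_RECV }, drains_in_order reports a deadlock while drains_batched reports completion, which is the effect a start-all followed by a wait-all is meant to have.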
> The discussion last night got held up by many questions, partly because
> two concepts were mixed: device-side communication initiation (which I
> think is very interesting) and stream integration (which I am less
> convinced of). It might be worth splitting the two aspects into separate
> discussions since you can have one without the other and it might make
> it easier for people to follow along.
This first "kickoff" slide deck was really intended to motivate and get our
creativity going (it seems to have met these goals). Mixing these two
concepts shows an end state that I think is very appealing to GPU users. But
these really aren't the right slides for us to have a deeper technical
discussion (please bear with me, I'll make better slides). I agree we need
to separate these topics going forward. The partitioned discussion
should stay in the persistence WG where it has been homed.
> Since the problem of passing va-args through PMPI came up last night:
> one way to deal with it would be to provide MPI_Wait_enqueuev(reqs,
> type, va_args) to allow PMPI wrappers to inspect the va-args and pass
> them on to MPI. This is the model that printf/vprintf and friends are
> using. I'm not sure whether that works for Fortran though...
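For reference, the printf/vprintf pattern described in the quoted text can be sketched in plain C as follows. This is a minimal illustration, not proposal text: the toy_ names, the TOY_QUEUE_STREAM tag, and the use of an int as the queue argument are all assumptions, and no real MPI or PMPI symbols are involved.

```c
#include <stdarg.h>

/* Hypothetical queue-type tag; not part of MPI or of the proposal. */
enum { TOY_QUEUE_STREAM = 1 };

static int last_seen_stream_id = -1;     /* what the tool wrapper observed */
static int last_enqueued_stream_id = -1; /* what the "library" received */

/* Stand-in for the library's PMPI-level entry point, which consumes
 * the type-specific arguments from the va_list. */
int toy_pwait_enqueuev(int nreqs, int type, va_list ap)
{
    (void)nreqs;
    if (type == TOY_QUEUE_STREAM)
        last_enqueued_stream_id = va_arg(ap, int);
    return 0;
}

/* Tool wrapper: inspect a copy of the argument list, then forward the
 * untouched original to the library. */
int toy_wait_enqueuev(int nreqs, int type, va_list ap)
{
    va_list peek;
    va_copy(peek, ap);
    if (type == TOY_QUEUE_STREAM)
        last_seen_stream_id = va_arg(peek, int);
    va_end(peek);
    return toy_pwait_enqueuev(nreqs, type, ap);
}

/* Variadic entry point, implemented on top of the va_list variant in
 * the same way printf can be implemented on top of vprintf. */
int toy_wait_enqueue(int nreqs, int type, ...)
{
    va_list ap;
    int rc;
    va_start(ap, type);
    rc = toy_wait_enqueuev(nreqs, type, ap);
    va_end(ap);
    return rc;
}
```

A call such as toy_wait_enqueue(2, TOY_QUEUE_STREAM, 7) lets the wrapper record the stream id and still delivers it to the library entry point. The va_copy is what allows the wrapper to read the arguments without consuming them. This says nothing about the Fortran question raised in the quoted text.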
> I hope this is a good place to discuss these topics so everyone feel
> free to comment. Otherwise just take it as input for the next WG meeting :)
I've got a mountain of feedback to work with and I'll work on preparing
better materials for us to explore this topic. I will also grab one of our
stream/graph experts to come and talk to the WG. They can do a much better
job of describing the benefits of the programming model than I have done. I
also have a topic lined up to discuss how NCCL, NVSHMEM, and possibly MPLib
use streams to help set the context.
Appreciate the thoughtful discussion, and look forward to much more. :)