<div dir="ltr"><div>Please see below:</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 17, 2020 at 5:34 AM Joseph Schuchart via mpiwg-hybridpm &lt;<a href="mailto:mpiwg-hybridpm@lists.mpi-forum.org">mpiwg-hybridpm@lists.mpi-forum.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">1) I wonder what the benefit is of integrating stream support into MPI <br>

libraries over accelerator vendors providing their specific APIs on top <br>

of what MPI already offers? My understanding from the CUDA graph API is <br>

that you can add a host node that is a callback executed on the CPU. <br>

That is what I imagine the MPI library would use and it is what a <br>

third-party library could do as well, right? Otherwise, what is missing <br>

from the MPI API?<br></blockquote><div><br></div><div>Host callbacks cannot make CUDA calls, but today's CUDA-aware MPI functions need to: they perform CUDA pointer queries to find out where a buffer is, launch data pack/unpack kernels, perform CUDA memory copies, etc.</div><div><br></div><div>I've thought quite a bit about which direction makes more sense for enabling interoperability. One reason I think interoperability from MPI => CUDA makes more sense is that MPI is *already* using CUDA/HIP/OneAPI internally to interact with the data on the accelerator (which is why you can't make MPI calls in host callbacks). In the CUDA/HIP models, a number of things MPI does to move data, such as launching kernels and performing memcpys, are performed on streams. Today, MPI libraries use their own internal streams, which are disconnected from the streams the user is using to launch kernels. This forces stream synchronization and causes us to lose the pipelining and overhead-hiding benefits of streams.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">2) The CUDA stream and graph APIs seem very similar to task dependencies <br>

in OpenMP, with the same complications when combined with MPI. I think <br>

Martin hinted at this last night: MPI adds dependencies between nodes in <br>

one or more graphs that are not exposed to the CUDA scheduler, which <br>

opens the door for deadlocks. I think we should strive to do better. In <br>

OpenMP, detached tasks (and the events used to complete them) provide a <br>

user-controlled completion mechanism. This may be a model that could be <br>

picked up by the accelerator APIs. Joachim and I have shown that this <br>

model is easily coupled with callback-based completion notification in <br>

MPI :) So maybe the burden does not have to be all on MPI here...<br></blockquote><div><br></div><div>The possibility of deadlocking streams is really an artifact of the queueing model that MPI is being used with, and I think of it as being orthogonal to this proposal. As I mentioned, graphs are a way to avoid deadlock entirely. All communication in NCCL is stream-based. One way that NCCL avoids deadlocking is through communication batching, which this proposal provides through the start-all and wait-all on stream operations.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

The discussion last night got held up by many questions, partly because <br>

two concepts were mixed: device-side communication initiation (which I <br>

think is very interesting) and stream integration (which I am less <br>

convinced of). It might be worth splitting the two aspects into separate <br>

discussions since you can have one without the other and it might make <br>

it easier for people to follow along.<br></blockquote><div><br></div><div>This first "kickoff" slide deck was really intended to motivate the topic and get our creativity going (it seems to have met these goals). Mixing these two concepts shows an end state that I think is very appealing to GPU users. But these really aren't the right slides for a deeper technical discussion (please bear with me, I'll make better slides). I agree we need to separate these topics going forward. The partitioned discussion should stay in the persistence WG, where it has been homed.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Since the problem of passing va-args through PMPI came up last night: <br>

one way to deal with it would be to provide MPI_Wait_enqueuev(reqs, <br>

type, va_args) to allow PMPI wrappers to inspect the va-args and pass <br>

them on to MPI. This is the model that printf/vprintf and friends are <br>

using. I'm not sure  whether that works for Fortran though...<br>

<br>

I hope this is a good place to discuss these topics so everyone feel <br>

free to comment. Otherwise just take it as input for the next WG meeting :)<br></blockquote><div><br></div><div>I've got a mountain of feedback to work with, and I'll prepare better materials for us to explore this topic. I will also grab one of our stream/graph experts to come and talk to the WG; they can do a much better job of describing the benefits of the programming model than I have. To help set the context, I also have a topic lined up on how NCCL, NVSHMEM, and possibly MPLib use streams.</div><div><br></div><div>Appreciate the thoughtful discussion, and look forward to much more. :)</div><div><br></div><div> ~Jim.</div></div></div>