[mpiwg-hybridpm] Questions about MPI CUDA stream integration
Joseph Schuchart
schuchart at hlrs.de
Thu Dec 17 04:34:09 CST 2020
Jim, all,
Thanks for your presentation yesterday (and last time). I had a bunch of
questions but held back in hopes that we could go through the rest of
the slides. Maybe it's better to have this discussion on the mailing
list and save the precious hour in the WG meeting. In essence, my points
boil down to these two:
1) What is the benefit of integrating stream support into MPI
libraries, compared with accelerator vendors providing their own
stream APIs on top of what MPI already offers? My understanding of the
CUDA graph API is that you can add a host node, i.e., a callback that
is executed on the CPU. That is what I imagine the MPI library would
use internally, and a third-party library could do the same, right? If
not, what is missing from the current MPI API?
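For concreteness, here is a minimal sketch of what I have in mind
(stream variant; a graph node would use cudaGraphAddHostNode instead of
cudaLaunchHostFunc). The helper and struct names are made up and error
handling is omitted:

  #include <mpi.h>
  #include <cuda_runtime.h>

  /* Hypothetical helper: enqueue a blocking MPI_Recv into a CUDA
   * stream via a host callback. The args struct must outlive the
   * callback. Note that the callback blocks the stream until the
   * message arrives, which is exactly where the deadlock concerns
   * from point 2 come in. */
  struct recv_args {
    void *buf; int count; MPI_Datatype type;
    int src; int tag; MPI_Comm comm;
  };

  static void CUDART_CB recv_cb(void *userData) {
    struct recv_args *a = (struct recv_args *)userData;
    /* Runs on a CPU thread once all prior work in the stream has
     * completed; no CUDA API calls are allowed in here. */
    MPI_Recv(a->buf, a->count, a->type, a->src, a->tag, a->comm,
             MPI_STATUS_IGNORE);
  }

  /* Kernels launched into the same stream afterwards will not start
   * before the MPI_Recv has returned. */
  void enqueue_recv(cudaStream_t stream, struct recv_args *a) {
    cudaLaunchHostFunc(stream, recv_cb, a);
  }

If a third-party library can do this on top of today's MPI, that is
essentially my question in 1).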
2) The CUDA stream and graph APIs seem very similar to task dependencies
in OpenMP, with the same complications when combined with MPI. I think
Martin hinted at this last night: MPI adds dependencies between nodes in
one or more graphs that are not exposed to the CUDA scheduler, which
opens the door for deadlocks. I think we should strive to do better. In
OpenMP, detached tasks (and the events used to complete them) provide a
user-controlled completion mechanism. This is a model that the
accelerator APIs could pick up as well. Joachim and I have shown that this
model is easily coupled with callback-based completion notification in
MPI :) So maybe the burden does not have to be all on MPI here...
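To make the coupling concrete, here is a sketch (the detach clause is
OpenMP 5.0; the callback-based completion interface on the MPI side is
our proposal and not part of the standard, so the sketch stands in for
it with a polling task; recv_as_task and the other names are made up,
and MPI_THREAD_MULTIPLE is assumed):

  #include <mpi.h>
  #include <omp.h>

  /* Sketch: tie the completion of a detached OpenMP task to an MPI
   * request. Tasks depending on the received data are released only
   * once omp_fulfill_event() is called. Here a helper task polls the
   * request; with callback-based completion notification in MPI the
   * callback would simply call omp_fulfill_event(ev). */
  void recv_as_task(char *buf, int count, int src, int tag, MPI_Comm comm)
  {
    MPI_Request req;
    omp_event_handle_t ev;

    MPI_Irecv(buf, count, MPI_BYTE, src, tag, comm, &req);

    /* Detached task: it is not complete until ev is fulfilled. */
    #pragma omp task detach(ev) depend(out: buf[0:count])
    { /* empty body; completion is signalled externally */ }

    /* Stand-in for the completion callback: poll and fulfill. */
    #pragma omp task firstprivate(req, ev)
    {
      int done = 0;
      while (!done) {
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        #pragma omp taskyield
      }
      omp_fulfill_event(ev);
    }

    /* Consumers express their dependence on the data as usual: */
    #pragma omp task depend(in: buf[0:count])
    { /* use buf */ }
  }

With callback-based completion the polling task disappears; if the
accelerator APIs offered a similar user-controlled completion handle
for stream/graph nodes, the same pattern would work there too.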
The discussion last night got held up by many questions, partly because
two concepts were mixed: device-side communication initiation (which I
think is very interesting) and stream integration (which I am less
convinced of). It might be worth splitting the two aspects into separate
discussions: you can have one without the other, and it would make it
easier for people to follow along.
Since the problem of passing va-args through PMPI came up last night:
one way to deal with it would be to provide MPI_Wait_enqueuev(reqs,
type, va_args), allowing PMPI wrappers to inspect the va-args and pass
them on to MPI unchanged. This is the model that printf/vprintf and
friends use. I'm not sure whether that works for Fortran though...
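Roughly like this (MPI_Wait_enqueue and its v-variant are hypothetical
here and only serve to illustrate the printf/vprintf forwarding model):

  #include <stdarg.h>
  #include <mpi.h>

  /* Hypothetical v-variant provided by the MPI library; 'type' would
   * describe what the trailing arguments are. */
  int PMPI_Wait_enqueuev(MPI_Request *reqs, int type, va_list args);

  /* Tool-side PMPI wrapper for the variadic entry point: inspect the
   * va-args (use va_copy if the wrapper consumes any of them itself)
   * and hand them on unchanged through the v-variant. */
  int MPI_Wait_enqueue(MPI_Request *reqs, int type, ...)
  {
    va_list args;
    va_start(args, type);
    int ret = PMPI_Wait_enqueuev(reqs, type, args);
    va_end(args);
    return ret;
  }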
I hope this is a good place to discuss these topics, so please feel
free to comment. Otherwise, just take it as input for the next WG meeting :)
Cheers
Joseph
--
Dr-Ing. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuchart at hlrs.de