[mpiwg-hybridpm] Questions about MPI CUDA stream integration

Thu Dec 17 04:34:09 CST 2020

Jim, all,

Thanks for your presentation yesterday (and last time). I had a bunch of 
questions but held back in hopes that we could go through the rest of 
the slides. Maybe it's better to have this discussion on the mailing 
list and save the precious hour in the WG meeting. In essence, my points 
boil down to these two:

1) What is the benefit of integrating stream support into MPI 
libraries, compared to accelerator vendors providing their own APIs on 
top of what MPI already offers? My understanding of the CUDA graph API is 
that you can add a host node that is a callback executed on the CPU. 
That is what I imagine the MPI library would use and it is what a 
third-party library could do as well, right? Otherwise, what is missing 
from the MPI API?

2) The CUDA stream and graph APIs seem very similar to task dependencies 
in OpenMP, with the same complications when combined with MPI. I think 
Martin hinted at this last night: MPI adds dependencies between nodes in 
one or more graphs that are not exposed to the CUDA scheduler, which 
opens the door for deadlocks. I think we should strive to do better. In 
OpenMP, detached tasks (and the events used to complete them) provide a 
user-controlled completion mechanism. This may be a model that could be 
picked up by the accelerator APIs. Joachim and I have shown that this 
model is easily coupled with callback-based completion notification in 
MPI :) So maybe the burden does not have to be all on MPI here...

The discussion last night got held up by many questions, partly because 
two concepts were mixed: device-side communication initiation (which I 
think is very interesting) and stream integration (which I am less 
convinced of). It might be worth splitting the two aspects into separate 
discussions since you can have one without the other and it might make 
it easier for people to follow along.

Since the problem of passing va-args through PMPI came up last night: 
one way to deal with it would be to provide MPI_Wait_enqueuev(reqs, 
type, va_args) to allow PMPI wrappers to inspect the va-args and pass 
them on to MPI. This is the model that printf/vprintf and friends are 
using. I'm not sure whether that works for Fortran though...
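
To make the printf/vprintf analogy concrete, here is a minimal sketch in 
plain C. All names in it (wait_enqueue, wait_enqueuev, pwait_enqueuev, 
QUEUE_TYPE_INT) are hypothetical stand-ins that I made up for 
illustration; nothing here is an existing or proposed MPI signature, and 
the assumption that the type argument tells the callee how to decode the 
va-args is mine. The point is only that a wrapper which receives a 
va_list can peek at the arguments through va_copy and still forward the 
untouched list to the underlying implementation.

```c
#include <stdarg.h>

/* Hypothetical queue type: says the va-args hold a single int handle. */
enum { QUEUE_TYPE_INT = 1 };

/* Recorded values so the effect of each layer can be observed. */
int last_seen_by_tool = -1;
int last_seen_by_lib  = -1;

/* Stand-in for the library-side (PMPI-level) entry point.
 * It decodes the va_list according to the type argument. */
int pwait_enqueuev(int nreqs, int type, va_list ap)
{
    (void)nreqs;
    if (type == QUEUE_TYPE_INT)
        last_seen_by_lib = va_arg(ap, int);
    return 0;
}

/* Stand-in for a tool's wrapper around the v-variant.
 * It inspects the arguments on a copy, then forwards the original list. */
int wait_enqueuev(int nreqs, int type, va_list ap)
{
    va_list peek;
    va_copy(peek, ap);              /* do not consume the caller's list */
    if (type == QUEUE_TYPE_INT)
        last_seen_by_tool = va_arg(peek, int);
    va_end(peek);
    return pwait_enqueuev(nreqs, type, ap);
}

/* Variadic convenience entry point, analogous to printf calling vprintf. */
int wait_enqueue(int nreqs, int type, ...)
{
    va_list ap;
    va_start(ap, type);
    int rc = wait_enqueuev(nreqs, type, ap);
    va_end(ap);
    return rc;
}
```

A call such as wait_enqueue(2, QUEUE_TYPE_INT, 7) would then pass through 
the wrapper and reach the library stand-in with the same argument. The 
variadic function itself cannot be intercepted this way, which is why the 
v-variant is the one a tool would wrap.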

I hope this is a good place to discuss these topics, so everyone please 
feel free to comment. Otherwise, just take it as input for the next WG meeting :)

Cheers
Joseph
-- 
Dr-Ing. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuchart at hlrs.de