<div dir="ltr"><div>In PETSc, we don't use non-contiguous MPI datatypes, so packing is not a problem for us. It seems that if MPI can send/recv data without the GPU's participation, then this is a workable solution in limited cases. My main concern about CUDA graphs is that they are not modular.</div>1) With cudaGraphAddKernelNode() etc., one has to see the whole graph at once. If kernels are scattered and deeply wrapped in CPU routines, as in PETSc, then at a higher level one doesn't know the kernels, and hence the nodes, in the graph.<div>2) With graph capture, again, a caller subroutine cannot guarantee that a callee subroutine will execute the same path in the next iteration as the one taken when it was captured. <br><div><br></div><div>--Junchao Zhang<br></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Mar 11, 2021 at 9:07 AM Jim Dinan <<a href="mailto:james.dinan@gmail.com">james.dinan@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Unfortunately, CPU callbacks are not a perfect solution on their own. CUDA does not allow CUDA calls from within CPU callbacks, so, for example, you would not be able to launch data packing kernels or peer-to-peer copy operations from within the callback. However, you can use CPU callbacks to signal a thread in the MPI runtime to process the operation. Another option in this design space is to use CUDA memops (e.g. cuStreamWriteValue64 or cuStreamWaitValue64) to coordinate between CUDA streams and MPI communication helper threads. 
Because memops are processed from within the GPU control processor that manages stream execution, I would expect these to have lower overheads than CPU callbacks (although I haven't measured this).</div><div><br></div><div> ~Jim.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 10, 2021 at 10:08 PM Junchao Zhang <<a href="mailto:junchao.zhang@gmail.com" target="_blank">junchao.zhang@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Jim, <div> Thanks for the slides. From Stephen's presentation today, it seems that with existing techniques, i.e., CPU MPI callback nodes in CUDA graphs, one can solve the MPI GPU problem. Is my understanding correct?</div><div> </div><div> Thanks. </div><div><div><div dir="ltr"><div dir="ltr">--Junchao Zhang</div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 10, 2021 at 8:34 PM Jim Dinan via mpiwg-hybridpm <<a href="mailto:mpiwg-hybridpm@lists.mpi-forum.org" target="_blank">mpiwg-hybridpm@lists.mpi-forum.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr">Hi All,<div><br></div><div>I've posted Stephen's slides: <a href="https://github.com/mpiwg-hybrid/hybrid-issues/tree/master/slides" target="_blank">https://github.com/mpiwg-hybrid/hybrid-issues/tree/master/slides</a><br></div><div><br></div><div>Best,</div><div> ~Jim.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Mar 8, 2021 at 11:21 AM Jim Dinan <<a href="mailto:james.dinan@gmail.com" target="_blank">james.dinan@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi All,<br><div><br></div><div>We have an invited speaker 
this week at the HACC WG:</div><div><br></div><div>Topic: CUDA Deep Dive For the MPI Forum HACC WG</div><div>When: Wednesday, March 10 10-11:00am ET</div><div>Connection Info: <a href="https://github.com/mpiwg-hybrid/hybrid-issues/wiki" target="_blank">https://github.com/mpiwg-hybrid/hybrid-issues/wiki</a></div><div><br></div><div>Speaker: Stephen Jones, NVIDIA</div><div><br></div><div>Stephen Jones is one of the architects of CUDA, working on defining the language, the platform, and the hardware that it runs on, to span the needs of parallel programming from high performance computing to artificial intelligence. Prior to his present position, he led the Simulation & Analytics group at SpaceX, working on large-scale simulation of rocket engines. He has worked in diverse other industries, including networking, CAD/CAM, and scientific computing. He has been a part of CUDA since 2008.<br></div><div><br></div><div>Cheers,</div><div> ~Jim.</div><div><br></div><div>PS - Apologies for cross posting on the main list. If you would like to continue receiving emails relating to the Hybrid & Accelerator WG, please sign up for the mailing list here: <a href="https://lists.mpi-forum.org/mailman/listinfo/mpiwg-hybridpm" target="_blank">https://lists.mpi-forum.org/mailman/listinfo/mpiwg-hybridpm</a>.</div></div>
</blockquote></div></div>
</blockquote></div>
</blockquote></div></div>
</blockquote></div>