<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class="">This is the beginning of a plain text version, which might be more accessible for some. I got tired of cleaning up the OCR output after a while.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Composable Asynchronous Communication Graphs, a.k.a. Project Delorean</div><div class=""><br class=""></div><div class="">Goal: Improve performance of MPI operations in 3 primary areas: reduce the synchronization penalty, overlap communication and compute, and enable more MPI operations to be executed concurrently with application computations.</div><div class=""><br class=""></div><div class="">Mechanism to Achieve Goals:</div><div class="">Building a robust set of extensions to add “true” asynchronous operations to MPI will achieve these goals.</div><div class="">Achieving the primary goals in an elegant and well-designed way will bring additional benefits: an improved user experience when using asynchronous (currently “non-blocking”) operations, hiding the latency of operations, exposing more opportunities for optimizing the performance of MPI operations, offloading more operations to the networking hardware, and enabling the construction of powerful data movement orchestration operations.</div><div class=""><br class=""></div><div class="">What this is _Not_</div><div class="">Although the capabilities described here may operate best with an MPI implementation that executes them with a thread or dedicated hardware, that is _not_ a requirement. The capabilities described here will operate correctly with the typical progress model for MPI. If threads or hardware offload are not used, the “asynchronous” operations described here may be implemented in the same way as the non-blocking operations are currently. The operations 
described here are not designed to provide the capabilities of a computationally focused task execution framework such as CUDA, etc. Although a call to asynchronously execute an application-provided callback function is provided, it is intended as an escape-hatch mechanism for short-running tasks. Likewise, the asynchronous operations described here are primarily data movement operations that have minimal performance impact on an application's execution time. A high-quality implementation will schedule asynchronous data movement operations quickly and give up the CPU as soon as possible. Applications that only use iterative solver algorithms with no opportunities for overlapping communication with compute are unlikely to benefit from the capabilities described here, at least in areas of their workflow that can't be pipelined or overlapped.</div><div class=""><br class=""></div><div class="">Comparison to Continuations + MPIX_Stream</div><div class="">At first glance, the capabilities here appear similar to the current continuations proposal or the MPIX_Stream extensions. Continuations allow for attaching a user callback to a non-blocking operation, invoking the callback when the operation completes. Although the capabilities here include an asynchronous user callback, they go far beyond just that capability. The MPIX_Stream work has overlap in the area of scheduling asynchronous operations, but is mainly focused on enabling communication to compute endpoints on GPUs and similar hardware/software systems.</div><div class=""><br class=""></div><div class="">Improves app-perceived comm perf. Hide [AWS] latency: goes to 0 from the app's point of view. Allow more opportunities for the MPI implementation to optimize: merge and reorder ops. Powerful data movement orchestration to take advantage of hardware offload. App benefits, with benchmarks that show predictions for speedups: synchronization penalty; overall comm cost; time spent in I/O operations; lack-of-asynchrony cost to overlap compute with...</div><div 
class="">which includes broadcasting the size of the next time step, then a halo exchange of data with neighbors, all of which should be performed asynchronously</div><div class=""><br class=""></div><div class="">Discussion</div><div class="">A straightforward overlap of comm with compute. If the size of the halo exchanges didn't depend on the broadcast, this could be done in MPI today, but the dependency makes this impossible to fully overlap currently.</div><div class=""><br class=""></div><div class="">Use case #2</div><div class="">Want to receive a variable-sized message in the background, with the buffer allocated asynchronously as well.</div><div class=""><br class=""></div><div class="">Discussion:</div><div class="">Although MPI_Iprobe and MPI_Irecv provide part of the functionality needed for this use case, they can't provide the future values needed, nor the dependencies between operations or the async memory allocation.</div><div class=""><br class=""></div><div class="">Use case #3</div><div class="">Asynchronously prefetch a compressed file into a buffer on rank 0, then broadcast the buffer of unknown size to other ranks.</div><div class=""><br class=""></div><div class="">This use case demonstrates the power of executing operations in the background. All of the setup for the next time step is overlapped with the current time step, including file I/O, memory allocation, user operations to decompress data, and communication.</div><div class=""><br class=""></div><div class="">Use case #4</div><div class=""><br class=""></div><div class="">Asynchronously write a checkpoint file while the next time step is being computed.</div><div class=""><br class=""></div><div class="">Discussion</div><div class="">This use case again shows fully overlapping I/O with compute. In particular, it shows concurrently executing collective I/O and using the join operator to create a single dependency for the file close operation to depend on.</div><div class=""><br class=""></div><div class="">Use case #5</div><div class="">Parameterize graph 
of data-dependent async operations for re-use</div><div class=""><br class=""></div><div><br class=""><blockquote type="cite" class=""><div class="">On 6 Jul 2022, at 3:09 AM, Koziol, Quincey via mpiwg-hybridpm &lt;<a href="mailto:mpiwg-hybridpm@lists.mpi-forum.org" class="">mpiwg-hybridpm@lists.mpi-forum.org</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class="">
<div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Still in “lab notebook” style, but greatly updated: <a href="https://www.dropbox.com/s/5yee5n6aj1ljwh9/Async%20MPI%20Operations%20-%20July%205,%202022.pdf?dl=0" class="">https://www.dropbox.com/s/5yee5n6aj1ljwh9/Async%20MPI%20Operations%20-%20July%205%2C%202022.pdf?dl=0</a>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
<div class=""><span class="Apple-tab-span" style="white-space:pre"></span>Quincey</div>
</div>
</div></blockquote></div><br class=""></body></html>