[mpiwg-hybridpm] Async MPI ops doc for discussion tomorrow
jeff.science at gmail.com
Wed Jul 6 05:20:44 CDT 2022
This is the beginning of a plain text version, which might be more accessible for some. I got tired of cleaning up the OCR output after a while.
Composable Asynchronous Communication Graphs a.k.a Project Delorean
Goal: Improve the performance of MPI operations in three primary areas: reduce synchronization penalties, overlap communication & compute, and enable more MPI operations to execute concurrently with application computation.
Mechanism to Achieve Goals:
Building a robust set of extensions that add “true” asynchronous operations to MPI will achieve these goals.
Achieving the primary goals in an elegant and well-designed way brings additional benefits: an improved user experience with asynchronous (currently “non-blocking”) operations, hidden operation latency, more opportunities to optimize the performance of MPI operations, more operations offloaded to the networking hardware, and the ability to construct powerful data-movement orchestration operations.
What this is _Not_
Although the capabilities described here may operate best with an MPI implementation that executes them with a thread or dedicated hardware, that is _not_ a requirement. The capabilities described here will operate correctly with the typical progress model for MPI. If threads or hardware offload are not used, the “asynchronous” operations described here may be implemented in the same way as non-blocking operations are currently.

The operations described here are not designed to provide the capabilities of a computationally focused task execution framework such as CUDA, etc. Although a call to asynchronously execute an application-provided callback function is provided, it is intended as an escape-hatch mechanism for short-running tasks. Likewise, the asynchronous operations described here are primarily data-movement operations that have minimal performance impact on an application's execution time. A high-quality implementation will schedule asynchronous data-movement operations quickly and give up the CPU as soon as possible.

Applications that only use iterative solver algorithms with no opportunities for overlapping communication with compute are unlikely to benefit from the capabilities described here, at least in the areas of their workflow that can't be pipelined/overlapped.
Comparison to Continuations + MPIX_Stream
At first glance, the capabilities here appear similar to the current continuations proposal or the MPIX_Stream extensions. Continuations allow attaching a user callback to a non-blocking operation, invoking the callback when the operation completes. Although the capabilities here include an asynchronous user callback, they go far beyond just that capability. The MPIX_Stream work overlaps in the area of scheduling asynchronous operations but is mainly focused on enabling communication to/from compute endpoints on GPUs and similar hardware/software systems.
App benefits:
- Improves app-perceived communication performance: hides latency (it appears to go to ~0 from the app's point of view)
- Allows more opportunities for the MPI implementation to optimize: merge/reorder operations, powerful data-movement orchestration, take advantage of hardware offload
- App benefits backed by benchmarks that show predictions for speedups: synchronization penalty, overall communication cost, time spent in I/O operations, lack-of-asynchrony cost to overlap compute with...
Use case #1
Broadcast the size of the next time step, then a halo exchange of data w/ neighbors, all of which should be performed asynchronously.
A straightforward overlap of communication with compute. If the halo exchanges didn't depend on the broadcast, this could be done in MPI today, but the dependency makes this impossible to fully overlap currently.
Use case #2
Want to receive a variable-sized message in the background, with the buffer allocated asynchronously as well.
Although MPI_Iprobe and MPI_Irecv provide part of the functionality needed for this use case, they can't provide the future values needed, nor the dependencies between operations, nor the asynchronous memory allocation.
Use case #3
Asynchronously prefetch a compressed file into a buffer on rank 0, then bcast the buffer of unknown size to the other ranks.
This use case demonstrates the power of executing operations in the background. All of the setup for the next time step is overlapped with the current time step, including file I/O, memory allocation, user operations to decompress data, and communication.
Use case #4
Asynchronously write a checkpoint file while the next time step is being computed
This use case again shows fully overlapping I/O with compute; in particular, concurrently executing collective I/O and using the join operator to create a single dependency for the file close operation to depend on.
Use case #5
Parameterize a graph of data-dependent async operations for re-use.
> On 6Jul 2022, at 3:09 AM, Koziol, Quincey via mpiwg-hybridpm <mpiwg-hybridpm at lists.mpi-forum.org> wrote:
> Still in “lab notebook” style, but greatly updated: https://www.dropbox.com/s/5yee5n6aj1ljwh9/Async%20MPI%20Operations%20-%20July%205%2C%202022.pdf?dl=0 <https://www.dropbox.com/s/5yee5n6aj1ljwh9/Async%20MPI%20Operations%20-%20July%205,%202022.pdf?dl=0>
> mpiwg-hybridpm mailing list
> mpiwg-hybridpm at lists.mpi-forum.org