[mpiwg-hybridpm] Cholesky code with "continuation"
Joachim Protze
protze at itc.rwth-aachen.de
Wed Mar 30 12:29:18 CDT 2022
Hi Dan, Benson, HACC-wg,
The paper I referred to in the virtual meeting was the Euro-MPI'20 paper
about MPI_Detach: https://dl.acm.org/doi/abs/10.1145/3416315.3416323
In the paper, we also sketch how continuations can be used to implement
C++ code like:
https://github.com/mpi-forum/mpi-issues/issues/288#issuecomment-619053687
The Cholesky code with MPI_Detach (a.k.a. continuations):
The communication task in
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/detach-deps/ch_ompss.c#L92
finishes execution immediately after calling MPI_Detach in
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/detach-deps/ch_ompss.c#L109
Because of the detach clause on the task (an OpenMP 5.0 feature), the task
only completes when the callback is called. Releasing the task's
dependencies depends on task completion. Effectively, this allows task
dependence graphs to span MPI ranks.
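For illustration, here is a minimal sketch of such a communication task. It
is not the actual ch_ompss.c code: the MPIX_Detach prototype is the one I
assume from the paper/wrapper, and fulfill_cb / recv_block are made-up names.

/* Sketch: a receive task whose body returns right after MPIX_Detach, but
 * which only *completes* (and releases its depend(out:...) dependency)
 * once the callback fulfills the detach event. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

/* Assumed wrapper interface (see detach.cpp / the paper): */
typedef void MPIX_Detach_callback(void *data);
int MPIX_Detach(MPI_Request *request, MPIX_Detach_callback *callback, void *data);

static void fulfill_cb(void *data)
{
    omp_event_handle_t *ev = (omp_event_handle_t *)data;
    omp_fulfill_event(*ev);   /* completes the detached task */
    free(ev);
}

void recv_block(double *block, int count, int src, int tag)
{
    omp_event_handle_t event;
    #pragma omp task detach(event) depend(out: block[0])
    {
        MPI_Request req;
        omp_event_handle_t *ev = malloc(sizeof(*ev));
        *ev = event;          /* keep the handle alive beyond the task body */
        MPI_Irecv(block, count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);
        MPIX_Detach(&req, fulfill_cb, ev);
        /* task body ends here; successor tasks still wait for fulfill_cb */
    }
}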
A copy of my implementation of MPI_Detach as a wrapper (detach.cpp) is
next to this file. If you want to try and run the code, make sure to
export MPIX_DETACH=progress so that the wrapper starts the progress thread.
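Conceptually, the progress mode of the wrapper just keeps a list of
(request, callback, data) triples and tests them from a dedicated thread.
Very roughly, and leaving out everything the real detach.cpp additionally
handles (statuses, MPIX_Detach_all, reading MPIX_DETACH, shutdown), a
sketch could look like this:

/* Rough sketch of a progress-thread based detach wrapper; requires
 * MPI_THREAD_MULTIPLE, since the progress thread calls MPI_Test
 * concurrently with the application threads. */
#include <mpi.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

typedef void MPIX_Detach_callback(void *data);

typedef struct entry {
    MPI_Request req;
    MPIX_Detach_callback *cb;
    void *data;
    struct entry *next;
} entry;

static entry *pending = NULL;
static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_t progress_thread;

int MPIX_Detach(MPI_Request *request, MPIX_Detach_callback *cb, void *data)
{
    entry *e = malloc(sizeof(*e));
    e->req = *request;          /* ownership of the request moves to the wrapper */
    e->cb = cb;
    e->data = data;
    *request = MPI_REQUEST_NULL;
    pthread_mutex_lock(&list_mtx);
    e->next = pending;
    pending = e;
    pthread_mutex_unlock(&list_mtx);
    return MPI_SUCCESS;
}

/* Progress thread: poll all pending requests, run callbacks on completion. */
static void *progress_loop(void *arg)
{
    (void)arg;
    for (;;) {
        entry *completed = NULL;
        pthread_mutex_lock(&list_mtx);
        entry **it = &pending;
        while (*it) {
            int done = 0;
            MPI_Test(&(*it)->req, &done, MPI_STATUS_IGNORE);
            if (done) {               /* unlink, remember for later */
                entry *e = *it;
                *it = e->next;
                e->next = completed;
                completed = e;
            } else {
                it = &(*it)->next;
            }
        }
        pthread_mutex_unlock(&list_mtx);
        while (completed) {           /* run callbacks outside the lock */
            entry *e = completed;
            completed = e->next;
            e->cb(e->data);           /* e.g. omp_fulfill_event via fulfill_cb */
            free(e);
        }
        sched_yield();
    }
    return NULL;
}

void detach_progress_start(void)      /* the real wrapper does this lazily
                                         when MPIX_DETACH=progress is set */
{
    pthread_create(&progress_thread, NULL, progress_loop, NULL);
}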
The Cholesky code without detach:
The communication task explicitly waits for the MPI communication to
complete:
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/fine-deps/ch_ompss.c#L65
in this wait function:
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/fine-deps/ch_ompss.c#L86
implemented here:
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/fine-deps/ch_common.c#L82
The taskyield allows other tasks to get scheduled while waiting for the MPI
communication to complete. Unfortunately, the OpenMP semantics of
taskyield are as weak as the progress guarantees from the MPI side ;)
As a result, this code can deadlock if too many communication tasks are
created; only the OmpSs runtime can actually avoid the deadlock.
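The yielding wait boils down to something like the following sketch
(simplified; wait_all_yielding is a made-up name):

/* Simplified sketch of the yielding wait: test the outstanding requests
 * and yield to the OpenMP task scheduler in between. Since taskyield may
 * be a no-op and MPI_Testall only guarantees weak progress, all threads
 * can end up spinning here if too many communication tasks are in flight. */
#include <mpi.h>

void wait_all_yielding(int n, MPI_Request reqs[])
{
    int done = 0;
    while (!done) {
        MPI_Testall(n, reqs, &done, MPI_STATUSES_IGNORE);
        if (!done) {
            #pragma omp taskyield
        }
    }
}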
The other versions of the code reduce the possible concurrency more and
more, avoiding the deadlock but increasing the execution time.
Just recently, I did an extended performance analysis of the code, which
shows that the overhead of the detach version shown in the paper is
purely a result of increasing data transfer with an increasing number of
nodes, amplified by the reduced amount of work per rank due to strong
scaling; i.e., at some point there is not enough work available to
overlap the communication.
Best
Joachim
--
Dr. rer. nat. Joachim Protze
IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
protze at itc.rwth-aachen.de
www.itc.rwth-aachen.de