[mpiwg-hybridpm] Cholesky code with "continuation"
Joachim Protze
protze at itc.rwth-aachen.de
Wed Mar 30 12:29:18 CDT 2022
Hi Dan, Benson, HACC-wg,
The paper I referred to in the virtual meeting was the Euro-MPI'20 paper
about MPI_Detach: https://dl.acm.org/doi/abs/10.1145/3416315.3416323
In the paper, we also sketch how continuations can be used to implement
C++ code like:
https://github.com/mpi-forum/mpi-issues/issues/288#issuecomment-619053687
The Cholesky code with MPI_Detach (a.k.a. continuations):
The communication task in
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/detach-deps/ch_ompss.c#L92
finishes execution immediately after calling MPI_Detach in
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/detach-deps/ch_ompss.c#L109
Because of the detach clause on the task (an OpenMP 5.0 feature), the task
only completes when the callback is called. Releasing the task's
dependencies depends on task completion. Effectively, this allows task
dependence graphs to span MPI ranks.
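For illustration, here is a minimal sketch of such a communication task. It
is not the actual ch_ompss.c code: the MPIX_Detach prototype is the one I
assume from the paper/wrapper, and fulfill_cb / recv_block are made-up names.

/* Sketch: a receive task whose body returns right after MPIX_Detach, but
 * which only *completes* (and releases its depend(out:...) dependency)
 * once the callback fulfills the detach event. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

/* Assumed wrapper interface (see detach.cpp / the paper): */
typedef void MPIX_Detach_callback(void *data);
int MPIX_Detach(MPI_Request *request, MPIX_Detach_callback *callback, void *data);

static void fulfill_cb(void *data)
{
    omp_event_handle_t *ev = (omp_event_handle_t *)data;
    omp_fulfill_event(*ev);   /* completes the detached task */
    free(ev);
}

void recv_block(double *block, int count, int src, int tag)
{
    omp_event_handle_t event;
    #pragma omp task detach(event) depend(out: block[0])
    {
        MPI_Request req;
        omp_event_handle_t *ev = malloc(sizeof(*ev));
        *ev = event;          /* keep the handle alive beyond the task body */
        MPI_Irecv(block, count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD, &req);
        MPIX_Detach(&req, fulfill_cb, ev);
        /* task body ends here; successor tasks still wait for fulfill_cb */
    }
}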
A copy of my implementation of MPI_Detach as a wrapper (detach.cpp) is
next to this file. If you want to try and run the code, make sure to
export MPIX_DETACH=progress so that the wrapper starts the progress thread.
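Conceptually, the progress mode of the wrapper just keeps a list of
(request, callback, data) triples and tests them from a dedicated thread.
Very roughly, and leaving out everything the real detach.cpp additionally
handles (statuses, MPIX_Detach_all, reading MPIX_DETACH, shutdown), a
sketch could look like this:

/* Rough sketch of a progress-thread based detach wrapper; requires
 * MPI_THREAD_MULTIPLE, since the progress thread calls MPI_Test
 * concurrently with the application threads. */
#include <mpi.h>
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

typedef void MPIX_Detach_callback(void *data);

typedef struct entry {
    MPI_Request req;
    MPIX_Detach_callback *cb;
    void *data;
    struct entry *next;
} entry;

static entry *pending = NULL;
static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_t progress_thread;

int MPIX_Detach(MPI_Request *request, MPIX_Detach_callback *cb, void *data)
{
    entry *e = malloc(sizeof(*e));
    e->req = *request;          /* ownership of the request moves to the wrapper */
    e->cb = cb;
    e->data = data;
    *request = MPI_REQUEST_NULL;
    pthread_mutex_lock(&list_mtx);
    e->next = pending;
    pending = e;
    pthread_mutex_unlock(&list_mtx);
    return MPI_SUCCESS;
}

/* Progress thread: poll all pending requests, run callbacks on completion. */
static void *progress_loop(void *arg)
{
    (void)arg;
    for (;;) {
        entry *completed = NULL;
        pthread_mutex_lock(&list_mtx);
        entry **it = &pending;
        while (*it) {
            int done = 0;
            MPI_Test(&(*it)->req, &done, MPI_STATUS_IGNORE);
            if (done) {               /* unlink, remember for later */
                entry *e = *it;
                *it = e->next;
                e->next = completed;
                completed = e;
            } else {
                it = &(*it)->next;
            }
        }
        pthread_mutex_unlock(&list_mtx);
        while (completed) {           /* run callbacks outside the lock */
            entry *e = completed;
            completed = e->next;
            e->cb(e->data);           /* e.g. omp_fulfill_event via fulfill_cb */
            free(e);
        }
        sched_yield();
    }
    return NULL;
}

void detach_progress_start(void)      /* the real wrapper does this lazily
                                         when MPIX_DETACH=progress is set */
{
    pthread_create(&progress_thread, NULL, progress_loop, NULL);
}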
The Cholesky code without detach:
The communication task explicitly waits for the MPI communication to
complete:
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/fine-deps/ch_ompss.c#L65
in this wait function:
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/fine-deps/ch_ompss.c#L86
implemented here:
https://github.com/RWTH-HPC/cholesky_omptasks/blob/mpi-detach/fine-deps/ch_common.c#L82
The taskyield allows other tasks to get scheduled while waiting for the MPI
communication to complete. Unfortunately, the OpenMP semantics of
taskyield are as weak as the progress guarantees from the MPI side ;)
As a result, this code can deadlock if too many communication tasks are
created; only the OmpSs runtime can actually avoid the deadlock.
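The yielding wait boils down to something like the following sketch
(simplified; wait_all_yielding is a made-up name):

/* Simplified sketch of the yielding wait: test the outstanding requests
 * and yield to the OpenMP task scheduler in between. Since taskyield may
 * be a no-op and MPI_Testall only guarantees weak progress, all threads
 * can end up spinning here if too many communication tasks are in flight. */
#include <mpi.h>

void wait_all_yielding(int n, MPI_Request reqs[])
{
    int done = 0;
    while (!done) {
        MPI_Testall(n, reqs, &done, MPI_STATUSES_IGNORE);
        if (!done) {
            #pragma omp taskyield
        }
    }
}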
The other versions of the code reduce the possible concurrency more and
more, avoiding the deadlock but increasing the execution time.
Just recently, I did an extended performance analysis of the code, which
shows that the overhead of the detach version shown in the paper is
purely a result of increasing data transfer with an increasing number of
nodes, amplified by the reduced amount of work per rank due to strong
scaling; i.e., at some point there is not enough work available to
overlap the communication.
Best
Joachim
--
Dr. rer. nat. Joachim Protze
IT Center
Group: High Performance Computing
Division: Computational Science and Engineering
RWTH Aachen University
Seffenter Weg 23
D 52074 Aachen (Germany)
Tel: +49 241 80-24765
Fax: +49 241 80-624765
protze at itc.rwth-aachen.de
www.itc.rwth-aachen.de