[Mpi3-hybridpm] Helper threads via generalised requests
Daniel Holmes
dholmes at staffmail.ed.ac.uk
Wed Jun 19 07:28:31 CDT 2013
Following the recent discussions in the June 2013 MPI Forum meeting and
in the latest Hybrid WG teleconference, I have reviewed ticket 217 and
the proposed text changes to the External Interfaces chapter.
I have several comments on the code example provided.
1) It looks like the code is attempting to perform a global reduction
by first doing a local computation using OpenMP and then using
MPI_Allreduce to combine the partial sums computed by the other MPI
processes. However, there is no connection between "newval" (the result
of the local computation) and "sendbuf"/"recvbuf" (the partial sums
sent to/received from other processes).
2) The local computation would be more naturally expressed with an
OpenMP reduction clause than with an OpenMP for loop followed by an
OpenMP critical section; a sketch follows at the end of these comments.
3) The OpenMP memory model requires an OpenMP barrier after the
MPI_TEAM_LEAVE and before the final OpenMP for loop. This ensures that
the values of sendbuf and recvbuf are flushed from each thread's
temporary view of memory into the common view that all threads can
access; this placement is also sketched after these comments.
4) The amount of local work involved in the MPI_Allreduce does not seem
to warrant the overhead of thread synchronisation unless the sendbuf
and recvbuf arrays are *huge*. Is this a common use case for
MPI_Allreduce?
5) The hoped-for performance benefit comes from threads blocking in an
MPI function that achieves no useful work itself, except allowing those
threads to assist with internal MPI operations initiated by other MPI
functions (possibly issued by other threads). This semantic (blocking in
MPI and making progress until some condition is satisfied) can already
be achieved via MPI_WAIT. If the threads that wish to help do not want
to initiate an additional MPI operation that requires the MPI library to
do additional work, they can use a generalised request instead. Please
see the code below for an example of this, based on the example given
in the latest document from ticket 217. Alternatively, since MPI 3.0
defines MPI_IALLREDUCE, the master thread (although OpenMP single would
be better in this case, because of its implicit barrier) could initiate
a non-blocking reduction and all the threads could then wait on the
returned request, indicating that they all wish to assist with that
particular operation; a sketch of this alternative follows the code
below. If multiple operations are required, MPI_WAITALL can be used in
a similar manner.
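To illustrate comment 2, here is a minimal sketch of the reduction form
(do_work, sendbuf and the count are as in the code below; inside an
already-existing parallel region the "parallel" keyword would be
dropped):

double do_work(int i, double *sendbuf); // defined elsewhere

double local_sum(int count, double *sendbuf) {
    double newval = 0.0;
    int i;
    // The reduction clause gives each thread a private partial sum and
    // combines them at the end of the loop, replacing the separate
    // critical section
    #pragma omp parallel for reduction(+:newval)
    for (i = 0; i < count; i++) {
        newval += do_work(i, sendbuf);
    }
    return newval;
}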
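To illustrate comment 3, here is a sketch of the required placement,
written against the MPI_TEAM_LEAVE interface proposed in ticket 217
(its exact signature is my assumption, as that interface is still only
a proposal):

    MPI_Team_leave(&team); // proposed interface; signature assumed
    // An OpenMP barrier implies a flush, so the recvbuf values written
    // during the collective become visible in every thread's view of
    // memory before any thread reads them in the loop below
    #pragma omp barrier
    #pragma omp for
    for (i = 0; i < COUNT; i++) {
        sendbuf[i] = recvbuf[i];
    }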
Cheers,
Dan.
#include <math.h>
#include <mpi.h>
#include <omp.h>

#define COUNT 1024 // assumed value; not specified in the original example

// Assumed to be defined elsewhere: the local work function and the
// callbacks and state required by MPI_GREQUEST_START
double do_work(int i, double *sendbuf);
int query_fn(void *extra_state, MPI_Status *status);
int free_fn(void *extra_state);
int cancel_fn(void *extra_state, int complete);
extern int extra_state;

void team_fn(void) {
    MPI_Request team;
    double oldval = 0.0, newval = 9.9e99;
    double tolerance = 1.0e-6;
    double sendbuf[COUNT] = { 0.0 };
    double recvbuf[COUNT] = { 0.0 };
    // NB: MPI_WAIT frees the request, so a real code would need to
    // re-create the generalised request on every iteration
    MPI_Grequest_start(query_fn, free_fn, cancel_fn, &extra_state, &team);
    #pragma omp parallel num_threads(omp_get_thread_limit()) \
            shared(newval, oldval, sendbuf, recvbuf, team)
    {
        while (fabs(newval - oldval) > tolerance) {
            double myval = 0.0;
            int i;
            oldval = newval;
            // An OpenMP reduction would be more appropriate here
            #pragma omp for
            for (i = 0; i < COUNT; i++) {
                myval += do_work(i, sendbuf);
            }
            #pragma omp critical
            {
                newval += myval;
            }
            // ??? should there be a connection between newval and
            // sendbuf/recvbuf ???
            #pragma omp master
            {
                MPI_Allreduce(sendbuf, recvbuf, COUNT, MPI_DOUBLE,
                              MPI_SUM, MPI_COMM_WORLD);
                MPI_Grequest_complete(team);
            }
            // This is where the threads help with the MPI_ALLREDUCE
            MPI_Wait(&team, MPI_STATUS_IGNORE);
            // The OpenMP memory model requires a memory flush here;
            // the barrier implies one
            #pragma omp barrier
            #pragma omp for
            for (i = 0; i < COUNT; i++) {
                sendbuf[i] = recvbuf[i];
            }
        }
    }
}
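And here is a sketch of the MPI_IALLREDUCE alternative from comment 5
(this relies on several threads being allowed to block in MPI_WAIT on
the same request, which is exactly the helper-thread semantic under
discussion):

#include <mpi.h>
#include <omp.h>

void iallreduce_team_fn(double *sendbuf, double *recvbuf, int count) {
    MPI_Request req;
    #pragma omp parallel shared(req)
    {
        // single, rather than master, provides the implicit barrier
        // that guarantees req is set before any thread waits on it
        #pragma omp single
        {
            MPI_Iallreduce(sendbuf, recvbuf, count, MPI_DOUBLE,
                           MPI_SUM, MPI_COMM_WORLD, &req);
        }
        // Every thread blocks here, signalling its willingness to
        // assist with progress on the outstanding reduction
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
}

If several operations are outstanding, the same pattern works with
MPI_WAITALL on an array of requests.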
--
Dan Holmes
Applications Developer
EPCC, The University of Edinburgh
James Clerk Maxwell Building
The Kings Buildings
Mayfield Road
Edinburgh, UK
EH9 3JZ
T: +44(0)131 651 3465
E: dholmes at epcc.ed.ac.uk