[mpiwg-hybridpm] Concurrency requirement for collective operations with multiple endpoints
Daniel Holmes
dholmes at epcc.ed.ac.uk
Fri Jun 20 08:46:13 CDT 2014
Hi all,
In summarising the key concepts for the introduction to a paper I am
writing, I think I have discovered an ambiguity in the current wording
for the definition of endpoints.
The current wording in question is as follows (page 245, lines 14-15 -
all references from latest MPI 3.0+ticket380 document, i.e.
https://svn.mpi-forum.org/trac/mpi-forum-web/attachment/ticket/380/mpi-report.5.pdf):
"a collective function on the new communicator must be called
concurrently on every rank in this communicator"
I believe this is intended to exclude the possibility of calling a
collective operation using fewer threads than the number of local endpoints.
For blocking collective operations, it achieves this exclusion: all
collective operations are permitted to be synchronising (even if the
formal definition does not require it) and so it cannot be assumed that
any call to a blocking collective function will return before all other
ranks/endpoints have initiated that operation. Each call to a
synchronising blocking collective "uses up" a thread because that call
will block until other calls are made by the other ranks, i.e. using the
other endpoint communicator handles. A correct program must use a
different thread for each call, even if the first call is an hour
earlier than the others and even if <insert favourite MPI
implementation> doesn't actually synchronise.
For non-blocking collective operations, I claim that the wording is not
sufficiently precise to avoid ambiguity.
Consider the case of starting one single-threaded OS process,
initialising MPI with MPI_THREAD_SINGLE and creating an endpoints
communicator with parent MPI_COMM_SELF and my_num_ep=4 (because this is
one of the easiest cases to reason about). Is it a correct program if it
then calls MPI_IALLREDUCE 4 times using a different endpoint
communicator handle each time and then repeatedly calls MPI_TESTALL
supplying all four endpoint communicator handles until flag=true?
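As an illustration only, the scenario above can be written as a toy Python model. This is not MPI code and does not describe any real implementation: the class ToyIallreduce and its methods are hypothetical stand-ins for MPI_IALLREDUCE, MPI_TEST and MPI_TESTALL. The sketch assumes completion depends only on which endpoint handles have initiated the operation, not on which thread supplied them.

```python
class ToyIallreduce:
    """Toy model of one non-blocking allreduce (sum) over an endpoints
    communicator. Hypothetical: not an MPI API."""

    def __init__(self, num_endpoints):
        self.num_endpoints = num_endpoints
        self.contributions = {}  # endpoint rank -> value supplied at initiation

    def initiate(self, ep_rank, value):
        """Stand-in for MPI_IALLREDUCE on one endpoint handle."""
        if ep_rank in self.contributions:
            raise ValueError("endpoint has already initiated this operation")
        self.contributions[ep_rank] = value
        return ep_rank  # the "request" is simply the endpoint rank

    def test(self, request):
        """Stand-in for MPI_TEST: returns (flag, result) and never blocks."""
        if len(self.contributions) < self.num_endpoints:
            return False, None
        return True, sum(self.contributions.values())

    def testall(self, requests):
        """Stand-in for MPI_TESTALL: MPI_TEST on each request in turn."""
        outcomes = [self.test(r) for r in requests]
        if all(flag for flag, _ in outcomes):
            return True, [value for _, value in outcomes]
        return False, None


def single_thread_scenario(num_ep=4):
    """One thread initiates on every endpoint handle, then polls."""
    op = ToyIallreduce(num_ep)
    requests = [op.initiate(ep, ep + 1) for ep in range(num_ep)]
    polls = 0
    flag = False
    while not flag:
        flag, values = op.testall(requests)
        polls += 1
    return polls, values


print(single_thread_scenario())  # (1, [10, 10, 10, 10])
```

Under this assumption the single-threaded program terminates on the first poll, because every endpoint has already supplied its arguments before completion is attempted.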
The "collective function" has been called "concurrently on every rank in
this communicator" in that it has been initiated by all ranks/endpoints
before any rank/endpoint even attempts to complete the operation: it is
in progress/active for all ranks/endpoints simultaneously (even stronger
than concurrently).
All the buffers and other arguments have been supplied - there is no
good reason that the operation cannot succeed, although the
implementation would be very different to one that can assume the
'right' number of threads will eventually participate.
I'm using MPI_TESTALL, which has the same effect as MPI_TEST called for
all requests in some arbitrary order (cf. page 59, lines 26-29),
because then there is no reason for an attempt to complete the
collective for one rank/endpoint to prevent future attempts to complete
the collective for other ranks/endpoints.
I believe the related question that uses MPI_WAITALL is also useful.
Assume MPI_WAITALL is implemented as MPI_WAIT for each request (page 59,
lines 26-29). Assume that all non-blocking collectives are implemented
so that the completion call "synchronises the processes" (page 197 lines
17-19). The first MPI_WAIT relies on the unreachable future MPI_WAITs
and cannot complete due to the same deadlock as for blocking collective
operations.
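The deadlock argument can be checked with a small toy model, again in Python rather than MPI. It assumes the two conditions stated above: MPI_WAITALL behaves as sequential MPI_WAIT calls, and each completion call synchronises, so no wait returns until every endpoint has a thread blocked in its wait. The function name waitall_outcome and the encoding of threads as lists of endpoint ranks are hypothetical, chosen only for this sketch; each endpoint rank is assumed to appear exactly once across all threads.

```python
def waitall_outcome(threads, num_endpoints):
    """Decide whether synchronising waits can complete.

    threads: list of lists; each inner list is the order in which one
    thread calls a synchronising MPI_WAIT on endpoint requests.
    """
    # A thread can be blocked in only one wait at a time: its first one.
    # Its later waits are unreachable until the first one returns.
    blocked = {order[0] for order in threads if order}
    # A synchronising wait returns only when every endpoint is waiting.
    if len(blocked) == num_endpoints:
        return "completes"
    return "deadlock"


print(waitall_outcome([[0, 1, 2, 3]], 4))        # deadlock
print(waitall_outcome([[0], [1], [2], [3]], 4))  # completes
```

In this model, the only schedules that complete are those with one thread per endpoint, which matches the situation for blocking collective operations.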
By "collective function", do we mean "collective operation" or
"collective completion function"?
By "concurrently", do we mean "without specifying a particular
chronological ordering or interleaving" (cf. page 41, lines 13-14) or
"simultaneously on multiple independently executing threads", which is
captured better by the implications of permitting synchronising behaviour?
Basically, inside my MPI library, can I count threads entering a
collective operation or do I have to track endpoint handles supplied for
a collective operation? I want the wording to specify that each thread
can only carry one endpoint handle into the collective operation and
that other endpoint handles will be supplied by other threads because (I
believe) that simplifies implementation. However, that is a much weaker
reason for excluding this possibility than "it cannot work without
changing existing definitions of terms like 'synchronising'", etc.
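The difference between the two bookkeeping strategies can be sketched as follows. Both classes are hypothetical toy models in Python, not part of any MPI library: one decides readiness by counting distinct threads that have entered the collective, the other by tracking which endpoint handles have been supplied. The helper drive() plays the single-threaded program that supplies all four handles itself.

```python
import threading


class ThreadCountingCollective:
    """Ready once num_endpoints distinct threads have entered."""

    def __init__(self, num_endpoints):
        self.num_endpoints = num_endpoints
        self.threads_seen = set()

    def enter(self, ep_rank):
        # The endpoint handle is ignored; only the calling thread counts.
        self.threads_seen.add(threading.get_ident())
        return len(self.threads_seen) == self.num_endpoints


class HandleTrackingCollective:
    """Ready once num_endpoints distinct endpoint handles have been supplied."""

    def __init__(self, num_endpoints):
        self.num_endpoints = num_endpoints
        self.handles_seen = set()

    def enter(self, ep_rank):
        # The calling thread is ignored; only the handle counts.
        self.handles_seen.add(ep_rank)
        return len(self.handles_seen) == self.num_endpoints


def drive(collective, num_ep=4):
    """One thread enters the collective once per endpoint handle."""
    return [collective.enter(ep) for ep in range(num_ep)]


print(drive(ThreadCountingCollective(4)))  # [False, False, False, False]
print(drive(HandleTrackingCollective(4)))  # [False, False, False, True]
```

The thread-counting model never becomes ready for the single-threaded program, while the handle-tracking model does. Which behaviour is required depends on how "concurrently" is read in the wording quoted earlier.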
Incidentally, what does "implementations are allowed to synchronize
processes during the completion of a non-blocking collective operation"
(page 197 lines 17-19) mean if the processes are using repeated calls of
MPI_TEST to complete the operation?
Cheers,
Dan.
--
Dan Holmes
Applications Consultant in HPC Research
EPCC, The University of Edinburgh
James Clerk Maxwell Building
The Kings Buildings
Mayfield Road
Edinburgh, UK
EH9 3JZ
T: +44(0)131 651 3465
E: dholmes at epcc.ed.ac.uk
*Please consider the environment before printing this email.*
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.