[mpiwg-hybridpm] Concurrency requirement for collective operations with multiple endpoints

Fri Jun 20 08:46:13 CDT 2014

Hi all,

In summarising the key concepts for the introduction to a paper I am 
writing, I think I have discovered an ambiguity in the current wording 
for the definition of endpoints.

The current wording in question is as follows (page 245, lines 14-15 - 
all references from latest MPI 3.0+ticket380 document, i.e. 
https://svn.mpi-forum.org/trac/mpi-forum-web/attachment/ticket/380/mpi-report.5.pdf):
"a collective function on the new communicator must be called 
concurrently on every rank in this communicator"

I believe this is intended to exclude the possibility of calling a 
collective operation using fewer threads than the number of local endpoints.

For blocking collective operations, it achieves this exclusion: all 
collective operations are permitted to be synchronising (even if the 
formal definition does not require it) and so it cannot be assumed that 
any call to a blocking collective function will return before all other 
ranks/endpoints have initiated that operation. Each call to a 
synchronising blocking collective "uses up" a thread because that call 
will block until other calls are made by the other ranks, i.e. using the 
other endpoint communicator handles. A correct program must use a 
different thread for each call, even if the first call is an hour 
earlier than the others and even if <insert favourite MPI 
implementation> doesn't actually synchronise.

For non-blocking collective operations, I claim that the wording is not 
sufficiently precise to avoid ambiguity.

Consider the case of starting one single-threaded OS process, 
initialising MPI with MPI_THREAD_SINGLE and creating an endpoints 
communicator with parent MPI_COMM_SELF and my_num_ep=4 (because this is 
one of the easiest cases to reason about). Is it a correct program if it 
then calls MPI_IALLREDUCE 4 times using a different endpoint 
communicator handle each time and then repeatedly calls MPI_TESTALL 
supplying all four endpoint communicator handles until flag=true?

The "collective function" has been called "concurrently on every rank in 
this communicator" in that it has been initiated by all ranks/endpoints 
before any rank/endpoint even attempts to complete the operation: it is 
in progress/active for all ranks/endpoints simultaneously (even stronger 
than concurrently).
All the buffers and other arguments have been supplied - there is no 
good reason that the operation cannot succeed, although the 
implementation would be very different to one that can assume the 
'right' number of threads will eventually participate.
I'm using MPI_TESTALL, which has the same effect as MPI_TEST called for 
all requests in some arbitrary order (cf. page 59, lines 26-29), 
because then there is no reason for an attempt to complete the 
collective for one rank/endpoint to prevent future attempts to complete 
the collective for other ranks/endpoints.
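
To make the scenario concrete, here is a minimal sketch in plain C. It is 
a toy model, not MPI: the names toy_coll, toy_iallreduce, toy_test, 
toy_testall and run_single_thread are all hypothetical, and the model 
assumes that a request can complete as soon as every endpoint has 
initiated the operation. Under that assumption, one thread can start the 
operation for all four endpoints and then complete it by polling.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_EP 4  /* my_num_ep in the example above */

/* Toy stand-in for one non-blocking allreduce (sum) over NUM_EP endpoints. */
typedef struct {
    int  contribution[NUM_EP];
    bool started[NUM_EP];
    int  num_started;
} toy_coll;

/* Toy analogue of MPI_IALLREDUCE: record the input and return at once. */
void toy_iallreduce(toy_coll *c, int ep, int value)
{
    assert(ep >= 0 && ep < NUM_EP && !c->started[ep]);
    c->contribution[ep] = value;
    c->started[ep] = true;
    c->num_started++;
}

/* Toy analogue of MPI_TEST. ASSUMPTION: the request for any endpoint can
 * complete once every endpoint has initiated the operation. */
bool toy_test(const toy_coll *c, int ep, int *result)
{
    assert(ep >= 0 && ep < NUM_EP);
    if (c->num_started < NUM_EP)
        return false;
    int sum = 0;
    for (int i = 0; i < NUM_EP; i++)
        sum += c->contribution[i];
    *result = sum;
    return true;
}

/* Toy analogue of MPI_TESTALL: true only if every request tests complete. */
bool toy_testall(const toy_coll *c, int results[NUM_EP])
{
    for (int ep = 0; ep < NUM_EP; ep++)
        if (!toy_test(c, ep, &results[ep]))
            return false;
    return true;
}

/* The single-threaded program from the question: four initiations, then
 * repeated TESTALL. Returns the number of TESTALL calls that were needed. */
int run_single_thread(int results[NUM_EP])
{
    toy_coll c = {0};
    for (int ep = 0; ep < NUM_EP; ep++)
        toy_iallreduce(&c, ep, ep + 1);  /* inputs 1, 2, 3, 4 */

    int polls = 0;
    bool flag = false;
    while (!flag) {
        flag = toy_testall(&c, results);
        polls++;
    }
    return polls;
}
```

In this model the polling loop terminates on its first pass and every 
endpoint receives the sum of the four inputs. Whether a real 
implementation is required to behave like this is exactly the question.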

I believe the related question that uses MPI_WAITALL is also useful. 
Assume MPI_WAITALL is implemented as MPI_WAIT for each request (page 59, 
lines 26-29). Assume that all non-blocking collectives are implemented 
so that the completion call "synchronises the processes" (page 197 lines 
17-19). The first MPI_WAIT relies on the unreachable future MPI_WAITs 
and cannot complete due to the same deadlock as for blocking collective 
operations.
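
The difference between the two completion styles can be sketched with 
another toy model in plain C (again not MPI; sync_coll, sync_start, 
sync_test, waitall_sequential and testall_polling are hypothetical 
names). The model assumes the strongest permitted behaviour: a request 
completes only after every endpoint has both started the operation and 
entered a completion call. It also assumes that a test call counts as 
entering completion, which is itself one of the open questions. Waits are 
bounded spins so that the demonstration terminates.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_EP 4

typedef struct {
    bool started[NUM_EP];     /* endpoint has initiated the operation   */
    bool completing[NUM_EP];  /* endpoint has entered a completion call */
} sync_coll;

/* Count how many of the NUM_EP flags are set. */
int count_set(const bool flags[NUM_EP])
{
    int n = 0;
    for (int i = 0; i < NUM_EP; i++)
        if (flags[i])
            n++;
    return n;
}

/* Toy initiation call for endpoint ep. */
void sync_start(sync_coll *c, int ep)
{
    assert(ep >= 0 && ep < NUM_EP);
    c->started[ep] = true;
}

/* One completion attempt (toy MPI_TEST). ASSUMPTION: completion
 * synchronises, so it succeeds only after every endpoint has started
 * and has entered a completion call at least once. */
bool sync_test(sync_coll *c, int ep)
{
    assert(ep >= 0 && ep < NUM_EP);
    c->completing[ep] = true;
    return count_set(c->started) == NUM_EP &&
           count_set(c->completing) == NUM_EP;
}

/* MPI_WAITALL modelled as MPI_WAIT on each request in turn. Each wait
 * is a bounded spin; false means the wait never completed. */
bool waitall_sequential(sync_coll *c, int max_polls)
{
    for (int ep = 0; ep < NUM_EP; ep++) {
        int polls = 0;
        while (!sync_test(c, ep))
            if (++polls >= max_polls)
                return false;  /* stuck: later endpoints never arrive */
    }
    return true;
}

/* MPI_TESTALL-style polling: attempt every request on each pass. */
bool testall_polling(sync_coll *c, int max_passes)
{
    for (int pass = 0; pass < max_passes; pass++) {
        int done = 0;
        for (int ep = 0; ep < NUM_EP; ep++)
            if (sync_test(c, ep))
                done++;
        if (done == NUM_EP)
            return true;
    }
    return false;
}
```

With all four endpoints started by one thread, waitall_sequential never 
gets past the first request, whereas testall_polling succeeds because 
every request is attempted before any of them is retried.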

By "collective function", do we mean "collective operation" or 
"collective completion function"?
By "concurrently", do we mean "without specifying a particular 
chronological ordering or interleaving" (cf. page 41, lines 13-14) or 
"simultaneously on multiple independently executing threads", which is 
captured better by the implications of permitting synchronising behaviour?

Basically, inside my MPI library, can I count threads entering a 
collective operation or do I have to track endpoint handles supplied for 
a collective operation? I want the wording to specify that each thread 
can only carry one endpoint handle into the collective operation and 
that other endpoint handles will be supplied by other threads because (I 
believe) that simplifies implementation. However, that is a much weaker 
reason for excluding this possibility than "it cannot work without 
changing existing definitions of terms like 'synchronising', etc."
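
The two bookkeeping strategies can be contrasted in a small plain-C 
sketch. This is a toy model, not code from any MPI library: arrivals, 
enter_collective, complete_by_thread_count and complete_by_handles are 
hypothetical names, and threads are identified by small made-up integer 
ids rather than real thread handles.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_EP 4

/* What a library might record as calls enter one collective operation. */
typedef struct {
    unsigned threads_seen;    /* bit t set: toy thread id t has entered  */
    unsigned endpoints_seen;  /* bit e set: endpoint handle e was given  */
} arrivals;

/* Record that toy thread thread_id entered with endpoint handle ep. */
void enter_collective(arrivals *a, int thread_id, int ep)
{
    assert(thread_id >= 0 && thread_id < 32);
    assert(ep >= 0 && ep < NUM_EP);
    a->threads_seen   |= 1u << thread_id;
    a->endpoints_seen |= 1u << ep;
}

/* Number of set bits in a mask. */
int count_bits(unsigned bits)
{
    int n = 0;
    for (; bits != 0; bits >>= 1)
        n += (int)(bits & 1u);
    return n;
}

/* Strategy 1: count distinct threads. Only valid if each thread carries
 * exactly one endpoint handle into the operation. */
bool complete_by_thread_count(const arrivals *a)
{
    return count_bits(a->threads_seen) == NUM_EP;
}

/* Strategy 2: track which endpoint handles have been supplied,
 * regardless of which thread supplied them. */
bool complete_by_handles(const arrivals *a)
{
    return a->endpoints_seen == (1u << NUM_EP) - 1u;
}
```

If four threads each supply one handle, both checks agree. If a single 
thread supplies all four handles, only the handle-tracking check reports 
that everyone has arrived; the thread-counting check waits forever for 
three threads that will never come.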

Incidentally, what does "implementations are allowed to synchronize 
processes during the completion of a non-blocking collective operation" 
(page 197 lines 17-19) mean if the processes are using repeated calls of 
MPI_TEST to complete the operation?

Cheers,
Dan.

-- 
Dan Holmes
Applications Consultant in HPC Research
EPCC, The University of Edinburgh
James Clerk Maxwell Building
The Kings Buildings
Mayfield Road
Edinburgh, UK
EH9 3JZ
T: +44(0)131 651 3465
E: dholmes at epcc.ed.ac.uk
