[Mpi3-hybridpm] Endpoints Proposal
balaji at mcs.anl.gov
Tue Mar 19 12:56:44 CDT 2013
On 03/19/2013 12:44 PM US Central Time, Sur, Sayantan wrote:
> For example, do we want to create independent endpoints for each
> thread? What would be the motivation of doing that? One thought is
> that it could help existing MPI+OpenMP codes to be ported in a
> conceptually similar manner (just that now each rank is really an
> endpoint). Now each endpoint can inject and receive messages to
> remote ranks/endpoints independently. However, there is a memory cost
> to pay in this model: on a system with N nodes and P cores per node,
> the memory cost per node is O(NP^2).
The motivation is not to create one endpoint per thread. The motivation
is to give a model where the number of endpoints is not required to
always be 1 per process. Yes, in the extreme case, the user can create
as many endpoints as processes, but that's just an extreme usage case.
If we did limit how many endpoints can be created per address space,
what would that limit be?
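To make the "not always 1 per process" point concrete, here is a hedged sketch of the proposed interface. The name MPI_Comm_create_endpoints and its signature follow the draft endpoints proposal and may well change before standardization; nothing here is implementable against MPI-3 as ratified:

```c
#include <mpi.h>

/* Sketch only: MPI_Comm_create_endpoints is the *proposed* endpoints
 * call, not part of MPI-3. Each process chooses its own endpoint
 * count -- here rank 0 asks for 4 and everyone else asks for 1 -- so
 * the model forces neither one endpoint per process nor one per
 * thread. The call is collective over the parent communicator and
 * returns my_num_ep handles, each a distinct rank in the resulting
 * endpoints communicator. */
int create_my_endpoints(MPI_Comm parent, MPI_Comm ep_comms[])
{
    int rank;
    MPI_Comm_rank(parent, &rank);
    int my_num_ep = (rank == 0) ? 4 : 1;
    return MPI_Comm_create_endpoints(parent, my_num_ep,
                                     MPI_INFO_NULL, ep_comms);
}
```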
> Now suppose we didn't really want to establish that sort of
> all-to-all connectivity, and optimize for memory requirement. Then we
> would have to restrict the communicator to some neighborhood. With P
> increasing fast, it is likely that most commonly used neighborhoods
> of stencils can span just one node. In that case, why would the app
> choose to use message passing within the node through MPI? The RMA
> shared memory interfaces that you worked on for MPI-3 are a good fit
> for this use case, if the programmer even wants to use MPI for
> sharing data on the node.
Is "we" the application user or the MPI implementer? You are mixing the
two. From the MPI implementation's perspective, I don't think we need
to restrict anything. The application user has the flexibility to do
what you suggested (i.e., use smaller parent communicators) or create
fewer endpoints on a larger communicator.
> AIUI one of the motivating points about endpoints is that performance
> through MPI_THREAD_MULTIPLE is so horrible. There were several slides
> presented in previous forums (and papers) that mention this. Now, is
> that a fault of the programming model or implementation? Can
> something be changed in the MPI standard that improves performance
> for MPI_THREAD_MULTIPLE? It seems to me like MPI_THREAD_MULTIPLE
> already provides you the flexibility to inject messages from threads
> if you so choose. The endpoint method is simply an optimization on
> top of that.
This is incorrect. This is not the motivation. It might be a side
benefit, but not what we are aiming for. I'd rather not dilute the
discussion by going into this.
> ii. There was consensus among the architects that the Endpoint comm
> created once and freed n times is not a preferred way to go. I think
> I have already made this point many times during the plenary :-)
You could look at it as "freed n times", but IMO that's just bad
wording. I think of it as the creation is collective over the parent
communicator, and the freeing is collective over the child communicator.
This is exactly like COMM_SPLIT, so we are being consistent.
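The COMM_SPLIT analogy can be shown with standard MPI-3 calls; this is a minimal sketch (the helper name and the even/odd split are ours) of the same "one collective create, n collective frees" pattern:

```c
#include <mpi.h>

/* The pattern being defended: creation is collective over the parent
 * communicator, while each resulting child communicator is freed
 * collectively by only the group that holds it. */
void split_and_free(MPI_Comm parent)
{
    int rank;
    MPI_Comm_rank(parent, &rank);

    MPI_Comm child;
    /* Collective over parent: every member of parent participates,
     * but members end up in different child communicators. */
    MPI_Comm_split(parent, rank % 2, rank, &child);

    /* ... communicate on child ... */

    /* Collective over child only: the one create over the parent is
     * matched by a separate free per child -- "freed n times". */
    MPI_Comm_free(&child);
}
```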
> iii. The UPCMPI_World_comm_query(&upc_comm) call demonstrates that
> you need to interact with the UPC runtime so that it does the "right
> thing" based on whether it launched UPC in threaded or process mode.
> If that is the case, what is preventing the UPC community from just
> coming up with a UPCMPI_Allreduce() that would adapt to either
> thread/process cases. In the thread case, it could provide MPI with a
> datatype that points to memory bits being used for reduction. In the
> process case, it could pass the data straight into MPI. What do we
> gain by making the UPC program call MPI directly?
That's a very good point. Unfortunately, not all MPI operations can
benefit from user-defined datatypes; MPI_Allreduce, in your example,
cannot. Also, UPC would have to keep track of which processes/threads
the communicator covers. The example with upc_comm is simple, but what
if the user splits it further? MPI obviously keeps track of this, but
UPC does not.