[Mpi3-hybridpm] Endpoints Proposal

Tue Mar 19 12:44:04 CDT 2013

Hi Jim,

Thanks for sending the proposal to the WG and re-activating this list. We discussed the examples during the plenary, and I wanted to follow up with some more commentary. I was also wondering if we could have a higher-level discussion of what we intend to achieve with the HybridPM WG.

For example, do we want to create independent endpoints for each thread? What would be the motivation for doing that? One thought is that it could help existing MPI+OpenMP codes to be ported in a conceptually similar manner (just that now each rank is really an endpoint). Each endpoint can then inject and receive messages to and from remote ranks/endpoints independently. However, there is a memory cost to pay in this model. On a system with N nodes and P cores per node, the memory cost per node is O(NP^2), assuming each of the P endpoints on a node keeps state for all NP endpoints in the system.
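
To put rough numbers on the O(NP^2) estimate, here is a back-of-the-envelope sketch in C. It assumes one endpoint per core, full all-to-all connectivity, and a fixed amount of state per peer. The function names and the per-peer size are illustrative assumptions, not measurements from any MPI implementation.

```c
#include <stdint.h>

/* Hypothetical helper: number of per-peer state entries held on one node,
 * assuming one endpoint per core and all-to-all connectivity.
 * Each of the P local endpoints tracks all N*P endpoints in the job. */
uint64_t peer_entries_per_node(uint64_t nodes, uint64_t cores_per_node)
{
    uint64_t total_endpoints = nodes * cores_per_node;  /* N * P */
    return cores_per_node * total_endpoints;            /* P * (N * P) = N * P^2 */
}

/* Hypothetical helper: bytes per node for an assumed per-peer state size. */
uint64_t peer_state_bytes_per_node(uint64_t nodes, uint64_t cores_per_node,
                                   uint64_t bytes_per_peer)
{
    return peer_entries_per_node(nodes, cores_per_node) * bytes_per_peer;
}
```

For instance, with N = 1024 and P = 64 this gives 4,194,304 entries per node; at an assumed 64 bytes per peer that is 256 MiB per node spent on connection state alone.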

Now suppose we didn't really want to establish that sort of all-to-all connectivity, and wanted to optimize for memory instead. Then we would have to restrict the communicator to some neighborhood. With P increasing fast, it is likely that most commonly used stencil neighborhoods will span just one node. In that case, why would the app choose to use message passing within the node through MPI? The RMA shared memory interfaces that you worked on for MPI-3 are a good fit for this use case, if the programmer really wants to use MPI for sharing data on the node at all.

AIUI one of the motivating points for endpoints is that performance with MPI_THREAD_MULTIPLE is so horrible. Several slides presented at previous Forum meetings (and several papers) mention this. Now, is that a fault of the programming model or of the implementation? Can something be changed in the MPI standard that improves performance for MPI_THREAD_MULTIPLE? It seems to me that MPI_THREAD_MULTIPLE already gives you the flexibility to inject messages from threads if you so choose. The endpoint method is simply an optimization on top of that.

In that sense, how is this any different from the MPI Datatype discussion we had w.r.t. MPI_Sendv (Fab's proposal)? The Forum was pretty clear in that case that it needs to understand the fundamental limitations of Datatypes and not just work around them. What do you think?

---

I discussed the examples with the Threading architects of Intel's runtimes and they provided the following feedback for the OpenMP examples:

i. Slides 9, 10: There is no implied barrier after #pragma omp master. Therefore, the program has a race condition in which threads may attach to the endpoint before the master has created it.

ii. There was consensus among the architects that the Endpoint comm created once and freed n times is not a preferred way to go. I think I have already made this point many times during the plenary :-)
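
To illustrate point i, here is a minimal sketch of the fix, with the caveat that it is not the code from the slides: shared_ep is a plain integer standing in for the endpoint communicator, the value 42 and the function name are made up for illustration, and no MPI calls are made. It only shows where the explicit barrier has to go.

```c
/* Sketch only: compile with OpenMP enabled (e.g. -fopenmp).
 * Returns how many threads read shared_ep before it was set.
 * With the explicit barrier in place this is always 0. */
int threads_missing_endpoint(void)
{
    int shared_ep = 0;  /* 0 means "endpoint not created yet" (stand-in value) */
    int missing = 0;

    #pragma omp parallel shared(shared_ep, missing)
    {
        #pragma omp master
        {
            shared_ep = 42;  /* stand-in for the master creating the endpoint */
        }
        /* omp master has no implied barrier, so the other threads must
         * wait here explicitly before they touch shared_ep. */
        #pragma omp barrier

        if (shared_ep != 42) {
            #pragma omp atomic
            missing++;
        }
    }
    return missing;
}
```

Removing the barrier reintroduces the race described in point i. Another option would be #pragma omp single, which does have an implied barrier at its end unless nowait is specified.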

Another comment on the UPC example from my side:

iii. The UPCMPI_World_comm_query(&upc_comm) call demonstrates that you need to interact with the UPC runtime so that it does the "right thing" based on whether it launched UPC in threaded or process mode. If that is the case, what is preventing the UPC community from just coming up with a UPCMPI_Allreduce() that would adapt to either the thread or the process case? In the thread case, it could provide MPI with a datatype that points to the memory being used for the reduction. In the process case, it could pass the data straight into MPI. What do we gain by making the UPC program call MPI directly?

Thanks,
Sayantan

> -----Original Message-----
> From: mpi3-hybridpm-bounces at lists.mpi-forum.org [mailto:mpi3-hybridpm-
> bounces at lists.mpi-forum.org] On Behalf Of Jim Dinan
> Sent: Thursday, March 14, 2013 7:30 AM
> To: mpi3-hybridpm at lists.mpi-forum.org
> Subject: [Mpi3-hybridpm] Endpoints Proposal
> 
> Hi All,
> 
> I've attached the slides from the endpoints presentation yesterday.  I
> updated the slides with corrections, additions, and suggestions gathered
> during the presentation.
> 
> We received a lot of feedback, and a lot of support from the Forum.  A
> refinement to the interface that eliminates MPI_Comm_attach was
> suggested:
> 
> int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int my_num_ep,
>                               MPI_Info info, MPI_Comm output_comms[]);
> 
> This function would be collective over parent_comm, and produce an array
> of communicator handles, one per endpoint.  Threads pick up the endpoint
> they wish to use, and start using it; there would be no need for
> attach/detach.
> 
> This interface addresses two concerns about the interface originally
> presented that were raised by the Forum.  (1) The suggested interface does
> not require THREAD_MULTIPLE -- the original attach function could always
> require multiple.  (2) It places fewer dependencies on the threading model.
> In particular, stashing all relevant state in the MPI_Comm object removes a
> dependence on thread-local storage.
> 
> Thanks to everyone for your help and feedback.  Let's have some discussion
> about the suggested interface online, and follow up in a couple weeks with a
> WG meeting.
> 
> Cheers,
>   ~Jim.