[Mpi3-hybridpm] Endpoints Proposal

Tue Mar 19 20:51:46 CDT 2013

Jeff,

Thanks for your explanation. I am trying to find out the motivations for
the proposal. The archives for mpi3-hybridpm were a little sparse ;-)

I agree that there are cases where you would just like to use 1 MPI process
per node with a useful threading runtime within the node. I was suggesting
that in this case it might be possible to use the MPI-3 Shared Memory RMA
interface (I was not suggesting POSIX shm and hybrid process/thread
queues). Would you say that it doesn't satisfy your use case? If not, why
not? After all, WE designed that API.

There are several approaches to hybrid programming. I am trying to
understand how we jumped to the conclusion that endpoints are the answer
and that we should now be discussing the API. I don't see this discussion
on the WG email list.

Thanks,
Sayantan.

On Tue, Mar 19, 2013 at 5:03 PM, Jeff Hammond <jhammond at alcf.anl.gov> wrote:

> On Tue, Mar 19, 2013 at 5:51 PM, Sur, Sayantan <sayantan.sur at intel.com>
> wrote:
> >> > Just as an example: Suppose there is an MPI+OpenMP app that runs on 16
> >> > cores with 1 MPI rank and 16 threads. On a certain platform you find
> >> > that you get better network utilization with two endpoints. In this
> >> > case, can you not just run 2 MPI ranks with 8 threads each? How does
> >> > this not achieve the same effect as your endpoint proposal?
> >>
> >> Most apps run best with MPI only until they run out of memory.
> >
> > Yes, and folks that run out of memory (MPI only) would use threading to
> > reduce some of the memory consumption.
> >
> > Adding endpoints that behave like ranks would not help the memory case.
>
> This is not the point at all.  Let me just assert that there are apps
> that want to use 1 MPI process per node.  The MPI Forum should try to
> enable these users to max out their networks.  If one endpoint isn't
> enough, then telling users to use more MPI processes per node is as
> stupid a response as telling them to buy more DRAM.  The right
> solution is to enable better comm perf via endpoints, provided we can
> identify a path forward in that respect.
>
> >> Your argument can and often does lead back to MPI-only if applied
> >> inductively.
> >
> > Folks can always adjust the balance of MPI ranks-to-threads to get to a
> > point where adding more processes does not increase network-related
> > performance and achieves the memory balance that you mention above.
>
> The whole point is that there are apps that MUST run with 1 MPI
> process per node and therefore arguing about the procs-to-thread
> balance is completely useless.  Some of us recognize that load-store
> is a damn efficient way to communicate within a shared memory domain
> and have apps that use OpenMP, Pthreads, TBB, etc. for task- and/or
> data-parallelism within shared memory domains.  Are you suggesting
> that we try to create hybrid process-thread queues and annihilate our
> existing software to put everything in POSIX shm just to get more
> network concurrency within a node?
>
> >> It's really not an intellectually stimulating example to discuss.
> >>
> > I am happy to look at other concrete examples that show the benefit of
> > endpoints.
>
> MADNESS uses 1 MPI process per node and a TBB-like (we are moving to
> actual TBB right now) thread runtime.  We are memory limited in some
> cases.  We completely reject your solution of using >1 MPI process per
> node.  Threads are mostly autonomous and would benefit from endpoints
> so that they can issue remote futures with affinity to data, for
> example.
>
> I don't understand why you're fighting so hard about this.  Just take
> as fact that some apps need 1 MPI process per node and use threads
> within the node.  Given this definition of a problem, try to explore
> the solution space that enables maximum utilization of the
> interconnect, which might be multi-rail, multi-adapter, multi-link
> etc.  Multi-rail IB and Blue Gene/Q are good examples of existing tech
> where 1 MPI process per node might not be able to saturate the
> network.
>
> Jeff
>
> >> Jeff Hammond
> >> Argonne Leadership Computing Facility
> >> University of Chicago Computation Institute
> >> jhammond at alcf.anl.gov / (630) 252-5381
> >> http://www.linkedin.com/in/jeffhammond
> >> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> >>
> >> _______________________________________________
> >> Mpi3-hybridpm mailing list
> >> Mpi3-hybridpm at lists.mpi-forum.org
> >> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-hybridpm
>
>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond