[Mpi3-hybridpm] Endpoints Proposal

Jeff Hammond jhammond at alcf.anl.gov
Tue Mar 19 19:03:39 CDT 2013


On Tue, Mar 19, 2013 at 5:51 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:
>> > Just as an example: suppose there is an MPI+OpenMP app that runs on 16
>> > cores with 1 MPI rank and 16 threads. On a certain platform you find
>> > that with two endpoints you get better network utilization. In this
>> > case, can you not just run 2 MPI ranks with 8 threads each? How does
>> > this not achieve the same effect as your endpoints proposal?
>>
>> Most apps run best with MPI only until they run out of memory.
>
> Yes, and folks that run out of memory (MPI only) would use threading to reduce some of the memory consumption.
>
> Adding endpoints that behave like ranks would not help the memory case.

This is not the point at all.  Let me just assert that there are apps
that want to use 1 MPI process per node.  The MPI Forum should try to
enable these users to max out their networks.  If one endpoint isn't
enough, then telling users to use more MPI processes per node is as
stupid a response as telling them to buy more DRAM.  The right
solution is to enable better communication performance via endpoints,
provided we can identify a workable path forward.
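To make this concrete, here is roughly what the endpoints proposal
enables: a process asks for N endpoints, each endpoint becomes a
distinct rank in the resulting communicator, and each thread drives
its own.  (Sketch only: the MPI_Comm_create_endpoints signature is my
approximation of the current draft and will not compile against any
released MPI.)

    #include <mpi.h>
    #include <omp.h>

    #define NUM_EP 4

    int main(void)
    {
        int provided;
        MPI_Comm ep_comm[NUM_EP];

        MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);

        /* Proposed call (signature approximate): collective over the
         * parent communicator; each of the NUM_EP handles returned is
         * a distinct rank in the new endpoints communicator. */
        MPI_Comm_create_endpoints(MPI_COMM_WORLD, NUM_EP, MPI_INFO_NULL,
                                  ep_comm);

    #pragma omp parallel num_threads(NUM_EP)
        {
            int tid = omp_get_thread_num();
            int ep_rank;

            /* Each thread attaches to its own endpoint, so the
             * implementation can give it private network state. */
            MPI_Comm_rank(ep_comm[tid], &ep_rank);
            MPI_Barrier(ep_comm[tid]);
            MPI_Comm_free(&ep_comm[tid]);
        }

        MPI_Finalize();
        return 0;
    }

One process per node, but NUM_EP ranks' worth of network concurrency.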

>> Your
>> argument can and often does lead back to MPI-only if applied inductively.
>
> Folks can always adjust the balance of MPI ranks-to-threads to get to a point where adding more processes does not increase network-related performance and achieves the memory balance that you mention above.

The whole point is that there are apps that MUST run with 1 MPI
process per node, so arguing about the process-to-thread balance is
completely useless.  Some of us recognize that load-store is a damn
efficient way to communicate within a shared-memory domain and have
apps that use OpenMP, Pthreads, TBB, etc. for task- and/or
data-parallelism within shared-memory domains.  Are you suggesting
that we create hybrid process-thread queues and gut our existing
software to put everything in POSIX shm just to get more network
concurrency within a node?
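For contrast, the status quo looks like the following: perfectly legal
MPI_THREAD_MULTIPLE code, but every thread's traffic is multiplexed
through the one rank's endpoint.  (Minimal sketch, assuming two
single-process nodes and four threads per process.)

    #include <mpi.h>
    #include <omp.h>

    int main(void)
    {
        int provided, rank;

        MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel num_threads(4)
        {
            int buf = omp_get_thread_num();
            int tag = buf;  /* per-thread tag so matching is deterministic */

            /* Legal, but all four threads share the single rank's
             * connection state: the network sees one endpoint. */
            if (rank == 0)
                MPI_Send(&buf, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&buf, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

No amount of tuning inside that parallel region changes the fact that
the implementation has one rank, and typically one set of network
resources, to work with.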

>> It's really not an intellectually stimulating example to discuss.
>>
> I am happy to look at other concrete examples that show the benefit of endpoints.

MADNESS uses 1 MPI process per node and a TBB-like thread runtime (we
are moving to actual TBB right now).  We are memory-limited in some
cases.  We completely reject your solution of using >1 MPI process per
node.  Threads are mostly autonomous and would benefit from endpoints
so that they can issue remote futures with affinity to data, for
example.

I don't understand why you're fighting so hard about this.  Just take
it as fact that some apps need 1 MPI process per node and use threads
within the node.  Given that definition of the problem, try to explore
the solution space that enables maximum utilization of the
interconnect, which might be multi-rail, multi-adapter, multi-link,
etc.  Multi-rail InfiniBand and Blue Gene/Q are good examples of
existing technology where 1 MPI process per node might not be able to
saturate the network.

Jeff




-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


