[Mpi3-hybridpm] Endpoints Proposal

Bronis R. de Supinski bronis at llnl.gov
Wed Mar 20 03:25:43 CDT 2013


Jeff:

Do you really need endpoints for this? Do you want to
send from multiple threads or do you want multiple
threads to participate in processing the messages?
Might it suffice to specify a set of (nonblocking?)
messages and rely on an underlying implementation that
parallelizes their processing?
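
Concretely, I am thinking of something like the sketch
below (the function, counts, and peer list are made up,
purely illustrative): the application posts the whole set
of nonblocking sends from one thread, and the library is
then free to use its own threads to progress and complete
them.

  #include <mpi.h>

  #define NPEERS 4  /* placeholder number of messages */

  /* Sketch: one thread posts a whole set of nonblocking sends;
   * the MPI library is free to use its own (helper) threads to
   * progress and complete them in parallel. */
  void post_message_set(double *bufs[NPEERS], int counts[NPEERS],
                        int peers[NPEERS], int tag)
  {
      MPI_Request reqs[NPEERS];
      for (int i = 0; i < NPEERS; i++)
          MPI_Isend(bufs[i], counts[i], MPI_DOUBLE, peers[i], tag,
                    MPI_COMM_WORLD, &reqs[i]);
      /* Completion of the whole set, still from a single thread. */
      MPI_Waitall(NPEERS, reqs, MPI_STATUSES_IGNORE);
  }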

My point is that merely because you send at the user
level from multiple threads, you have no guarantee
that the implementation does not serialize those
messages using a single thread to do the processing.
I think what you really want is helper threads...
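
To make the serialization point concrete: the pattern below
is legal once you initialize with MPI_THREAD_MULTIPLE
(OpenMP is just an illustration here), yet nothing in the
standard prevents the library from pushing all of these
sends through a single internal lock and a single progress
path.

  #include <mpi.h>
  #include <omp.h>

  /* Each OpenMP thread issues its own blocking send.  This is
   * legal under MPI_THREAD_MULTIPLE, but the implementation may
   * still serialize the sends internally, so user-level thread
   * concurrency does not imply concurrent message processing. */
  void send_from_threads(const double *buf, int count, int dest)
  {
      #pragma omp parallel
      {
          int tag = omp_get_thread_num();  /* distinguish the messages */
          MPI_Send(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
      }
  }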

Bronis



On Tue, 19 Mar 2013, Jeff Hammond wrote:

> On Tue, Mar 19, 2013 at 5:51 PM, Sur, Sayantan <sayantan.sur at intel.com> wrote:
>>>> Just as an example: Suppose there is an MPI+OpenMP app that runs on 16
>>>> cores with 1 MPI rank and 16 threads. On a certain platform, you find that
>>>> with two endpoints you get better network utilization. In this case, can
>>>> you not just run 2 MPI ranks with 8 threads each? How does this not achieve
>>>> the same effect as your endpoint proposal?
>>>
>>> Most apps run best with MPI only until they run out of memory.
>>
>> Yes, and folks that run out of memory (MPI only) would use threading to reduce some of the memory consumption.
>>
>> Adding endpoints that behave like ranks would not help the memory case.
>
> This is not the point at all.  Let me just assert that there are apps
> that want to use 1 MPI process per node.  The MPI Forum should try to
> enable these users to max out their networks.  If one endpoint isn't
> enough, then telling users to use more MPI processes per node is as
> stupid a response as telling them to buy more DRAM.  The right
> solution is to enable better comm perf via endpoints, provided we can
> identify a path forward in that respect.
>
>>> Your
>>> argument can and often does lead back to MPI-only if applied inductively.
>>
>> Folks can always adjust the balance of MPI ranks-to-threads to get to a point where adding more processes does not increase network-related performance and achieves the memory balance that you mention above.
>
> The whole point is that there are apps that MUST run with 1 MPI
> process per node and therefore arguing about the procs-to-thread
> balance is completely useless.  Some of us recognize that load-store
> is a damn efficient way to communicate within a shared memory domain
> and have apps that use OpenMP, Pthreads, TBB, etc. for task- and/or
> data-parallelism within shared memory domains.  Are you suggesting
> that we try to create hybrid process-thread queues and annihilate our
> existing software to put everything in POSIX shm just to get more
> network concurrency within a node?
>
>>> It's really not an intellectually stimulating example to discuss.
>>>
>> I am happy to look at other concrete examples that show the benefit of endpoints.
>
> MADNESS uses 1 MPI process per node and a TBB-like (we are moving to
> actual TBB right now) thread runtime.  We are memory limited in some
> cases.  We completely reject your solution of using >1 MPI process per
> node.  Threads are mostly autonomous and would benefit from endpoints
> so that they can issue remote futures with affinity to data, for
> example.
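>
> To make that concrete, here is a rough sketch of what a thread could
> do, assuming the MPI_Comm_create_endpoints interface sketched in the
> proposal (not part of any MPI standard); the helper function and its
> arguments here are made up for illustration:
>
>   #include <stdlib.h>
>   #include <mpi.h>
>   #include <omp.h>
>
>   /* Sketch only: assumes the proposed MPI_Comm_create_endpoints,
>    * which returns one communicator handle per endpoint, each acting
>    * as a distinct rank in the new communicator. */
>   void spawn_remote_futures(int nthreads, int owner_ep_rank,
>                             const char *task, int task_len)
>   {
>       MPI_Comm *ep_comm = malloc(nthreads * sizeof(MPI_Comm));
>
>       /* Proposed (non-standard) call: one endpoint per thread. */
>       MPI_Comm_create_endpoints(MPI_COMM_WORLD, nthreads,
>                                 MPI_INFO_NULL, ep_comm);
>
>       #pragma omp parallel num_threads(nthreads)
>       {
>           int ep = omp_get_thread_num();
>           /* Each thread drives its own endpoint, so the library can
>            * map endpoints onto separate rails, links, or injection
>            * FIFOs instead of funneling everything through one rank. */
>           MPI_Send(task, task_len, MPI_CHAR, owner_ep_rank, 0,
>                    ep_comm[ep]);
>       }
>       /* Cleanup of the endpoint communicators is omitted here. */
>       free(ep_comm);
>   }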
>
> I don't understand why you're fighting so hard about this.  Just take
> as fact that some apps need 1 MPI process per node and use threads
> within the node.  Given this definition of a problem, try to explore
> the solution space that enables maximum utilization of the
> interconnect, which might be multi-rail, multi-adapter, multi-link
> etc.  Multi-rail IB and Blue Gene/Q are good examples of existing tech
> where 1 MPI process per node might not be able to saturate the
> network.
>
> Jeff
>
>
>
>
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
> _______________________________________________
> Mpi3-hybridpm mailing list
> Mpi3-hybridpm at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-hybridpm
>


