[Mpi3-hybridpm] Endpoints Proposal

Jeff Hammond jhammond at alcf.anl.gov
Wed Mar 20 07:29:40 CDT 2013


>> Dude, look at PAMI endpoints.  They provide a lockless way for threads
>> to use their own private network resources.  This means that every
>> thread can use the network at the same time.  That's concurrency.
>
> That is handing threads over to PAMI. What's the difference?

It is not, in the slightest.  You are confusing what IBM has explained to you
about how they implemented MPI on BGQ in order to make SQMR work with
how PAMI actually works.  Please _actually_ look at the PAMI API and
its semantics.  It's documented in
/bgsys/drivers/ppcfloor/comm/sys/include/pami.h on your BGQ machines.

My PAMI Pthreads code does not need to give any threads over to MPI
and it can be lockless everywhere (except initialization, I suppose).

I'm writing MMPS in PAMI+OpenMP right now just to make my point.  It's
not trivial, so please be patient.  You should read the pami.h
comments while you're waiting.
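
In the meantime, here is a minimal sketch of the pattern I mean, written
against my reading of pami.h (error checking, dispatch registration and
the actual sends are omitted; the client name "LOCKLESS" and NUM_THREADS
are just placeholders): each thread owns one pami_context_t and advances
it without ever taking a lock.

    #include <pami.h>
    #include <pthread.h>

    #define NUM_THREADS 4                 /* placeholder thread count */

    static pami_client_t  client;
    static pami_context_t contexts[NUM_THREADS]; /* one context per thread */

    /* Each thread drives only its own context, so no locking is needed. */
    static void * progress(void * arg)
    {
        size_t me = (size_t)arg;
        /* In a real code this loops until the thread's communication is
         * done; sends/recvs posted on contexts[me] are progressed here,
         * and no other thread ever touches this context. */
        for (int i = 0; i < 1000; i++)
            PAMI_Context_advance(contexts[me], 100);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];

        PAMI_Client_create("LOCKLESS", &client, NULL, 0);
        /* One context per thread, created in a single call. */
        PAMI_Context_createv(client, NULL, 0, contexts, NUM_THREADS);

        for (size_t i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, progress, (void *)i);
        for (size_t i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);

        PAMI_Context_destroyv(contexts, NUM_THREADS);
        PAMI_Client_destroy(&client);
        return 0;
    }

The point is that nothing in the per-thread path above is shared, which
is exactly what MPI_THREAD_MULTIPLE cannot give me today.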

>> I cannot achieve this with MPI because MPI forces me to touch shared
>> state internally and therefore threads have to lock at some point (or
>> have magic multi-word atomic support).
>
> That is a different question from my point. You may be
> right that endpoints are useful for your use case, I
> was not arguing it; I was asking about it. I am still
> not fully convinced -- even PAMI has to touch shared
> state to determine that you are using distinct endpoints
> unless you are saying they can only be associated with
> a single thread -- the proposed MPI endpoints do not
> address that problem. None of this means helper threads
> would not also be useful.

I wrote the following an hour ago:

=================================
Hi Bronis,

(snip)

While it is not in the current proposal - which is to say, my comments
that follow should not undermine the existing proposal - what I really
need is lockless communication from multiple threads, which is exactly
what PAMI endpoints provide already.
=================================

Note the part where I say "While it is not in the current
proposal...what I really need is lockless communication from multiple
threads, which is exactly what PAMI endpoints provide already."

Now that we're clear that I am talking about endpoints as a concept,
not the current proposal, do I still need to explain that PAMI
endpoints can, but need not, be associated with a single thread, and
that this is how one achieves lockless communication?
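
Just in case: the association is purely one of addressing.  As I read
pami.h, an endpoint is simply (task, context offset), and any thread can
construct and target one; the hypothetical helper below is only meant to
illustrate that and is not part of any proposal.

    /* An endpoint names context 'offset' on task 'task'; it is not bound
     * to any particular thread, and any thread may construct and target
     * one. */
    static pami_endpoint_t name_endpoint(pami_client_t client,
                                         pami_task_t task, size_t offset)
    {
        pami_endpoint_t ep;
        PAMI_Endpoint_create(client, task, offset, &ep);
        return ep;
    }

The lockless part comes from the local side: a send targeting that
endpoint is posted on, and progressed by, the calling thread's own
context, as in the sketch earlier in this mail.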

I hope that it wasn't too confusing that I discussed an idea that
wasn't in the existing proposal.  However, Bill wants people to think
generally about ways that MPI can be improved, not just provide small
increments that benefit some people.  On this basis, I am proposing
new ideas that would make the endpoint concept even more useful than
it is now.

Jeff

>> On Wed, Mar 20, 2013 at 7:09 AM, Bronis R. de Supinski <bronis at llnl.gov>
>> wrote:
>>>
>>>
>>> If all of your threads are active, how do you expect to
>>> have MPI provide concurrency? What resources do you
>>> expect it to use? Or do you expect concurrency to
>>> manifest itself as arbitrary slowdown?
>>>
>>>
>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>
>>>> Back to the original point, I'm not going to hand over threads to MPI.
>>>> My application is going to use all of them.  Tell me again how helper
>>>> threads are solving my problem of concurrent communication?
>>>>
>>>> Jeff
>>>>
>>>> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> No, you are confusing absence of current use (which is
>>>>> at best nebulous) with programmer intent. The point is
>>>>> that the programmer is declaring that the thread will not
>>>>> be used for user-level code, so system software (including
>>>>> user-level middleware) can use the threads.
>>>>>
>>>>>
>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>
>>>>>> Sorry, but you're confusing portable, standardizable solutions with
>>>>>> what IBM can and should be doing in Blue Gene MPI.  CNK knows every
>>>>>> thread that exists and MPI can query that.  Problem solved.  Other
>>>>>> vendors can do the same.
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski
>>>>>> <bronis at llnl.gov>
>>>>>> wrote:
>>>>>>>
>>>>>>> Jeff:
>>>>>>>
>>>>>>> Sorry, you are incorrect about helper threads. The point
>>>>>>> is to notify the MPI implementation that the threads are
>>>>>>> not currently in use and will not be in use for some time.
>>>>>>> No mechanism is currently available to do that in existing
>>>>>>> threading implementations.
>>>>>>>
>>>>>>> Bronis
>>>>>>>
>>>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>>>
>>>>>>>> Hi Bronis,
>>>>>>>>
>>>>>>>>> Do you really need endpoints for this? Do you want to
>>>>>>>>> send from multiple threads or do you want multiple
>>>>>>>>> threads to participate in processing the messages?
>>>>>>>>> Might it suffice to specify a set of (nonblocking?) messages
>>>>>>>>> and have an underlying implementation parallelize their
>>>>>>>>> processing?
>>>>>>>>
>>>>>>>> While it is not in the current proposal - which is to say, my
>>>>>>>> comments that follow should not undermine the existing proposal -
>>>>>>>> what I really need is lockless communication from multiple threads,
>>>>>>>> which is exactly what PAMI endpoints provide already.
>>>>>>>>
>>>>>>>> It is certainly possible to post a bunch of nonblocking send/recv
>>>>>>>> and let MPI parallelize in the waitall, but since I know exactly
>>>>>>>> what this entails on Blue Gene/Q w.r.t. how the implementation
>>>>>>>> funnels all those concurrent operations into shared state and then
>>>>>>>> pulls them all back out into multiple comm threads, I know that
>>>>>>>> what you're proposing here is nowhere near as efficient as it
>>>>>>>> could be.
>>>>>>>>
>>>>>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>>>>>> available, so the situation on any other platform is much, much
>>>>>>>> worse.  What happens if I want to do send/recv from 240 OpenMP
>>>>>>>> threads on Intel MIC (let's ignore the PCI connection for
>>>>>>>> discussion purposes)?  What happens when all the message headers
>>>>>>>> get pushed into a shared queue (that certainly can't be friendly to
>>>>>>>> the memory hierarchy) and then we enter waitall?  If you have good
>>>>>>>> ideas on how to make this as efficient as the way it would happen
>>>>>>>> with PAMI-style endpoints, please let me know.
>>>>>>>>
>>>>>>>> I have considered how it looks when each thread uses a different
>>>>>>>> communicator and the MPI implementation can use a per-comm message
>>>>>>>> queue.  However, this precludes the possibility of inter-thread
>>>>>>>> comm via MPI on those threads, which means I now have to completely
>>>>>>>> reinvent the wheel if I want to do e.g. an allreduce within my
>>>>>>>> threads.  OpenMP might make this possible, but I write primarily
>>>>>>>> Pthread and TBB apps.  I'd like to be able to have Pthreads =
>>>>>>>> endpoints calling MPI collectives just like I can with processes.
>>>>>>>>
>>>>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>>>>> demonstrate the utility of lockless endpoints.
>>>>>>>>
>>>>>>>>> My point is that merely because you send at the user
>>>>>>>>> level from multiple threads, you have no guarantee
>>>>>>>>> that the implementation does not serialize those
>>>>>>>>> messages using a single thread to do the processing.
>>>>>>>>
>>>>>>>> While I cannot guarantee that my implementation is good, if I have
>>>>>>>> to use MPI_THREAD_MULTIPLE in its current form, I preclude the
>>>>>>>> possibility that the implementation can do concurrency properly.
>>>>>>>>
>>>>>>>> What I'm trying to achieve is a situation where the implementation
>>>>>>>> is not _required_ to serialize those messages, which is what has to
>>>>>>>> happen today.  Every machine besides BGQ serializes all the way
>>>>>>>> down as far as I can tell.
>>>>>>>>
>>>>>>>>> I think what you really want is helper threads...
>>>>>>>>
>>>>>>>> Definitely not.  I thought about that proposal for a long time,
>>>>>>>> since the motivation was clearly Blue Gene, and I told IBM that
>>>>>>>> they were already fully empowered to implement all the things they
>>>>>>>> claimed they needed helper threads for.  It just would have
>>>>>>>> required them to talk to their kernel and/or OpenMP people.  In
>>>>>>>> fact, I think merely intercepting pthread_create and
>>>>>>>> pthread_"destroy" calls would have been sufficient.  I'm trying to
>>>>>>>> write a paper on how to do everything in the helper threads
>>>>>>>> proposal without any new MPI functions to demonstrate this point.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jeff
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jeff Hammond
>>>>>>>> Argonne Leadership Computing Facility
>>>>>>>> University of Chicago Computation Institute
>>>>>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>>>>>> http://www.linkedin.com/in/jeffhammond
>>>>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jeff Hammond
>>>>>> Argonne Leadership Computing Facility
>>>>>> University of Chicago Computation Institute
>>>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>>>> http://www.linkedin.com/in/jeffhammond
>>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>
>>>
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


