[Mpi3-hybridpm] Endpoints Proposal

Bronis R. de Supinski bronis at llnl.gov
Wed Mar 20 14:23:18 CDT 2013


Jeff:

Sorry, I am in Japan and do not currently have access
to that header file. If you want to send it to me
off list, I will look through it. Perhaps you can direct
me specifically to what you think makes it so that the PAMI
(or MPI) threads do not need CPU resources. I still
feel that you are missing my basic point, which has
nothing to do with "what IBM is telling" me. The
helper threads proposal is about communicating user-level
intent. No, you cannot derive that from existing
thread interfaces.

Bronis



On Wed, 20 Mar 2013, Jeff Hammond wrote:

>>> Dude, look at PAMI endpoints.  They provide a lockless way for threads
>>> to use their own private network resources.  This means that every
>>> thread can use the network at the same time.  That's concurrency.
>>
>> That is handing threads over to PAMI. What's the difference?
>
> Not in the slightest.  You are confusing what IBM has explained to you
> about how they implemented MPI on BGQ (in order to make SQMR work) with
> how PAMI actually works.  Please _actually_ look at the PAMI API and
> its semantics.  It's documented in
> /bgsys/drivers/ppcfloor/comm/sys/include/pami.h on your BGQ machines.
>
> My PAMI Pthreads code does not need to give any threads over to MPI
> and it can be lockless everywhere (except initialization, I suppose).
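>
> Concretely, the shape of that code is something like the following
> sketch.  (I am writing the call names from memory and the thread
> count is made up; see pami.h for the real signatures.)
>
>   #include <pami.h>
>
>   #define NUM_THREADS 4                 /* hypothetical thread count */
>
>   pami_client_t  client;
>   pami_context_t contexts[NUM_THREADS];
>
>   void init(void)
>   {
>       /* one client, one context per thread; each thread advances only
>          its own context, so the fast path takes no lock */
>       PAMI_Client_create("app", &client, NULL, 0);
>       PAMI_Context_createv(client, NULL, 0, contexts, NUM_THREADS);
>   }
>
>   void *worker(void *arg)
>   {
>       pami_context_t ctx = contexts[(size_t)arg];
>       /* ... post sends on ctx ... */
>       PAMI_Context_advance(ctx, 100);   /* progress, no shared state */
>       return NULL;
>   }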
>
> I'm writing MMPS in PAMI+OpenMP right now just to make my point.  It's
> not trivial so please be patient.  You should read the pami.h
> comments while you're waiting.
>
>>> I cannot achieve this with MPI because MPI forces me to touch shared
>>> state internally and therefore threads have to lock at some point (or
>>> have magic multi-word atomic support).
>>
>> That is a different question from the point I was making. You
>> may be right that endpoints are useful for your use case; I was
>> not arguing against that, I was asking about it. I am still not
>> fully convinced -- even PAMI has to touch shared state to
>> determine that you are using distinct endpoints, unless you are
>> saying they can only be associated with a single thread -- and
>> the proposed MPI endpoints do not address that problem. None of
>> this means helper threads would not also be useful.
>
> I wrote the following an hour ago:
>
> =================================
> Hi Bronis,
>
> (snip)
>
> While it is not in the current proposal - which is to say, my comments
> that follow should not undermine the existing proposal - what I really
> need is lockless communication from multiple threads, which is exactly
> what PAMI endpoints provide already.
> =================================
>
> Note the part where I say "While it is not in the current
> proposal...what I really need is lockless communication from multiple
> threads, which is exactly what PAMI endpoints provide already."
>
> Now that we're clear that I am talking about endpoints as a concept,
> not the current proposal, do I still need to explain that PAMI
> endpoints can but need not be associated with a single thread and that
> this is how one achieves lockless communication?
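>
> (To spell that out: a PAMI endpoint is addressed by a (task, context
> offset) pair, so it names a destination, not a thread.  A short
> sketch, with the call name from memory and made-up task/offset values:
>
>   pami_endpoint_t ep;
>   PAMI_Endpoint_create(client, /* task */ 7, /* offset */ 3, &ep);
>   /* any thread may now target ep through whatever context it owns */
>
> That is the entire trick.)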
>
> I hope that it wasn't too confusing that I discussed an idea that
> wasn't in the existing proposal.  However, Bill wants people to think
> generally about ways that MPI can be improved, not just provide small
> increments that benefit some people.  On this basis, I am proposing
> new ideas that would make the endpoint concept even more useful than
> it is now.
>
> Jeff
>
>>> On Wed, Mar 20, 2013 at 7:09 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>> wrote:
>>>>
>>>>
>>>> If all of your threads are active, how do you expect
>>>> MPI to provide concurrency? What resources do you
>>>> expect it to use? Or do you expect concurrency to
>>>> manifest itself as arbitrary slowdown?
>>>>
>>>>
>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>
>>>>> Back to the original point, I'm not going to hand over threads to MPI.
>>>>> My application is going to use all of them.  Tell me again how helper
>>>>> threads solve my problem of concurrent communication?
>>>>>
>>>>> Jeff
>>>>>
>>>>> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>>>> wrote:
>>>>>>
>>>>>> No, you are confusing absence of current use (which is
>>>>>> at best nebulous) with programmer intent. The point is
>>>>>> that the programmer is declaring that the threads will not
>>>>>> be used for user-level code, so system software (including
>>>>>> user-level middleware) can use them.
>>>>>>
>>>>>>
>>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>>
>>>>>>> Sorry, but you're confusing portable, standardizable solutions with
>>>>>>> what IBM can and should be doing in Blue Gene MPI.  CNK knows every
>>>>>>> thread that exists and MPI can query that.  Problem solved.  Other
>>>>>>> vendors can do the same.
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski
>>>>>>> <bronis at llnl.gov>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Jeff:
>>>>>>>>
>>>>>>>> Sorry, but you are incorrect about helper threads. The point
>>>>>>>> is to notify the MPI implementation that the threads are
>>>>>>>> not currently in use and will not be in use for some time.
>>>>>>>> No mechanism is currently available to do that in existing
>>>>>>>> threading implementations.
>>>>>>>>
>>>>>>>> Bronis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>>>>
>>>>>>>>> Hi Bronis,
>>>>>>>>>
>>>>>>>>>> Do you really need endpoints for this? Do you want to
>>>>>>>>>> send from multiple threads or do you want multiple
>>>>>>>>>> threads to participate in processing the messages?
>>>>>>>>>> Might a way to specify a set of (nonblocking?) messages,
>>>>>>>>>> with an underlying implementation that parallelizes their
>>>>>>>>>> processing, suffice?
>>>>>>>>>
>>>>>>>>> While it is not in the current proposal - which is to say, my
>>>>>>>>> comments that follow should not undermine the existing proposal -
>>>>>>>>> what I really need is lockless communication from multiple
>>>>>>>>> threads, which is exactly what PAMI endpoints provide already.
>>>>>>>>>
>>>>>>>>> It is certainly possible to post a bunch of nonblocking send/recv
>>>>>>>>> and let MPI parallelize in the waitall, but since I know exactly
>>>>>>>>> what this entails on Blue Gene/Q w.r.t. how the implementation
>>>>>>>>> funnels all those concurrent operations into shared state and then
>>>>>>>>> pulls them all back out into multiple comm threads, I know that
>>>>>>>>> what you're proposing here is nowhere near as efficient as it
>>>>>>>>> could be.
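>>>>>>>>>
>>>>>>>>> To be explicit, the pattern I mean is this sketch (it assumes
>>>>>>>>> MPI_THREAD_MULTIPLE; the thread count is made up):
>>>>>>>>>
>>>>>>>>>   #include <mpi.h>
>>>>>>>>>   #include <omp.h>
>>>>>>>>>
>>>>>>>>>   #define NT 16                  /* hypothetical thread count */
>>>>>>>>>
>>>>>>>>>   void exchange(double *buf, int count, int peer)
>>>>>>>>>   {
>>>>>>>>>       MPI_Request reqs[NT];
>>>>>>>>>   #pragma omp parallel num_threads(NT)
>>>>>>>>>       {
>>>>>>>>>           int t = omp_get_thread_num();
>>>>>>>>>           /* every Isend below funnels into shared internal
>>>>>>>>>              state for MPI_COMM_WORLD */
>>>>>>>>>           MPI_Isend(buf + t*count, count, MPI_DOUBLE, peer,
>>>>>>>>>                     t /* tag */, MPI_COMM_WORLD, &reqs[t]);
>>>>>>>>>       }
>>>>>>>>>       MPI_Waitall(NT, reqs, MPI_STATUSES_IGNORE);
>>>>>>>>>   }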
>>>>>>>>>
>>>>>>>>> And Blue Gene/Q has by far the best support for
>>>>>>>>> MPI_THREAD_MULTIPLE available, so the situation on any other
>>>>>>>>> platform is much, much worse.  What happens if I want to do
>>>>>>>>> send/recv from 240 OpenMP threads on Intel MIC (let's ignore the
>>>>>>>>> PCI connection for discussion purposes)?  What happens when all
>>>>>>>>> the message headers get pushed into a shared queue (that certainly
>>>>>>>>> can't be friendly to the memory hierarchy) and then we enter
>>>>>>>>> waitall?  If you have good ideas on how to make this as efficient
>>>>>>>>> as the way it would happen with PAMI-style endpoints, please let
>>>>>>>>> me know.
>>>>>>>>>
>>>>>>>>> I have considered how it looks when each thread uses a different
>>>>>>>>> communicator and the MPI implementation can use a per-comm message
>>>>>>>>> queue.  However, this precludes the possibility of inter-thread
>>>>>>>>> comm via MPI on those threads, which means I now have to
>>>>>>>>> completely reinvent the wheel if I want to do e.g. an allreduce
>>>>>>>>> within my threads.  OpenMP might make this possible but I write
>>>>>>>>> primarily Pthread and TBB apps.  I'd like to be able to have
>>>>>>>>> Pthreads = endpoints calling MPI collectives just like I can with
>>>>>>>>> processes.
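>>>>>>>>>
>>>>>>>>> For clarity, that workaround is just the following sketch
>>>>>>>>> (thread count made up):
>>>>>>>>>
>>>>>>>>>   #include <mpi.h>
>>>>>>>>>
>>>>>>>>>   #define NT 4                   /* hypothetical thread count */
>>>>>>>>>
>>>>>>>>>   MPI_Comm thread_comm[NT];
>>>>>>>>>
>>>>>>>>>   void setup(void)   /* called once, before threads start */
>>>>>>>>>   {
>>>>>>>>>       /* one dup per thread, so the implementation *could* keep
>>>>>>>>>          a private match queue per communicator */
>>>>>>>>>       for (int t = 0; t < NT; t++)
>>>>>>>>>           MPI_Comm_dup(MPI_COMM_WORLD, &thread_comm[t]);
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> ... but no collective can span my threads, because each
>>>>>>>>> thread_comm[t] still has exactly one rank per process.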
>>>>>>>>>
>>>>>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>>>>>> demonstrate the utility of lockless endpoints.
>>>>>>>>>
>>>>>>>>>> My point is that merely because you send at the user
>>>>>>>>>> level from multiple threads, you have no guarantee
>>>>>>>>>> that the implementation does not serialize those
>>>>>>>>>> messages using a single thread to do the processing.
>>>>>>>>>
>>>>>>>>> While I cannot guarantee that my implementation is good, if I
>>>>>>>>> have to use MPI_THREAD_MULTIPLE in its current form, I preclude
>>>>>>>>> the possibility that the implementation can do concurrency
>>>>>>>>> properly.
>>>>>>>>>
>>>>>>>>> What I'm trying to achieve is a situation where the implementation
>>>>>>>>> is not _required_ to serialize those messages, which is what has
>>>>>>>>> to happen today.  Every machine besides BGQ serializes all the way
>>>>>>>>> down as far as I can tell.
>>>>>>>>>
>>>>>>>>>> I think what you really want is helper threads...
>>>>>>>>>
>>>>>>>>> Definitely not.  I thought about that proposal for a long time
>>>>>>>>> since the motivation was clearly Blue Gene and I told IBM that
>>>>>>>>> they were fully empowered already to implement all the things they
>>>>>>>>> claimed they needed helper threads for.  It just would have
>>>>>>>>> required them to talk to their kernel and/or OpenMP people.  In
>>>>>>>>> fact, I think merely intercepting pthread_create and
>>>>>>>>> pthread_"destroy" calls would have been sufficient.  I'm trying to
>>>>>>>>> write a paper on how to do everything in the helper threads
>>>>>>>>> proposal without any new MPI functions to demonstrate this point.
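>>>>>>>>>
>>>>>>>>> The interception is the standard dlsym trick, roughly this
>>>>>>>>> sketch (the counter that MPI would read is of course
>>>>>>>>> hypothetical; link with -ldl or use LD_PRELOAD):
>>>>>>>>>
>>>>>>>>>   #define _GNU_SOURCE
>>>>>>>>>   #include <pthread.h>
>>>>>>>>>   #include <dlfcn.h>
>>>>>>>>>
>>>>>>>>>   typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
>>>>>>>>>                            void *(*)(void *), void *);
>>>>>>>>>
>>>>>>>>>   static int live_user_threads;   /* read by the MPI runtime */
>>>>>>>>>
>>>>>>>>>   int pthread_create(pthread_t *t, const pthread_attr_t *a,
>>>>>>>>>                      void *(*fn)(void *), void *arg)
>>>>>>>>>   {
>>>>>>>>>       static create_fn real;
>>>>>>>>>       if (!real)
>>>>>>>>>           real = (create_fn) dlsym(RTLD_NEXT, "pthread_create");
>>>>>>>>>       __sync_fetch_and_add(&live_user_threads, 1);
>>>>>>>>>       return real(t, a, fn, arg);
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>> (Exit is the messier half -- there is no pthread_destroy --
>>>>>>>>> so one would wrap the start routine to decrement the count.)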
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Jeff
>>>>>>>>>
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>


