[Mpi3-hybridpm] Endpoints Proposal

Jeff Hammond jhammond at alcf.anl.gov
Wed Mar 20 07:11:43 CDT 2013


Dude, look at PAMI endpoints.  They provide a lockless way for threads
to use their own private network resources.  This means that every
thread can use the network at the same time.  That's concurrency.

I cannot achieve this with MPI because MPI forces me to touch shared
state internally and therefore threads have to lock at some point (or
have magic multi-word atomic support).
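
To make the contrast concrete, here is a rough, untested sketch of the
kind of code I have in mind (the thread count and the per-thread ring
exchange are just placeholders).  Every thread drives its own,
completely independent traffic, yet all of it still passes through the
implementation's shared matching and queue state; with PAMI-style
endpoints each thread would instead drive its own private context.

/* Untested sketch: independent per-thread traffic that nonetheless
 * funnels through shared MPI internals. */
#include <mpi.h>
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    int tid = (int)(size_t)arg;
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sbuf = rank, rbuf = -1;
    MPI_Request reqs[2];
    /* Thread t of rank r exchanges only with thread t of its neighbors,
     * using the thread id as the tag. */
    MPI_Irecv(&rbuf, 1, MPI_INT, (rank + size - 1) % size, tid,
              MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sbuf, 1, MPI_INT, (rank + 1) % size, tid,
              MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    MPI_Finalize();
    return 0;
}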

Jeff

On Wed, Mar 20, 2013 at 7:09 AM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>
> If all of your threads are active, how do you expect to
> have MPI provide concurrency? What resources do you
> expect it to use? Or do you expect concurrency to
> manifest itself as arbitrary slowdown?
>
>
> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>
>> Back to the original point, I'm not going to hand over threads to MPI.
>> My application is going to use all of them.  Tell me again how helper
>> threads solve my problem of concurrent communication?
>>
>> Jeff
>>
>> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov>
>> wrote:
>>>
>>>
>>> No, you are confusing absence of current use (which is
>>> at best nebulous) with programmer intent. The point is
>>> that the programmer is declaring that the threads will not
>>> be used for user-level code, so system software (including
>>> user-level middleware) can use them.
>>>
>>>
>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>
>>>> Sorry, but you're confusing portable, standardizable solutions with
>>>> what IBM can and should be doing in Blue Gene MPI.  CNK knows every
>>>> thread that exists and MPI can query that.  Problem solved.  Other
>>>> vendors can do the same.
>>>>
>>>> Jeff
>>>>
>>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Jeff:
>>>>>
>>>>> Sorry, you are incorrect about helper threads. The point
>>>>> is to notify the MPI implementation that the threads are
>>>>> not currently in use and will not be in use for some time.
>>>>> No mechanism is currently available to do that in existing
>>>>> threading implementations.
>>>>>
>>>>> Bronis
>>>>>
>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>
>>>>>> Hi Bronis,
>>>>>>
>>>>>>> Do you really need endpoints for this? Do you want to
>>>>>>> send from multiple threads or do you want multiple
>>>>>>> threads to participate in processing the messages?
>>>>>>> Might it suffice to specify a set of (nonblocking?)
>>>>>>> messages and rely on an underlying implementation that
>>>>>>> parallelizes their processing?
>>>>>>
>>>>>> While it is not in the current proposal - which is to say, my comments
>>>>>> that follow should not undermine the existing proposal - what I really
>>>>>> need is lockless communication from multiple threads, which is exactly
>>>>>> what PAMI endpoints provide already.
>>>>>>
>>>>>> It is certainly possible to post a bunch of nonblocking send/recv and
>>>>>> let MPI parallelize in the waitall, but since I know exactly what this
>>>>>> entails on Blue Gene/Q w.r.t. how the implementation funnels all those
>>>>>> concurrent operations into shared state and then pulls them all back
>>>>>> out into multiple comm threads, I know that what you're proposing here
>>>>>> is nowhere near as efficient as it could be.
>>>>>>
>>>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>>>> available, so the situation on any other platform is much, much worse.
>>>>>> What happens if I want to do send/recv from 240 OpenMP threads on
>>>>>> Intel MIC (let's ignore the PCI connection for discussion purposes)?
>>>>>> What happens when all the message headers get pushed into a shared
>>>>>> queue (that certainly can't be friendly to the memory hierarchy) and
>>>>>> then we enter waitall?  If you have good ideas on how to make this as
>>>>>> efficient as the way it would happen with PAMI-style endpoints, please
>>>>>> let me know.
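>>>>>>
>>>>>> To be concrete, the pattern I am describing looks roughly like the
>>>>>> following untested sketch (the thread count and the ring exchange are
>>>>>> placeholders, and MPI is assumed to have been initialized with
>>>>>> MPI_THREAD_MULTIPLE).  Every header posted here lands in the same
>>>>>> shared queue before the waitall ever sees it:
>>>>>>
>>>>>> #include <mpi.h>
>>>>>> #include <omp.h>
>>>>>>
>>>>>> #define NTHREADS 240   /* e.g. one per hardware thread on MIC */
>>>>>>
>>>>>> void exchange(void)
>>>>>> {
>>>>>>     int rank, size;
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>
>>>>>>     MPI_Request reqs[2 * NTHREADS];
>>>>>>     int sbuf[NTHREADS], rbuf[NTHREADS];
>>>>>>     for (int i = 0; i < 2 * NTHREADS; i++)
>>>>>>         reqs[i] = MPI_REQUEST_NULL;
>>>>>>
>>>>>>     #pragma omp parallel num_threads(NTHREADS)
>>>>>>     {
>>>>>>         int t = omp_get_thread_num();
>>>>>>         sbuf[t] = rank;
>>>>>>         /* Each thread posts its own pair of operations, but the
>>>>>>          * message headers all land in one shared queue inside the
>>>>>>          * library. */
>>>>>>         MPI_Irecv(&rbuf[t], 1, MPI_INT, (rank + size - 1) % size, t,
>>>>>>                   MPI_COMM_WORLD, &reqs[2 * t]);
>>>>>>         MPI_Isend(&sbuf[t], 1, MPI_INT, (rank + 1) % size, t,
>>>>>>                   MPI_COMM_WORLD, &reqs[2 * t + 1]);
>>>>>>     }
>>>>>>
>>>>>>     /* One thread drains everything; nothing in this pattern requires
>>>>>>      * the implementation to complete these operations concurrently. */
>>>>>>     MPI_Waitall(2 * NTHREADS, reqs, MPI_STATUSES_IGNORE);
>>>>>> }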
>>>>>>
>>>>>> I have considered how it looks when each thread uses a different
>>>>>> communicator and the MPI implementation can use a per-comm message
>>>>>> queue.  However, this precludes the possibility of inter-thread comm
>>>>>> via MPI on those threads, which means I now have to completely reinvent
>>>>>> the wheel if I want to do e.g. an allreduce within my threads.  OpenMP
>>>>>> might make this possible but I write primarily Pthread and TBB apps.
>>>>>> I'd like to be able to have Pthreads = endpoints calling MPI
>>>>>> collectives just like I can with processes.
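>>>>>>
>>>>>> Roughly, that workaround looks like the following untested sketch
>>>>>> (the thread count is a placeholder and MPI_THREAD_MULTIPLE is
>>>>>> assumed).  Each thread gets its own duplicate of the communicator,
>>>>>> so the implementation could keep per-communicator queues, but every
>>>>>> duplicate still contains the same set of processes, so there is no
>>>>>> way to express a collective in which each thread is a separate
>>>>>> participant:
>>>>>>
>>>>>> #include <mpi.h>
>>>>>> #include <omp.h>
>>>>>>
>>>>>> #define NTHREADS 16
>>>>>>
>>>>>> void per_thread_comms(void)
>>>>>> {
>>>>>>     MPI_Comm tcomm[NTHREADS];
>>>>>>
>>>>>>     /* MPI_Comm_dup is collective, so do it before threads diverge. */
>>>>>>     for (int t = 0; t < NTHREADS; t++)
>>>>>>         MPI_Comm_dup(MPI_COMM_WORLD, &tcomm[t]);
>>>>>>
>>>>>>     #pragma omp parallel num_threads(NTHREADS)
>>>>>>     {
>>>>>>         int t = omp_get_thread_num();
>>>>>>         int rank;
>>>>>>         MPI_Comm_rank(tcomm[t], &rank);
>>>>>>
>>>>>>         double in = rank + t, out;
>>>>>>         /* This reduces across *processes* on tcomm[t]; the threads
>>>>>>          * within a process are still one rank, so there is no MPI
>>>>>>          * way to allreduce across the threads themselves. */
>>>>>>         MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, tcomm[t]);
>>>>>>     }
>>>>>>
>>>>>>     for (int t = 0; t < NTHREADS; t++)
>>>>>>         MPI_Comm_free(&tcomm[t]);
>>>>>> }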
>>>>>>
>>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>>> demonstrate the utility of lockless endpoints.
>>>>>>
>>>>>>> My point is that merely because you send at the user
>>>>>>> level from multiple threads, you have no guarantee
>>>>>>> that the implementation does not serialize those
>>>>>>> messages using a single thread to do the processing.
>>>>>>
>>>>>> While I cannot guarantee that my implementation is good, if I have to
>>>>>> use MPI_THREAD_MULTIPLE in its current form, I preclude the
>>>>>> possibility that the implementation can do concurrency properly.
>>>>>>
>>>>>> What I'm trying to achieve is a situation where the implementation is
>>>>>> not _required_ to serialize those messages, which is what has to
>>>>>> happen today.  Every machine besides BGQ serializes all the way down
>>>>>> as far as I can tell.
>>>>>>
>>>>>>> I think what you really want is helper threads...
>>>>>>
>>>>>> Definitely not.  I thought about that proposal for a long time since
>>>>>> the motivation was clearly Blue Gene, and I told IBM that they were
>>>>>> already fully empowered to implement all the things they claimed they
>>>>>> needed helper threads for.  It just would have required them to talk
>>>>>> to their kernel and/or OpenMP people.  In fact, I think merely
>>>>>> intercepting pthread_create and pthread_"destroy" calls would have
>>>>>> been sufficient.  To demonstrate this point, I'm trying to write a
>>>>>> paper on how to do everything in the helper threads proposal without
>>>>>> any new MPI functions.
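>>>>>>
>>>>>> As a sketch of the interception I mean (untested, and the
>>>>>> MPIX_Note_thread_created hook is hypothetical, not an existing
>>>>>> function), an LD_PRELOAD-style wrapper could tell the runtime about
>>>>>> every user thread without adding anything to the MPI standard:
>>>>>>
>>>>>> #define _GNU_SOURCE
>>>>>> #include <dlfcn.h>
>>>>>> #include <pthread.h>
>>>>>> #include <stdlib.h>
>>>>>>
>>>>>> /* Hypothetical hook into the MPI runtime, not a real MPI function. */
>>>>>> extern void MPIX_Note_thread_created(void);
>>>>>>
>>>>>> struct shim_arg {
>>>>>>     void *(*fn)(void *);
>>>>>>     void *arg;
>>>>>> };
>>>>>>
>>>>>> static void *shim(void *p)
>>>>>> {
>>>>>>     struct shim_arg a = *(struct shim_arg *)p;
>>>>>>     free(p);
>>>>>>     MPIX_Note_thread_created();  /* the runtime now knows this thread */
>>>>>>     return a.fn(a.arg);
>>>>>> }
>>>>>>
>>>>>> int pthread_create(pthread_t *tid, const pthread_attr_t *attr,
>>>>>>                    void *(*fn)(void *), void *arg)
>>>>>> {
>>>>>>     /* Lazy lookup of the real pthread_create; a production
>>>>>>      * interposer would guard this with pthread_once. */
>>>>>>     static int (*real_create)(pthread_t *, const pthread_attr_t *,
>>>>>>                               void *(*)(void *), void *);
>>>>>>     if (!real_create)
>>>>>>         real_create = (int (*)(pthread_t *, const pthread_attr_t *,
>>>>>>                                void *(*)(void *), void *))
>>>>>>                       dlsym(RTLD_NEXT, "pthread_create");
>>>>>>
>>>>>>     struct shim_arg *a = malloc(sizeof(*a));
>>>>>>     a->fn  = fn;
>>>>>>     a->arg = arg;
>>>>>>     return real_create(tid, attr, shim, a);
>>>>>> }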
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond


