[Mpi3-hybridpm] Endpoints Proposal

Bronis R. de Supinski bronis at llnl.gov
Wed Mar 20 07:19:57 CDT 2013


Jeff:

Re:
> Dude, look at PAMI endpoints.  They provide a lockless way for threads
> to use their own private network resources.  This means that every
> thread can use the network at the same time.  That's concurrency.

That is still handing your threads over -- to PAMI
instead of MPI. What's the difference?

> I cannot achieve this with MPI because MPI forces me to touch shared
> state internally and therefore threads have to lock at some point (or
> have magic multi-word atomic support).

That is a different question from my point. You may be
right that endpoints are useful for your use case; I
was not arguing against it, I was asking about it. I am
still not fully convinced -- even PAMI has to touch shared
state to determine that you are using distinct endpoints,
unless you are saying an endpoint can only be associated
with a single thread -- and the proposed MPI endpoints do
not address that problem. None of this means helper threads
would not also be useful.

Bronis


>
> Jeff
>
> On Wed, Mar 20, 2013 at 7:09 AM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>>
>> If all of your threads are active, how do you expect
>> MPI to provide concurrency? What resources do you
>> expect it to use? Or do you expect concurrency to
>> manifest itself as arbitrary slowdown?
>>
>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>
>>> Back to the original point, I'm not going to hand over threads to MPI.
>>> My application is going to use all of them.  Tell me again how helper
>>> threads are solving my problem of concurrent communication?
>>>
>>> Jeff
>>>
>>> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>> wrote:
>>>>
>>>> No, you are confusing absence of current use (which is
>>>> at best nebulous) with programmer intent. The point is
>>>> that the programmer is declaring the thread will not
>>>> be used for user-level code so system software (including
>>>> user-level middleware) can use the threads.
>>>>
>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>
>>>>> Sorry, but you're confusing portable, standardizable solutions with
>>>>> what IBM can and should be doing in Blue Gene MPI.  CNK knows every
>>>>> thread that exists and MPI can query that.  Problem solved.  Other
>>>>> vendors can do the same.
>>>>>
>>>>> Jeff
>>>>>
>>>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>>>> wrote:
>>>>>>
>>>>>> Jeff:
>>>>>>
>>>>>> Sorry, you are incorrect about helper threads. The point
>>>>>> is to notify the MPI implementation that the threads are
>>>>>> not currently in use and will not be in use for some time.
>>>>>> No mechanism is currently available to do that in existing
>>>>>> threading implementations.
>>>>>>
>>>>>> Bronis
>>>>>>
>>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>>
>>>>>>> Hi Bronis,
>>>>>>>
>>>>>>>> Do you really need endpoints for this? Do you want to
>>>>>>>> send from multiple threads or do you want multiple
>>>>>>>> threads to participate in processing the messages?
>>>>>>>> Might it suffice to specify a set of (nonblocking?)
>>>>>>>> messages and have an underlying implementation that
>>>>>>>> parallelizes their processing?
>>>>>>>
>>>>>>> While it is not in the current proposal -- which is to say, my comments
>>>>>>> that follow should not be read as undermining the existing proposal --
>>>>>>> what I really need is lockless communication from multiple threads,
>>>>>>> which is exactly what PAMI endpoints provide already.
>>>>>>>
>>>>>>> It is certainly possible to post a bunch of nonblocking send/recv and
>>>>>>> let MPI parallelize in the waitall, but since I know exactly what this
>>>>>>> entails on Blue Gene/Q w.r.t. how the implementation funnels all those
>>>>>>> concurrent operations into shared state and then pulls them all back
>>>>>>> out into multiple comm threads, I know that what you're proposing here
>>>>>>> is nowhere near as efficient as it could be.
>>>>>>>
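>>>>>>> To be concrete, the pattern in question looks something like
>>>>>>> this (illustrative MPI_THREAD_MULTIPLE code, nothing more):
>>>>>>>
>>>>>>>   #include <mpi.h>
>>>>>>>   /* called concurrently from many threads */
>>>>>>>   void exchange(void *sbuf, void *rbuf, int n, int peer,
>>>>>>>                 MPI_Comm comm)
>>>>>>>   {
>>>>>>>     MPI_Request req[2];
>>>>>>>     MPI_Isend(sbuf, n, MPI_DOUBLE, peer, 0, comm, &req[0]);
>>>>>>>     MPI_Irecv(rbuf, n, MPI_DOUBLE, peer, 0, comm, &req[1]);
>>>>>>>     /* every call above goes through shared queue state, and
>>>>>>>        the waitall drains that one queue no matter how many
>>>>>>>        threads entered it */
>>>>>>>     MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
>>>>>>>   }
>>>>>>>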
>>>>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>>>>> available, so the situation on any other platform is much, much worse.
>>>>>>> What happens if I want to do send/recv from 240 OpenMP threads on
>>>>>>> Intel MIC (let's ignore the PCI connection for discussion purposes)?
>>>>>>> What happens when all the message headers get pushed into a shared
>>>>>>> queue (that certainly can't be friendly to the memory hierarchy) and
>>>>>>> then we enter waitall?  If you have good ideas on how to make this as
>>>>>>> efficient as the way it would happen with PAMI-style endpoints, please
>>>>>>> let me know.
>>>>>>>
>>>>>>> I have considered how it looks when each thread uses a different
>>>>>>> communicator and the MPI implementation can use a per-comm message
>>>>>>> queue.  However, this precludes the possibility of inter-thread comm
>>>>>>> via MPI on those threads, which means I now have to completely reinvent
>>>>>>> the wheel if I want to do e.g. an allreduce within my threads.  OpenMP
>>>>>>> might make this possible but I write primarily Pthread and TBB apps.
>>>>>>> I'd like to be able to have Pthreads = endpoints calling MPI
>>>>>>> collectives just like I can with processes.
>>>>>>>
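>>>>>>> For reference, the per-communicator workaround is roughly this
>>>>>>> (a sketch only; error handling omitted):
>>>>>>>
>>>>>>>   #include <mpi.h>
>>>>>>>   #include <stdlib.h>
>>>>>>>   /* one dup'ed communicator per thread, created sequentially
>>>>>>>      on one thread; thread t later uses c[t] for point-to-point
>>>>>>>      and gets a private match queue, but c[t] still has exactly
>>>>>>>      one rank per process, so no collective can run among the
>>>>>>>      threads of a single process */
>>>>>>>   MPI_Comm *make_thread_comms(int nthreads)
>>>>>>>   {
>>>>>>>     MPI_Comm *c = malloc(nthreads * sizeof(MPI_Comm));
>>>>>>>     for (int t = 0; t < nthreads; t++)
>>>>>>>       MPI_Comm_dup(MPI_COMM_WORLD, &c[t]);
>>>>>>>     return c;
>>>>>>>   }
>>>>>>>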
>>>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>>>> demonstrate the utility of lockless endpoints.
>>>>>>>
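>>>>>>> In outline it will look something like this (untested, written
>>>>>>> from memory of pami.h, so the exact signatures may differ):
>>>>>>>
>>>>>>>   #include <pami.h>
>>>>>>>   #include <omp.h>
>>>>>>>   int main(void)
>>>>>>>   {
>>>>>>>     pami_client_t client;
>>>>>>>     PAMI_Client_create("EP_DEMO", &client, NULL, 0);
>>>>>>>     int nthreads = omp_get_max_threads();
>>>>>>>     pami_context_t ctx[nthreads];
>>>>>>>     /* one context per thread = one private injection and
>>>>>>>        reception path per thread */
>>>>>>>     PAMI_Context_createv(client, NULL, 0, ctx, nthreads);
>>>>>>>     #pragma omp parallel
>>>>>>>     {
>>>>>>>       pami_context_t mine = ctx[omp_get_thread_num()];
>>>>>>>       /* sends/recvs get posted against "mine"; progress is
>>>>>>>          private to this thread, so no locks on the fast path */
>>>>>>>       PAMI_Context_advance(mine, 100);
>>>>>>>     }
>>>>>>>     PAMI_Client_destroy(&client);
>>>>>>>     return 0;
>>>>>>>   }
>>>>>>>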
>>>>>>>> My point is that merely because you send at the user
>>>>>>>> level from multiple threads, you have no guarantee
>>>>>>>> that the implementation does not serialize those
>>>>>>>> messages using a single thread to do the processing.
>>>>>>>
>>>>>>> While I cannot guarantee that my implementation is good, if I have to
>>>>>>> use MPI_THREAD_MULTIPLE in its current form, I preclude the
>>>>>>> possibility that the implementation can do concurrency properly.
>>>>>>>
>>>>>>> What I'm trying to achieve is a situation where the implementation is
>>>>>>> not _required_ to serialize those messages, which is what has to
>>>>>>> happen today.  Every machine besides BGQ serializes all the way down
>>>>>>> as far as I can tell.
>>>>>>>
>>>>>>>> I think what you really want is helper threads...
>>>>>>>
>>>>>>> Definitely not.  I thought about that proposal for a long time since
>>>>>>> the motivation was clearly Blue Gene and I told IBM that they were
>>>>>>> fully empowered already to implement all the things they claimed they
>>>>>>> needed helper threads for.  It just would have required them to talk
>>>>>>> to their kernel and/or OpenMP people.  In fact, I think merely
>>>>>>> intercepting pthread_create and pthread_"destroy" calls would have
>>>>>>> been sufficient.  I'm trying to write a paper on how to do everything
>>>>>>> in the helper threads proposal without any new MPI functions to
>>>>>>> demonstrate this point.
>>>>>>>
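>>>>>>> The interception itself is the usual LD_PRELOAD trick, e.g.
>>>>>>> (sketch only; mpix_note_thread() is a made-up hook standing in
>>>>>>> for whatever the MPI runtime exposes internally):
>>>>>>>
>>>>>>>   #define _GNU_SOURCE
>>>>>>>   #include <dlfcn.h>
>>>>>>>   #include <pthread.h>
>>>>>>>   extern void mpix_note_thread(int delta);  /* hypothetical */
>>>>>>>
>>>>>>>   int pthread_create(pthread_t *t, const pthread_attr_t *a,
>>>>>>>                      void *(*fn)(void *), void *arg)
>>>>>>>   {
>>>>>>>     static int (*real)(pthread_t *, const pthread_attr_t *,
>>>>>>>                        void *(*)(void *), void *);
>>>>>>>     if (!real)  /* find the libc version we are shadowing */
>>>>>>>       real = (int (*)(pthread_t *, const pthread_attr_t *,
>>>>>>>                       void *(*)(void *), void *))
>>>>>>>                dlsym(RTLD_NEXT, "pthread_create");
>>>>>>>     mpix_note_thread(+1);  /* tell MPI a new app thread exists */
>>>>>>>     return real(t, a, fn, arg);
>>>>>>>   }
>>>>>>>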
>>>>>>> Best,
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>


