[Mpi3-hybridpm] Endpoints Proposal
Bronis R. de Supinski
bronis at llnl.gov
Wed Mar 20 07:09:00 CDT 2013
If all of your threads are active, how do you expect
MPI to provide concurrency? What resources do you
expect it to use? Or do you expect concurrency to
manifest itself as arbitrary slowdown?
On Wed, 20 Mar 2013, Jeff Hammond wrote:
> Back to the original point, I'm not going to hand over threads to MPI.
> My application is going to use all of them. Tell me again how helper
> threads are solving my problem of concurrent communication?
>
> Jeff
>
> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>>
>> No, you are confusing absence of current use (which is
>> at best nebulous) with programmer intent. The point is
>> that the programmer is declaring the thread will not
>> be used for user-level code so system software (including
>> user-level middleware) can use the threads.
>>
>>
>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>
>>> Sorry, but you're confusing portable, standardizable solutions with
>>> what IBM can and should be doing in Blue Gene MPI. CNK knows every
>>> thread that exists and MPI can query that. Problem solved. Other
>>> vendors can do the same.
>>>
>>> Jeff
>>>
>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>> wrote:
>>>>
>>>>
>>>> Jeff:
>>>>
>>>> Sorry, you are incorrect about helper threads. The point
>>>> is to notify the MPI implementation that the threads are
>>>> not currently in use and will not be in use for some time.
>>>> No mechanism is currently available to do that in existing
>>>> threading implementations.
>>>>
>>>> Bronis
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>
>>>>> Hi Bronis,
>>>>>
>>>>>> Do you really need endpoints for this? Do you want to
>>>>>> send from multiple threads or do you want multiple
>>>>>> threads to participate in processing the messages?
>>>>>> Might it suffice instead to specify a set of
>>>>>> (nonblocking?) messages and let an underlying
>>>>>> implementation parallelize their processing?
>>>>>
>>>>>
>>>>>
>>>>> While it is not in the current proposal - which is to say, my comments
>>>>> that follow should not undermine the existing proposal - what I really
>>>>> need is lockless communication from multiple threads, which is exactly
>>>>> what PAMI endpoints provide already.
>>>>>
>>>>> It is certainly possible to post a bunch of nonblocking send/recv and
>>>>> let MPI parallelize in the waitall, but since I know exactly what this
>>>>> entails on Blue Gene/Q w.r.t. how the implementation funnels all those
>>>>> concurrent operations into shared state and then pulls them all back
>>>>> out into multiple comm threads, I know that what you're proposing here
>>>>> is nowhere near as efficient as it could be.
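>>>>>
>>>>> Concretely, this is roughly the pattern I understand you to be
>>>>> suggesting (just a sketch; NREQ, the partner rank, and the tags
>>>>> are invented for illustration):
>>>>>
>>>>>   #include <mpi.h>
>>>>>
>>>>>   #define NREQ 64
>>>>>
>>>>>   /* One thread posts everything, then hopes the implementation
>>>>>    * parallelizes completion inside Waitall. */
>>>>>   void exchange(MPI_Comm comm, int partner, int n,
>>>>>                 double *sendbuf, double *recvbuf)
>>>>>   {
>>>>>       MPI_Request req[2*NREQ];
>>>>>       for (int i = 0; i < NREQ; i++) {
>>>>>           MPI_Irecv(&recvbuf[i*n], n, MPI_DOUBLE, partner, i,
>>>>>                     comm, &req[2*i]);
>>>>>           MPI_Isend(&sendbuf[i*n], n, MPI_DOUBLE, partner, i,
>>>>>                     comm, &req[2*i+1]);
>>>>>       }
>>>>>       /* All of the concurrency is hidden behind this one call,
>>>>>        * which is exactly where the operations get funneled into
>>>>>        * shared state and pulled back apart. */
>>>>>       MPI_Waitall(2*NREQ, req, MPI_STATUSES_IGNORE);
>>>>>   }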
>>>>>
>>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>>> available, so the situation on any other platform is much, much worse.
>>>>> What happens if I want to do send/recv from 240 OpenMP threads on
>>>>> Intel MIC (let's ignore the PCI connection for discussion purposes)?
>>>>> What happens when all the message headers get pushed into a shared
>>>>> queue (that certainly can't be friendly to the memory hierarchy) and
>>>>> then we enter waitall? If you have good ideas on how to make this as
>>>>> efficient as the way it would happen with PAMI-style endpoints, please
>>>>> let me know.
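>>>>>
>>>>> For reference, the MIC scenario I have in mind looks something
>>>>> like this (a sketch; the 240 thread count, neighbor ranks, and
>>>>> per-thread tags are hypothetical):
>>>>>
>>>>>   #include <mpi.h>
>>>>>   #include <omp.h>
>>>>>
>>>>>   int main(int argc, char **argv)
>>>>>   {
>>>>>       int provided, rank, size;
>>>>>       MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>
>>>>>   #pragma omp parallel num_threads(240)
>>>>>       {
>>>>>           int tid = omp_get_thread_num();
>>>>>           double out = tid, in = -1.0;
>>>>>           /* Every thread calls into MPI at once, but all of the
>>>>>            * message headers land in the matching queue of one
>>>>>            * communicator inside the library. */
>>>>>           MPI_Sendrecv(&out, 1, MPI_DOUBLE, (rank+1)%size, tid,
>>>>>                        &in,  1, MPI_DOUBLE, (rank+size-1)%size, tid,
>>>>>                        MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>>>       }
>>>>>       MPI_Finalize();
>>>>>       return 0;
>>>>>   }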
>>>>>
>>>>> I have considered how it looks when each thread uses a different
>>>>> communicator and the MPI implementation can use a per-comm message
>>>>> queue. However, this precludes the possibility of inter-thread comm
>>>>> via MPI on those threads, which means I now have to completely reinvent
>>>>> the wheel if I want to do e.g. an allreduce within my threads. OpenMP
>>>>> might make this possible but I write primarily Pthread and TBB apps.
>>>>> I'd like to be able to have Pthreads = endpoints calling MPI
>>>>> collectives just like I can with processes.
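>>>>>
>>>>> The per-communicator workaround I mean is roughly the following
>>>>> (a sketch; the function name and thread count are made up), and
>>>>> it shows why the collective still does not help me across threads:
>>>>>
>>>>>   #include <stdlib.h>
>>>>>   #include <mpi.h>
>>>>>   #include <omp.h>
>>>>>
>>>>>   void per_thread_comms(int nthreads)
>>>>>   {
>>>>>       MPI_Comm *tcomm = malloc(nthreads * sizeof(MPI_Comm));
>>>>>
>>>>>       /* One dup per thread index, done serially so that rank i's
>>>>>        * j-th communicator pairs with rank k's j-th communicator. */
>>>>>       for (int t = 0; t < nthreads; t++)
>>>>>           MPI_Comm_dup(MPI_COMM_WORLD, &tcomm[t]);
>>>>>
>>>>>   #pragma omp parallel num_threads(nthreads)
>>>>>       {
>>>>>           int tid = omp_get_thread_num();
>>>>>           double local = tid, global;
>>>>>           /* The implementation can keep a matching queue per comm,
>>>>>            * but this still reduces across processes only -- every
>>>>>            * thread here is the same MPI rank -- so an allreduce
>>>>>            * across my own threads has to be hand-rolled. */
>>>>>           MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
>>>>>                         tcomm[tid]);
>>>>>       }
>>>>>
>>>>>       for (int t = 0; t < nthreads; t++)
>>>>>           MPI_Comm_free(&tcomm[t]);
>>>>>       free(tcomm);
>>>>>   }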
>>>>>
>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>> demonstrate the utility of lockless endpoints.
>>>>>
>>>>>> My point is that merely because you send at the user
>>>>>> level from multiple threads, you have no guarantee
>>>>>> that the implementation does not serialize those
>>>>>> messages using a single thread to do the processing.
>>>>>
>>>>>
>>>>>
>>>>> While I cannot guarantee that my implementation is good, if I have to
>>>>> use MPI_THREAD_MULTIPLE in its current form, I preclude the
>>>>> possibility that the implementation can do concurrency properly.
>>>>>
>>>>> What I'm trying to achieve is a situation where the implementation is
>>>>> not _required_ to serialize those messages, which is what has to
>>>>> happen today. Every machine besides BGQ serializes all the way down
>>>>> as far as I can tell.
>>>>>
>>>>>> I think what you really want is helper threads...
>>>>>
>>>>>
>>>>>
>>>>> Definitely not. I thought about that proposal for a long time since
>>>>> the motivation was clearly Blue Gene and I told IBM that they were
>>>>> fully empowered already to implement all the things they claimed they
>>>>> needed helper threads for. It just would have required them to talk
>>>>> to their kernel and/or OpenMP people. In fact, I think merely
>>>>> intercepting pthread_create and pthread_"destroy" calls would have
>>>>> been sufficient. I'm trying to write a paper on how to do everything
>>>>> in the helper threads proposal without any new MPI functions to
>>>>> demonstrate this point.
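>>>>>
>>>>> The kind of interception I have in mind is just the usual dlsym
>>>>> shim in an LD_PRELOAD library (a sketch; the live_threads counter
>>>>> and how it would be exposed to the MPI library are invented for
>>>>> illustration):
>>>>>
>>>>>   #define _GNU_SOURCE
>>>>>   #include <dlfcn.h>
>>>>>   #include <pthread.h>
>>>>>
>>>>>   typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
>>>>>                            void *(*)(void *), void *);
>>>>>
>>>>>   /* Count of user-level threads, starting with the initial one.
>>>>>    * An MPI library could read this to see what is in use. */
>>>>>   volatile int live_threads = 1;
>>>>>
>>>>>   int pthread_create(pthread_t *t, const pthread_attr_t *attr,
>>>>>                      void *(*start)(void *), void *arg)
>>>>>   {
>>>>>       static create_fn real_create = 0;
>>>>>       if (!real_create)
>>>>>           real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");
>>>>>       __sync_fetch_and_add(&live_threads, 1);
>>>>>       return real_create(t, attr, start, arg);
>>>>>   }
>>>>>
>>>>>   /* A matching wrapper around pthread_exit and/or pthread_join
>>>>>    * would decrement the counter when a thread goes away. */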
>>>>>
>>>>> Best,
>>>>>
>>>>> Jeff
>>>>>
>>>>> --
>>>>> Jeff Hammond
>>>>> Argonne Leadership Computing Facility
>>>>> University of Chicago Computation Institute
>>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>>> http://www.linkedin.com/in/jeffhammond
>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jeff Hammond
>>> Argonne Leadership Computing Facility
>>> University of Chicago Computation Institute
>>> jhammond at alcf.anl.gov / (630) 252-5381
>>> http://www.linkedin.com/in/jeffhammond
>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>
>>
>
>
>
> --
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>