[Mpi3-hybridpm] helper threads (forked from "Endpoints Proposal")

Bronis R. de Supinski bronis at llnl.gov
Wed Mar 20 14:26:51 CDT 2013


Jeff:

I agree that those bullet points are inaccurate. However,
they are not why I am advocating the approach.

Yes, you can trivially determine in the OpenMP runtime if
a thread is not currently involved in running user code.
However, you cannot determine that the next cycle(s) will
not lead to a thread becoming involved. The point is to
give the user a mechanism to declare that intent to MPI.
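
For illustration, here is a minimal sketch of what such a mechanism
could look like from user code. The MPIX_* names are purely
hypothetical placeholders, not functions from the helper threads
proposal or from any MPI implementation:

    #include <mpi.h>
    #include <omp.h>

    /* Hypothetical declarations: the application promises that its idle
     * worker threads will not run user code until reclaimed, so the MPI
     * library may use them for message progress. */
    int MPIX_Lend_threads(int nthreads);   /* hypothetical */
    int MPIX_Reclaim_threads(void);        /* hypothetical */

    void compute_then_exchange(double *buf, int n)
    {
        #pragma omp parallel
        {
            /* user-level computation on all threads */
        }

        /* Only the programmer knows the workers will stay idle for the
         * whole exchange; this call states that intent explicitly. */
        MPIX_Lend_threads(omp_get_max_threads() - 1);
        MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        MPIX_Reclaim_threads();
    }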

Bronis


On Wed, 20 Mar 2013, Jeff Hammond wrote:

> Let's assume for the sake of argument that OpenMP is the only
> threading model (we can generalize later)...
>
> Can you explain why an MPI implementation cannot use OpenMP internally
> and let the existing mechanisms within OpenMP runtimes for not
> oversubscribing take care of things?
>
> I looked back at
> http://meetings.mpi-forum.org/secretary/2010/06/slides/mpi3_helperthreads.pdf
> and see two fundamental errors in the assumptions made, which is why I
> view this proposal with skepticism.
>
> "But the MPI implementation cannot spawn its own threads" - False.
> Blue Gene/Q MPI spawns threads.
> "Difficult to identify whether the application threads are “active” or
> not" - False.  The operating system obviously knows whether threads
> are active or not.  The motivating architecture for endpoints was an
> obvious case where MPI-OS interactions could solve this trivially.
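>
> For illustration only, a minimal Linux-only sketch of that kind of
> query (this is not what CNK actually exposes; it just shows that the
> information already exists on the OS side):
>
>     #include <dirent.h>
>     #include <stdio.h>
>     #include <string.h>
>
>     /* Count the calling process's threads that the kernel currently
>      * reports as running ('R' in /proc/self/task/<tid>/stat). */
>     static int count_running_threads(void)
>     {
>         DIR *d = opendir("/proc/self/task");
>         struct dirent *e;
>         int running = 0;
>         if (!d)
>             return -1;
>         while ((e = readdir(d)) != NULL) {
>             char path[64], line[256];
>             FILE *f;
>             if (e->d_name[0] == '.')
>                 continue;
>             snprintf(path, sizeof path, "/proc/self/task/%s/stat",
>                      e->d_name);
>             if ((f = fopen(path, "r")) == NULL)
>                 continue;
>             if (fgets(line, sizeof line, f)) {
>                 /* the state letter follows the "tid (comm)" prefix */
>                 char *p = strrchr(line, ')');
>                 if (p && p[1] == ' ' && p[2] == 'R')
>                     running++;
>             }
>             fclose(f);
>         }
>         closedir(d);
>         return running;
>     }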
>
> I am certainly not an OpenMP expert like you are, but my limited
> understanding of the standard suggests that OpenMP can manage its
> thread pool in such a way that it is safe for MPI to use OpenMP
> internally.  In the worst case, MPI ends up where MKL, etc. are:
> forced to use a single thread when it is unsafe to do otherwise.
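>
> A sketch of the kind of internal fallback this implies (illustrative
> only; progress_one_message() stands in for whatever per-message work
> the implementation really does):
>
>     #include <omp.h>
>
>     void progress_one_message(int i);   /* stand-in for real work */
>
>     /* Inside the MPI library: parallelize message processing with
>      * OpenMP, but drop to a single thread when the application is
>      * already inside a parallel region, the same defensive fallback
>      * the MKL case above relies on. */
>     static void progress_all(int nmsgs)
>     {
>         int nthreads = omp_in_parallel() ? 1 : omp_get_max_threads();
>         #pragma omp parallel for num_threads(nthreads) schedule(dynamic)
>         for (int i = 0; i < nmsgs; i++)
>             progress_one_message(i);
>     }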
>
> Jeff
>
> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>>
>> No, you are confusing absence of current use (which is
>> at best nebulous) with programmer intent. The point is
>> that the programmer is declaring the threads will not
>> be used for user-level code, so system software (including
>> user-level middleware) can use them.
>>
>>
>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>
>>> Sorry, but you're confusing portable, standardizable solutions with
>>> what IBM can and should be doing in Blue Gene MPI.  CNK knows every
>>> thread that exists and MPI can query that.  Problem solved.  Other
>>> vendors can do the same.
>>>
>>> Jeff
>>>
>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>> wrote:
>>>>
>>>>
>>>> Jeff:
>>>>
>>>> Sorry, you are incorrect about helper threads. The point
>>>> is to notify the MPI implementation that the threads are
>>>> not currently in use and will not be in use for some time.
>>>> No mechanism is currently available to do that in existing
>>>> threading implementations.
>>>>
>>>> Bronis
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>
>>>>> Hi Bronis,
>>>>>
>>>>>> Do you really need endpoints for this? Do you want to
>>>>>> send from multiple threads or do you want multiple
>>>>>> threads to participate in processing the messages?
>>>>>> Might a better way to specify a set of (nonblocking?)
>>>>>> messages and an underlying implementation that parallelizes
>>>>>> their processing suffice?
>>>>>
>>>>>
>>>>>
>>>>> While it is not in the current proposal - which is to say, my comments
>>>>> that follow should not undermine the existing proposal - what I really
>>>>> need is lockless communication from multiple threads, which is exactly
>>>>> what PAMI endpoints provide already.
>>>>>
>>>>> It is certainly possible to post a bunch of nonblocking send/recv and
>>>>> let MPI parallelize in the waitall, but since I know exactly what this
>>>>> entails on Blue Gene/Q w.r.t. how the implementation funnels all those
>>>>> concurrent operations into shared state and then pulls them all back
>>>>> out into multiple comm threads, I know that what you're proposing here
>>>>> is nowhere near as efficient as it could be.
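>>>>>
>>>>> To make the pattern concrete (sketch only, assuming the library was
>>>>> initialized with MPI_THREAD_MULTIPLE):
>>>>>
>>>>>     #include <mpi.h>
>>>>>     #include <omp.h>
>>>>>
>>>>>     /* Every thread posts its own nonblocking send, but all the
>>>>>      * requests land in one shared array and completion is funneled
>>>>>      * through a single MPI_Waitall. */
>>>>>     void exchange(double *sendbuf[], int count, int peer, int nthreads)
>>>>>     {
>>>>>         MPI_Request reqs[nthreads];
>>>>>         #pragma omp parallel num_threads(nthreads)
>>>>>         {
>>>>>             int t = omp_get_thread_num();
>>>>>             MPI_Isend(sendbuf[t], count, MPI_DOUBLE, peer, /* tag */ t,
>>>>>                       MPI_COMM_WORLD, &reqs[t]);
>>>>>         }
>>>>>         MPI_Waitall(nthreads, reqs, MPI_STATUSES_IGNORE);
>>>>>     }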
>>>>>
>>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>>> available, so the situation on any other platform is much, much worse.
>>>>> What happens if I want to do send/recv from 240 OpenMP threads on
>>>>> Intel MIC (let's ignore the PCI connection for discussion purposes)?
>>>>> What happens when all the message headers get pushed into a shared
>>>>> queue (that certainly can't be friendly to the memory hierarchy) and
>>>>> then we enter waitall?  If you have good ideas on how to make this as
>>>>> efficient as the way it would happen with PAMI-style endpoints, please
>>>>> let me know.
>>>>>
>>>>> I have considered how it looks when each thread uses a different
>>>>> communicator and the MPI implementation can use a per-comm message
>>>>> queue.  However, this precludes the possibility of inter-thread comm
>>>>> via MPI on those threads, which means I now have to completely reinvent
>>>>> the wheel if I want to do e.g. an allreduce within my threads.  OpenMP
>>>>> might make this possible but I write primarily Pthread and TBB apps.
>>>>> I'd like to be able to have Pthreads = endpoints calling MPI
>>>>> collectives just like I can with processes.
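>>>>>
>>>>> A sketch of that per-communicator workaround and its limitation
>>>>> (OpenMP is used here only for brevity):
>>>>>
>>>>>     #include <mpi.h>
>>>>>     #include <omp.h>
>>>>>
>>>>>     /* One duplicated communicator per thread lets the implementation
>>>>>      * keep a separate match queue per communicator.  But each
>>>>>      * duplicate still holds the same group, one rank per process, so
>>>>>      * no collective can be expressed "across the threads". */
>>>>>     void per_thread_comms(double *buf[], int count, int peer,
>>>>>                           int nthreads)
>>>>>     {
>>>>>         MPI_Comm tcomm[nthreads];
>>>>>         for (int t = 0; t < nthreads; t++)
>>>>>             MPI_Comm_dup(MPI_COMM_WORLD, &tcomm[t]);
>>>>>
>>>>>         #pragma omp parallel num_threads(nthreads)
>>>>>         {
>>>>>             int t = omp_get_thread_num();
>>>>>             MPI_Send(buf[t], count, MPI_DOUBLE, peer, 0, tcomm[t]);
>>>>>         }
>>>>>
>>>>>         for (int t = 0; t < nthreads; t++)
>>>>>             MPI_Comm_free(&tcomm[t]);
>>>>>     }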
>>>>>
>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>> demonstrate the utility of lockless endpoints.
>>>>>
>>>>>> My point is that merely because you send at the user
>>>>>> level from multiple threads, you have no guarantee
>>>>>> that the implementation does not serialize those
>>>>>> messages using a single thread to do the processing.
>>>>>
>>>>>
>>>>>
>>>>> While I cannot guarantee that my implementation is good, if I have to
>>>>> use MPI_THREAD_MULTIPLE in its current form, I preclude the
>>>>> possibility that the implementation can do concurrency properly.
>>>>>
>>>>> What I'm trying to achieve is a situation where the implementation is
>>>>> not _required_ to serialize those messages, which is what has to
>>>>> happen today.  Every machine besides BGQ serializes all the way down
>>>>> as far as I can tell.
>>>>>
>>>>>> I think what you really want is helper threads...
>>>>>
>>>>>
>>>>>
>>>>> Definitely not.  I thought about that proposal for a long time since
>>>>> the motivation was clearly Blue Gene and I told IBM that they were
>>>>> fully empowered already to implement all the things they claimed they
>>>>> needed helper threads for.  It just would have required them to talk
>>>>> to their kernel and/or OpenMP people.  In fact, I think merely
>>>>> intercepting pthread_create and pthread_"destroy" calls would have
>>>>> been sufficient.  I'm trying to write a paper on how to do everything
>>>>> in the helper threads proposal without any new MPI functions to
>>>>> demonstrate this point.
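>>>>>
>>>>> The interception idea in a nutshell (an LD_PRELOAD-style sketch;
>>>>> mpi_runtime_note_new_thread() is a hypothetical hook, not a real
>>>>> interface):
>>>>>
>>>>>     #define _GNU_SOURCE
>>>>>     #include <dlfcn.h>
>>>>>     #include <pthread.h>
>>>>>
>>>>>     void mpi_runtime_note_new_thread(void);   /* hypothetical hook */
>>>>>
>>>>>     typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
>>>>>                              void *(*)(void *), void *);
>>>>>
>>>>>     /* Preloaded wrapper: every pthread_create in the application is
>>>>>      * seen here first, so the MPI runtime learns that a new
>>>>>      * user-level thread exists (a matching wrapper could cover
>>>>>      * thread exit). */
>>>>>     int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
>>>>>                        void *(*start)(void *), void *arg)
>>>>>     {
>>>>>         static create_fn real_create;
>>>>>         if (!real_create)
>>>>>             real_create = (create_fn) dlsym(RTLD_NEXT, "pthread_create");
>>>>>         mpi_runtime_note_new_thread();
>>>>>         return real_create(thread, attr, start, arg);
>>>>>     }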
>>>>>
>>>>> Best,
>>>>>
>>>>> Jeff
>>>>>
>>>>> --
>>>>> Jeff Hammond
>>>>> Argonne Leadership Computing Facility
>>>>> University of Chicago Computation Institute
>>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>>> http://www.linkedin.com/in/jeffhammond
>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jeff Hammond
>>> Argonne Leadership Computing Facility
>>> University of Chicago Computation Institute
>>> jhammond at alcf.anl.gov / (630) 252-5381
>>> http://www.linkedin.com/in/jeffhammond
>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>
>>
>
>
>
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>

