[Mpi3-hybridpm] helper threads (forked from "Endpoints Proposal")

Jeff Hammond jhammond at alcf.anl.gov
Wed Mar 20 07:32:23 CDT 2013


Let's assume for the sake of argument that OpenMP is the only
threading model (we can generalize later)...

Can you explain why an MPI implementation cannot use OpenMP internally
and let the existing mechanisms within OpenMP runtimes for avoiding
oversubscription take care of things?

I looked back at
http://meetings.mpi-forum.org/secretary/2010/06/slides/mpi3_helperthreads.pdf
and see two fundamental errors in the assumptions made, which is why I
view this proposal with skepticism.

"But the MPI implementation cannot spawn its own threads" - False.
Blue Gene/Q MPI spawns threads.
"Difficult to identify whether the application threads are “active” or
not" - False.  The operating system obviously knows whether threads
are active or not.  The motivating architecture for endpoints was an
obvious case where MPI-OS interactions could solve this trivially.

I am certainly not an OpenMP expert like you are, but my limited
understanding of both the specification and existing implementations
suggests that OpenMP can manage its thread pool in such a way that it
is safe for MPI to use OpenMP internally.  In the worst case, MPI ends
up where MKL and similar libraries are, which is that they have to
fall back to a single thread when it is unsafe to do otherwise.

Jeff

On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>
> No, you are confusing absence of current use (which is
> at best nebulous) with programmer intent. The point is
> that the programmer is declaring the thread will not
> be used for user-level code so system software (including
> user-level middleware) can use the threads.
>
>
> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>
>> Sorry, but you're confusing portable, standardizable solutions with
>> what IBM can and should be doing in Blue Gene MPI.  CNK knows every
>> thread that exists and MPI can query that.  Problem solved.  Other
>> vendors can do the same.
>>
>> Jeff
>>
>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski <bronis at llnl.gov>
>> wrote:
>>>
>>>
>>> Jeff:
>>>
>>> Sorry you are incorrect about helper threads. The point
>>> is to notify the MPI implementation that the threads are
>>> not currently in use and will not be in use for some time.
>>> No mechanism is currently available to do that in existing
>>> threading implementations.
>>>
>>> Bronis
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>
>>>> Hi Bronis,
>>>>
>>>>> Do you really need endpoints for this? Do you want to
>>>>> send from multiple threads or do you want multiple
>>>>> threads to participate in processing the messages?
>>>>> Might a better way to specify a set of (nonblocking?)
>>>>> messages and an underlying implementation that parallelizes
>>>>> their processing suffice?
>>>>
>>>>
>>>>
>>>> While it is not in the current proposal - which is to say, my comments
>>>> that follow should not undermine the existing proposal - what I really
>>>> need is lockless communication from multiple threads, which is exactly
>>>> what PAMI endpoints provide already.
>>>>
>>>> It is certainly possible to post a bunch of nonblocking send/recv and
>>>> let MPI parallelize in the waitall, but since I know exactly what this
>>>> entails on Blue Gene/Q w.r.t. how the implementation funnels all those
>>>> concurrent operations into shared state and then pulls them all back
>>>> out into multiple comm threads, I know that what you're proposing here
>>>> is nowhere near as efficient as it could be.
>>>>
>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>> available, so the situation on any other platform is much, much worse.
>>>> What happens if I want to do send/recv from 240 OpenMP threads on
>>>> Intel MIC (let's ignore the PCI connection for discussion purposes)?
>>>> What happens when all the message headers get pushed into a shared
>>>> queue (that certainly can't be friendly to the memory hierarchy) and
>>>> then we enter waitall?  If you have good ideas on how to make this as
>>>> efficient as the way it would happen with PAMI-style endpoints, please
>>>> let me know.
>>>>
>>>> I have considered how it looks when each thread uses a different
>>>> communicator and the MPI implementation can use a per-comm message
>>>> queue.  However, this precludes the possibility of inter-thread comm
>>>> via MPI on those threads, which means I now have to completely
>>>> reinvent the wheel if I want to do e.g. an allreduce within my
>>>> threads.  OpenMP
>>>> might make this possible but I write primarily Pthread and TBB apps.
>>>> I'd like to be able to have Pthreads-as-endpoints calling MPI
>>>> collectives just like I can with processes.
>>>>
>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>> demonstrate the utility of lockless endpoints.
>>>>
>>>>> My point is that merely because you send at the user
>>>>> level from multiple threads, you have no guarantee
>>>>> that the implementation does not serialize those
>>>>> messages using a single thread to do the processing.
>>>>
>>>>
>>>>
>>>> While I cannot guarantee that my implementation is good, if I have to
>>>> use MPI_THREAD_MULTIPLE in its current form, I preclude the
>>>> possibility that the implementation can do concurrency properly.
>>>>
>>>> What I'm trying to achieve is a situation where the implementation is
>>>> not _required_ to serialize those messages, which is what has to
>>>> happen today.  Every machine besides BGQ serializes all the way down
>>>> as far as I can tell.
>>>>
>>>>> I think what you really want is helper threads...
>>>>
>>>>
>>>>
>>>> Definitely not.  I thought about that proposal for a long time since
>>>> the motivation was clearly Blue Gene and I told IBM that they were
>>>> fully empowered already to implement all the things they claimed they
>>>> needed helper threads for.  It just would have required them to talk
>>>> to their kernel and/or OpenMP people.  In fact, I think merely
>>>> intercepting pthread_create and pthread_"destroy" calls would have
>>>> been sufficient.  I'm trying to write a paper on how to do everything
>>>> in the helper threads proposal without any new MPI functions to
>>>> demonstrate this point.
>>>>
>>>> Best,
>>>>
>>>> Jeff
>>>>
>>>>
>>>
>>
>>
>>
>>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond




More information about the mpiwg-hybridpm mailing list