[Mpi3-hybridpm] helper threads (forked from "Endpoints Proposal")
Jeff Hammond
jhammond at alcf.anl.gov
Wed Mar 20 14:47:44 CDT 2013
But the motivating use case of helper threads is collectives. How is
a new thread going to be active when MPI_Allreduce is happening? This
is the part I just don't understand. Maybe Doug's examples are just
too limited. Do you think helper threads are required for
point-to-point as well?
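To make the question concrete, here is the pattern I have in mind (a
minimal sketch using only standard MPI and OpenMP, assuming MPI was
initialized with at least MPI_THREAD_FUNNELED):

    #include <mpi.h>
    #include <omp.h>

    /* While the master thread sits in MPI_Allreduce, the other
     * OpenMP threads are idle.  How would a helper-threads
     * interface hand them to the MPI implementation for the
     * duration of the collective? */
    void phase(double *local, double *global, int n)
    {
        #pragma omp parallel
        {
            #pragma omp master
            MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
            /* nothing for the other threads to do until the
             * collective completes */
            #pragma omp barrier
        }
    }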
Thanks,
Jeff
On Wed, Mar 20, 2013 at 2:26 PM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>
> Jeff:
>
> I agree that those bullet points are inaccurate. However,
> they are not why I am advocating the approach.
>
> Yes, you can trivially determine in the OpenMP runtime if
> a thread is not currently involved in running user code.
> However, you cannot determine that the next cycle(s) will
> not lead to a thread becoming involved. The point is to
> provide the user a mechanism to do that for MPI.
>
>
> Bronis
>
>
> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>
>> Let's assume for the sake of argument that OpenMP is the only
>> threading model (we can generalize later)...
>>
>> Can you explain why an MPI implementation cannot use OpenMP
>> internally and let the OpenMP runtime's existing mechanisms for
>> avoiding oversubscription take care of things?
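>> Concretely, something like the following is all I mean -- an
>> illustrative sketch of an implementation-internal combine step, not
>> a claim about any existing MPI library:
>>
>>     #include <omp.h>
>>
>>     /* Local combine step of a reduction, parallelized inside the
>>      * MPI library.  If the user calls into MPI from an active
>>      * parallel region, nested parallelism is off by default, so
>>      * this region runs on one thread and nothing is
>>      * oversubscribed. */
>>     static void local_sum(double *inout, const double *in, int n)
>>     {
>>         int i;
>>         #pragma omp parallel for
>>         for (i = 0; i < n; i++)
>>             inout[i] += in[i];
>>     }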
>>
>> I looked back at
>>
>> http://meetings.mpi-forum.org/secretary/2010/06/slides/mpi3_helperthreads.pdf
>> and see two fundamental errors in the assumptions made, which is why I
>> view this proposal with skepticism.
>>
>> "But the MPI implementation cannot spawn its own threads" - False.
>> Blue Gene/Q MPI spawns threads.
>> "Difficult to identify whether the application threads are “active” or
>> not" - False. The operating system obviously knows whether threads
>> are active or not. The motivating architecture for endpoints was an
>> obvious case where MPI-OS interactions could solve this trivially.
>>
>> I am certainly not an OpenMP expert like you are, but my limited
>> understanding of both the OpenMP spec and the MPI standard suggests
>> that OpenMP can manage its thread pool in such a way that MPI is
>> safe to use OpenMP. In the worst case, MPI ends up where MKL, etc.
>> are, which is that they have to use a single thread when it is
>> unsafe to do otherwise.
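>> That fallback is easy to express; a sketch, again assuming OpenMP
>> is the threading model:
>>
>>     #include <omp.h>
>>
>>     /* The MKL-style rule: fan out internally only when we are
>>      * not already inside an active parallel region. */
>>     int internal_nthreads(void)
>>     {
>>         if (omp_in_parallel())
>>             return 1;                 /* unsafe to use the pool */
>>         return omp_get_max_threads(); /* safe to fan out */
>>     }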
>>
>> Jeff
>>
>> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov>
>> wrote:
>>>
>>>
>>> No, you are confusing absence of current use (which is
>>> at best nebulous) with programmer intent. The point is
>>> that the programmer is declaring the thread will not
>>> be used for user-level code so system software (including
>>> user-level middleware) can use the threads.
>>>
>>>
>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>
>>>> Sorry, but you're confusing portable, standardizable solutions with
>>>> what IBM can and should be doing in Blue Gene MPI. CNK knows every
>>>> thread that exists and MPI can query that. Problem solved. Other
>>>> vendors can do the same.
>>>>
>>>> Jeff
>>>>
>>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>>> wrote:
>>>>>
>>>>> Jeff:
>>>>>
>>>>> Sorry, you are incorrect about helper threads. The point
>>>>> is to notify the MPI implementation that the threads are
>>>>> not currently in use and will not be in use for some time.
>>>>> No mechanism is currently available to do that in existing
>>>>> threading implementations.
>>>>>
>>>>> Bronis
>>>>>
>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>
>>>>>> Hi Bronis,
>>>>>>
>>>>>>> Do you really need endpoints for this? Do you want to
>>>>>>> send from multiple threads or do you want multiple
>>>>>>> threads to participate in processing the messages?
>>>>>>> Might a way to specify a set of (nonblocking?)
>>>>>>> messages, with an underlying implementation that
>>>>>>> parallelizes their processing, suffice?
>>>>>>
>>>>>> While it is not in the current proposal - which is to say, my comments
>>>>>> that follow should not undermine the existing proposal - what I really
>>>>>> need is lockless communication from multiple threads, which is exactly
>>>>>> what PAMI endpoints provide already.
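>>>>>> For concreteness, here is roughly what I mean.  This is a
>>>>>> hypothetical sketch in the style of the endpoints proposal;
>>>>>> the MPIX_Comm_create_endpoints name and signature are
>>>>>> illustrative, not part of MPI today:
>>>>>>
>>>>>>     #include <mpi.h>
>>>>>>     #include <pthread.h>
>>>>>>
>>>>>>     #define NTHREADS 4
>>>>>>     MPI_Comm ep[NTHREADS];   /* one endpoint per thread */
>>>>>>
>>>>>>     void *worker(void *arg)
>>>>>>     {
>>>>>>         int me = *(int *)arg;
>>>>>>         double x = (double)me;
>>>>>>         /* each thread sends on its own endpoint, so there
>>>>>>          * is no shared queue to lock */
>>>>>>         MPI_Send(&x, 1, MPI_DOUBLE, 0, me, ep[me]);
>>>>>>         return NULL;
>>>>>>     }
>>>>>>
>>>>>>     /* after MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...):
>>>>>>      *   MPIX_Comm_create_endpoints(MPI_COMM_WORLD, NTHREADS,
>>>>>>      *                              MPI_INFO_NULL, ep);
>>>>>>      * then spawn NTHREADS pthreads running worker(). */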
>>>>>>
>>>>>> It is certainly possible to post a bunch of nonblocking send/recv and
>>>>>> let MPI parallelize in the waitall, but since I know exactly what this
>>>>>> entails on Blue Gene/Q w.r.t. how the implementation funnels all those
>>>>>> concurrent operations into shared state and then pulls them all back
>>>>>> out into multiple comm threads, I know that what you're proposing here
>>>>>> is nowhere near as efficient as it could be.
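>>>>>> To be explicit, the pattern in question is just this (buffers,
>>>>>> destinations, and requests supplied by the caller):
>>>>>>
>>>>>>     #include <mpi.h>
>>>>>>
>>>>>>     /* post everything, then complete in one waitall */
>>>>>>     void exchange(double **buf, int *dest, int n, int count,
>>>>>>                   MPI_Comm comm, MPI_Request *req)
>>>>>>     {
>>>>>>         int i;
>>>>>>         for (i = 0; i < n; i++)
>>>>>>             MPI_Isend(buf[i], count, MPI_DOUBLE, dest[i],
>>>>>>                       /* tag */ 0, comm, &req[i]);
>>>>>>         MPI_Waitall(n, req, MPI_STATUSES_IGNORE);
>>>>>>     }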
>>>>>>
>>>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>>>> available, so the situation on any other platform is much, much worse.
>>>>>> What happens if I want to do send/recv from 240 OpenMP threads on
>>>>>> Intel MIC (let's ignore the PCI connection for discussion purposes)?
>>>>>> What happens when all the message headers get pushed into a shared
>>>>>> queue (that certainly can't be friendly to the memory hierarchy) and
>>>>>> then we enter waitall? If you have good ideas on how to make this as
>>>>>> efficient as the way it would happen with PAMI-style endpoints, please
>>>>>> let me know.
>>>>>>
>>>>>> I have considered how it looks when each thread uses a different
>>>>>> communicator and the MPI implementation can use a per-comm message
>>>>>> queue. However, this precludes the possibility of inter-thread comm
>>>>>> via MPI on those threads, which means I now have to completely reinvent
>>>>>> the wheel if I want to do e.g. an allreduce within my threads. OpenMP
>>>>>> might make this possible but I write primarily Pthread and TBB apps.
>>>>>> I'd like to be able to have Pthreads = endpoints calling MPI
>>>>>> collectives just like I can with processes.
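>>>>>> For reference, the per-communicator workaround looks like this
>>>>>> (a sketch, not something I am advocating):
>>>>>>
>>>>>>     #include <mpi.h>
>>>>>>
>>>>>>     /* One duplicate of the parent communicator per thread,
>>>>>>      * so the implementation can keep a per-comm message
>>>>>>      * queue.  MPI_Comm_dup is collective, so every process
>>>>>>      * must make the same number of calls.  The threads of a
>>>>>>      * process remain a single rank, which is why they cannot
>>>>>>      * do an allreduce among themselves through MPI. */
>>>>>>     void per_thread_comms(MPI_Comm parent, int nthreads,
>>>>>>                           MPI_Comm *percomm)
>>>>>>     {
>>>>>>         int t;
>>>>>>         for (t = 0; t < nthreads; t++)
>>>>>>             MPI_Comm_dup(parent, &percomm[t]);
>>>>>>     }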
>>>>>>
>>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>>> demonstrate the utility of lockless endpoints.
>>>>>>
>>>>>>> My point is that merely because you send at the user
>>>>>>> level from multiple threads, you have no guarantee
>>>>>>> that the implementation does not serialize those
>>>>>>> messages using a single thread to do the processing.
>>>>>>
>>>>>> While I cannot guarantee that my implementation is good, if I have to
>>>>>> use MPI_THREAD_MULTIPLE in its current form, I preclude the
>>>>>> possibility that the implementation can do concurrency properly.
>>>>>>
>>>>>> What I'm trying to achieve is a situation where the implementation is
>>>>>> not _required_ to serialize those messages, which is what has to
>>>>>> happen today. Every machine besides BGQ serializes all the way down
>>>>>> as far as I can tell.
>>>>>>
>>>>>>> I think what you really want is helper threads...
>>>>>>
>>>>>> Definitely not. I thought about that proposal for a long time since
>>>>>> the motivation was clearly Blue Gene and I told IBM that they were
>>>>>> fully empowered already to implement all the things they claimed they
>>>>>> needed helper threads for. It just would have required them to talk
>>>>>> to their kernel and/or OpenMP people. In fact, I think merely
>>>>>> intercepting pthread_create and pthread_"destroy" calls would have
>>>>>> been sufficient. I'm trying to write a paper on how to do everything
>>>>>> in the helper threads proposal without any new MPI functions to
>>>>>> demonstrate this point.
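>>>>>> The interception itself is a page of code.  A sketch of the
>>>>>> LD_PRELOAD approach, where notify_thread_created() stands in
>>>>>> for whatever hook the MPI runtime would want (the hook name is
>>>>>> made up):
>>>>>>
>>>>>>     #define _GNU_SOURCE
>>>>>>     #include <dlfcn.h>
>>>>>>     #include <pthread.h>
>>>>>>
>>>>>>     typedef int (*create_fn)(pthread_t *,
>>>>>>                              const pthread_attr_t *,
>>>>>>                              void *(*)(void *), void *);
>>>>>>
>>>>>>     /* wrap pthread_create so the runtime can track every
>>>>>>      * thread that exists */
>>>>>>     int pthread_create(pthread_t *t, const pthread_attr_t *a,
>>>>>>                        void *(*fn)(void *), void *arg)
>>>>>>     {
>>>>>>         static create_fn real;
>>>>>>         if (!real)
>>>>>>             real = (create_fn)dlsym(RTLD_NEXT,
>>>>>>                                     "pthread_create");
>>>>>>         /* notify_thread_created();  <- hypothetical hook */
>>>>>>         return real(t, a, fn, arg);
>>>>>>     }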
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> --
>>>>>> Jeff Hammond
>>>>>> Argonne Leadership Computing Facility
>>>>>> University of Chicago Computation Institute
>>>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>>>> http://www.linkedin.com/in/jeffhammond
>>>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Jeff Hammond
>>>> Argonne Leadership Computing Facility
>>>> University of Chicago Computation Institute
>>>> jhammond at alcf.anl.gov / (630) 252-5381
>>>> http://www.linkedin.com/in/jeffhammond
>>>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>>>
>>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhammond at alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
--
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond