[Mpi3-hybridpm] helper threads (forked from "Endpoints Proposal")

Bronis R. de Supinski bronis at llnl.gov
Wed Mar 20 17:42:14 CDT 2013


Jeff:

Re:
> Thanks.  This is a much better example.  I have a few comments/questions:
>
> - PAMI comm threads already interoperate with OpenMP such that they
> should be co-scheduled by the OS, or at least Doug was working on that
> at my request.  The WU can be programmed to time-slice comm threads
> with OpenMP threads to circumvent the max of 64 threads.  Granted,
> this is very specific to BGQ, but I don't see why this type of
> solution isn't viable for other implementers that might need comm
> thread support.  In your example, I think such an implementation would
> work just fine.

The issue is setting the relative priority of message processing
and computation. The helper threads proposal says, "Prioritize
computation; do not use extra resources unless the user indicates
that they are available." So the user explicitly determines the
relative priority of local computation and message work. I don't
think it is perfect, but at least the user is able to provide
information on the relative priority of the activities. An OS-level
time-slicing solution would create noise for the computation when
the user knows that best-effort messaging is sufficient.
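To make that concrete, here is a minimal sketch of user-directed
donation. The MPIX_Helper_join name is a hypothetical illustration,
not the proposal's actual interface:

#include <mpi.h>
#include <omp.h>

extern void long_computation(void);

void compute_phase(MPI_Request *req)
{
  #pragma omp parallel num_threads(4)
  {
    if (omp_get_thread_num() == 0) {
      long_computation();               /* computation keeps priority */
      MPI_Wait(req, MPI_STATUS_IGNORE);
    } else {
      /* hypothetical call: donate this thread to MPI until the
         library no longer needs it; the user, not the OS, decides
         when the thread is available */
      MPIX_Helper_join(MPI_COMM_WORLD);
    }
  }
}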

> - The benefit of helper threads in send-recv is expected to be
> matching, injection or something else?  My expectation would have been
> that dedicated comm threads are most useful for MPI_Op computation in
> reductions and accumulate.  I know how comm threads help matching and
> injection on Cray and Blue Gene, but I don't think that helper threads
> are the right solution here.  This is, of course, just my opinion
> though.

My thinking has focused on injection. I can see how helper
threads help with MPI_Op computation, although most of our codes
use predefined operations that should not take long at short
message lengths (and could be implemented in the NIC). I have not
really thought about using them for matching. As you point
out, that issue poses some problems for parallelization
(I think your serialization point comes into play).
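For what it's worth, here is a minimal sketch of the user-defined
MPI_Op case, where node-level parallelization of the reduction
computation could pay off (the elementwise max is just an example):

#include <mpi.h>

/* user-defined reduction whose elementwise loop is a natural
   target for extra threads; here OpenMP parallelizes it directly */
static void vec_max(void *in, void *inout, int *len, MPI_Datatype *dt)
{
  double *a = (double *)in, *b = (double *)inout;
  (void)dt;  /* unused: homogeneous doubles assumed */
  #pragma omp parallel for
  for (int i = 0; i < *len; i++)
    if (a[i] > b[i]) b[i] = a[i];
}

/* usage: MPI_Op op; MPI_Op_create(vec_max, 1, &op);
          MPI_Allreduce(sbuf, rbuf, n, MPI_DOUBLE, op, comm); */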

Bronis




> Best,
>
> Jeff
>
> On Wed, Mar 20, 2013 at 3:50 PM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>>
>> Jeff:
>>
>> Re:
>>
>>> But the motivating use case of helper threads is collectives.  How is
>>> a new thread going to be active when MPI_Allreduce is happening?  This
>>> is the part I just don't understand.  Maybe Doug's examples are just
>>> too limited.  Do you think helper threads are required for
>>> point-to-point as well?
>>
>>
>> My use case does not assume that the messages are collectives.
>> Both collectives and p2p could benefit.
>>
>> Here is some pseudo-code to illustrate how the issue could arise:
>>
>> #pragma omp parallel num_threads(5)
>> {
>>   do_some_single_level_thread_parallelism();
>> }
>> #pragma omp parallel num_threads(2)
>> {
>>   #pragma omp sections
>>   {
>>     #pragma omp section
>>     {
>>       do_some_stuff_and_decide_MPI_op_is_ready();
>>       MPI_Allreduce(...);
>>     }
>>     #pragma omp section
>>     {
>>       while (have_stuff_to_do) {
>>         do_some_single_threaded_stuff();
>>         #pragma omp parallel num_threads(4)
>>         {
>>           do_some_nested_parallelism_stuff();
>>         }
>>       }
>>     }
>>   }
>> }
>>
>> So, in the second section, I alternate between sequential
>> work and work parallelized over four threads. During the
>> sequential work (or in the surrounding code shown), the other
>> threads are not active, but the system does not know how
>> long they will remain inactive. I'd like to make them available
>> to MPI when the sequential stretch is "long", something the
>> user would know.
>>
>> I could also make those threads appear "active" to OpenMP:
>> instead of alternating between sequential code and repeatedly
>> spawning inner parallel regions, I could keep one inner parallel
>> region alive and put the sequential work in a single construct.
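>> A minimal sketch of that restructuring (assuming
>> have_stuff_to_do is seen consistently by all threads):
>>
>> #pragma omp parallel num_threads(4)
>> {
>>   while (have_stuff_to_do) {
>>     #pragma omp single
>>     do_some_single_threaded_stuff();    /* others wait at the implicit
>>                                            barrier, so they look "active" */
>>     do_some_nested_parallelism_stuff(); /* all four threads work */
>>   }
>> }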
>>
>> Personally, I don't see Allreduce as the best MPI op to
>> illustrate this use case. I tend to think of Allreduces
>> as using little data, so parallelization within a node is
>> not likely to be useful. A large send would be better...
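>> Something like this sketch (the names are illustrative), where
>> injecting one large message is what otherwise-idle threads could
>> plausibly help with:
>>
>> MPI_Request req;
>> MPI_Isend(big_buf, n, MPI_DOUBLE, peer, tag, comm, &req);
>> /* helper threads could pipeline the injection of big_buf here */
>> MPI_Wait(&req, MPI_STATUS_IGNORE);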
>>
>> Bronis
>>
>>> Thanks,
>>>
>>> Jeff
>>>
>>> On Wed, Mar 20, 2013 at 2:26 PM, Bronis R. de Supinski <bronis at llnl.gov>
>>> wrote:
>>>>
>>>>
>>>> Jeff:
>>>>
>>>> I agree that those bullet points are inaccurate. However,
>>>> they are not why I am advocating the approach.
>>>>
>>>> Yes, you can trivially determine in the OpenMP runtime whether
>>>> a thread is currently involved in running user code. However,
>>>> you cannot determine that the next cycle(s) will not lead to
>>>> that thread becoming involved. The point is to provide the user
>>>> a mechanism to convey that information to MPI.
>>>>
>>>>
>>>> Bronis
>>>>
>>>>
>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>
>>>>> Let's assume for the sake of argument that OpenMP is the only
>>>>> threading model (we can generalize later)...
>>>>>
>>>>> Can you explain why an MPI implementation cannot use OpenMP
>>>>> internally and let the OpenMP runtime's existing mechanisms for
>>>>> avoiding oversubscription take care of things?
>>>>>
>>>>> I looked back at
>>>>> http://meetings.mpi-forum.org/secretary/2010/06/slides/mpi3_helperthreads.pdf
>>>>> and see two fundamental errors in the assumptions made, which is why I
>>>>> view this proposal with skepticism.
>>>>>
>>>>> "But the MPI implementation cannot spawn its own threads" - False.
>>>>> Blue Gene/Q MPI spawns threads.
>>>>> "Difficult to identify whether the application threads are “active” or
>>>>> not" - False.  The operating system obviously knows whether threads
>>>>> are active or not.  The motivating architecture for endpoints was an
>>>>> obvious case where MPI-OS interactions could solve this trivially.
>>>>>
>>>>> I am certainly not an OpenMP expert like you are, but my limited
>>>>> understanding of both the spec and the standard suggests that
>>>>> OpenMP can manage its thread pool in such a way that it is safe
>>>>> for MPI to use OpenMP.  In the worst case, MPI ends up where MKL,
>>>>> etc. are, which is that they have to use a single thread when it
>>>>> is unsafe to do otherwise.
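>>>>> A sketch of that worst-case guard, in the spirit of what MKL
>>>>> does (the function is illustrative, not any real MPI internals):
>>>>>
>>>>> #include <omp.h>
>>>>>
>>>>> static int mpi_internal_nthreads(void)
>>>>> {
>>>>>   /* inside a user parallel region: stay single-threaded;
>>>>>      otherwise use whatever the OpenMP runtime will give us */
>>>>>   return omp_in_parallel() ? 1 : omp_get_max_threads();
>>>>> }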
>>>>>
>>>>> Jeff
>>>>>
>>>>> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>>>> wrote:
>>>>>>
>>>>>> No, you are confusing absence of current use (which is
>>>>>> at best nebulous) with programmer intent. The point is
>>>>>> that the programmer is declaring that the thread will not
>>>>>> be used for user-level code, so system software (including
>>>>>> user-level middleware) can use the threads.
>>>>>>
>>>>>>
>>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>>
>>>>>>> Sorry, but you're confusing portable, standardizable solutions with
>>>>>>> what IBM can and should be doing in Blue Gene MPI.  CNK knows every
>>>>>>> thread that exists and MPI can query that.  Problem solved.  Other
>>>>>>> vendors can do the same.
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski
>>>>>>> <bronis at llnl.gov>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Jeff:
>>>>>>>>
>>>>>>>> Sorry, but you are incorrect about helper threads. The point
>>>>>>>> is to notify the MPI implementation that the threads are
>>>>>>>> not currently in use and will not be in use for some time.
>>>>>>>> No mechanism is currently available to do that in existing
>>>>>>>> threading implementations.
>>>>>>>>
>>>>>>>> Bronis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>>>>
>>>>>>>>> Hi Bronis,
>>>>>>>>>
>>>>>>>>>> Do you really need endpoints for this? Do you want to
>>>>>>>>>> send from multiple threads or do you want multiple
>>>>>>>>>> threads to participate in processing the messages?
>>>>>>>>>> Might a better way to specify a set of (nonblocking?)
>>>>>>>>>> messages and an underlying implementation that parallelizes
>>>>>>>>>> their processing suffice?
>>>>>>>>>
>>>>>>>>> While it is not in the current proposal - which is to say, my
>>>>>>>>> comments that follow should not undermine the existing proposal -
>>>>>>>>> what I really need is lockless communication from multiple
>>>>>>>>> threads, which is exactly what PAMI endpoints provide already.
>>>>>>>>>
>>>>>>>>> It is certainly possible to post a bunch of nonblocking send/recv
>>>>>>>>> and let MPI parallelize in the waitall, but since I know exactly
>>>>>>>>> what this entails on Blue Gene/Q w.r.t. how the implementation
>>>>>>>>> funnels all those concurrent operations into shared state and
>>>>>>>>> then pulls them all back out into multiple comm threads, I know
>>>>>>>>> that what you're proposing here is nowhere near as efficient as
>>>>>>>>> it could be.
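>>>>>>>>> To be concrete, the pattern I mean is this (a sketch; NREQ,
>>>>>>>>> bufs, and peers are illustrative):
>>>>>>>>>
>>>>>>>>> MPI_Request reqs[NREQ];
>>>>>>>>> for (int i = 0; i < NREQ; i++)
>>>>>>>>>   MPI_Isend(bufs[i], n, MPI_BYTE, peers[i], tag, comm, &reqs[i]);
>>>>>>>>> /* the implementation may parallelize completion here */
>>>>>>>>> MPI_Waitall(NREQ, reqs, MPI_STATUSES_IGNORE);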
>>>>>>>>>
>>>>>>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>>>>>>> available, so the situation on any other platform is much, much
>>>>>>>>> worse.  What happens if I want to do send/recv from 240 OpenMP
>>>>>>>>> threads on Intel MIC (let's ignore the PCI connection for
>>>>>>>>> discussion purposes)?  What happens when all the message headers
>>>>>>>>> get pushed into a shared queue (that certainly can't be friendly
>>>>>>>>> to the memory hierarchy) and then we enter waitall?  If you have
>>>>>>>>> good ideas on how to make this as efficient as the way it would
>>>>>>>>> happen with PAMI-style endpoints, please let me know.
>>>>>>>>>
>>>>>>>>> I have considered how it looks when each thread uses a different
>>>>>>>>> communicator and the MPI implementation can use a per-comm message
>>>>>>>>> queue.  However, this precludes the possibility of inter-thread
>>>>>>>>> communication via MPI on those threads, which means I now have to
>>>>>>>>> completely reinvent the wheel if I want to do e.g. an allreduce
>>>>>>>>> within my threads.  OpenMP might make this possible, but I write
>>>>>>>>> primarily Pthreads and TBB apps.  I'd like to be able to have
>>>>>>>>> Pthreads = endpoints calling MPI collectives just like I can with
>>>>>>>>> processes.
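>>>>>>>>> The per-comm variant I considered looks roughly like this
>>>>>>>>> (a sketch; NTHREADS is illustrative):
>>>>>>>>>
>>>>>>>>> MPI_Comm thread_comm[NTHREADS];
>>>>>>>>> for (int t = 0; t < NTHREADS; t++)
>>>>>>>>>   MPI_Comm_dup(MPI_COMM_WORLD, &thread_comm[t]);
>>>>>>>>> /* thread t uses thread_comm[t], so matching can be kept
>>>>>>>>>    per-communicator -- but threads of one process cannot talk
>>>>>>>>>    to each other through MPI this way */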
>>>>>>>>>
>>>>>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>>>>>> demonstrate the utility of lockless endpoints.
>>>>>>>>>
>>>>>>>>>> My point is that merely because you send at the user
>>>>>>>>>> level from multiple threads, you have no guarantee
>>>>>>>>>> that the implementation does not serialize those
>>>>>>>>>> messages using a single thread to do the processing.
>>>>>>>>>
>>>>>>>>> While I cannot guarantee that my implementation is good, if I
>>>>>>>>> have to use MPI_THREAD_MULTIPLE in its current form, I preclude
>>>>>>>>> the possibility that the implementation can do concurrency
>>>>>>>>> properly.
>>>>>>>>>
>>>>>>>>> What I'm trying to achieve is a situation where the implementation
>>>>>>>>> is not _required_ to serialize those messages, which is what has
>>>>>>>>> to happen today.  Every machine besides BGQ serializes all the
>>>>>>>>> way down, as far as I can tell.
>>>>>>>>>
>>>>>>>>>> I think what you really want is helper threads...
>>>>>>>>>
>>>>>>>>> Definitely not.  I thought about that proposal for a long time,
>>>>>>>>> since the motivation was clearly Blue Gene, and I told IBM that
>>>>>>>>> they were fully empowered already to implement all the things
>>>>>>>>> they claimed they needed helper threads for.  It just would have
>>>>>>>>> required them to talk to their kernel and/or OpenMP people.  In
>>>>>>>>> fact, I think merely intercepting pthread_create and
>>>>>>>>> pthread_"destroy" calls would have been sufficient.  I'm trying
>>>>>>>>> to write a paper on how to do everything in the helper threads
>>>>>>>>> proposal without any new MPI functions to demonstrate this point.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Jeff
>>>>>>>>>
>
> -- 
> Jeff Hammond
> Argonne Leadership Computing Facility
> University of Chicago Computation Institute
> jhammond at alcf.anl.gov / (630) 252-5381
> http://www.linkedin.com/in/jeffhammond
> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>

