[Mpi3-hybridpm] helper threads (forked from "Endpoints Proposal")

Jeff Hammond jhammond at alcf.anl.gov
Wed Mar 20 17:07:37 CDT 2013


Thanks.  This is a much better example.  I have a few comments/questions:

- PAMI comm threads already interoperate with OpenMP such that they
should be co-scheduled by the OS, or at least Doug was working on that
at my request.  The WU (BGQ's wakeup unit) can be programmed to
time-slice comm threads
with OpenMP threads to circumvent the max of 64 threads.  Granted,
this is very specific to BGQ, but I don't see why this type of
solution isn't viable for other implementers that might need comm
thread support.  In your example, I think such an implementation would
work just fine.

- Is the benefit of helper threads in send-recv expected to be
matching, injection, or something else?  My expectation would have
been that dedicated comm threads are most useful for MPI_Op
computation in reductions and accumulate.  I know how comm threads
help matching and injection on Cray and Blue Gene, but I don't think
that helper threads are the right solution here.  This is, of course,
just my opinion.

Best,

Jeff

On Wed, Mar 20, 2013 at 3:50 PM, Bronis R. de Supinski <bronis at llnl.gov> wrote:
>
> Jeff:
>
> Re:
>
>> But the motivating use case of helper threads is collectives.  How is
>> a new thread going to be active when MPI_Allreduce is happening?  This
>> is the part I just don't understand.  Maybe Doug's examples are just
>> too limited.  Do you think helper threads are required for
>> point-to-point as well?
>
>
> My use case does not assume that the messages are collectives.
> Both collectives and p2p could benefit.
>
> Here is some pseudo-code to illustrate how the issue could arise:
>
> #pragma omp parallel num_threads(5)
> {
>   do_some_single_level_thread_parallelism();
> }
> #pragma omp parallel num_threads(2)
> {
>   #pragma omp sections
>   {
>     #pragma omp section
>     {
>       do_some_stuff_and_decide_MPI_op_is_ready();
>       MPI_Allreduce();
>     }
>     #pragma omp section
>     {
>       while (have_stuff_to_do) {
>         do_some_single_threaded_stuff();
>         #pragma omp parallel num_threads(4)
>         {
>           do_some_nested_parallelism_stuff();
>         }
>       }
>     }
>   }
> }
>
> So, in the second section, I alternate between sequential
> work and work parallelized over four threads. While the
> sequential work (or the code around the regions shown) is
> running, the other threads are not active, but the system
> does not know how long they will remain inactive. I'd like
> to make them available to MPI when the sequential stuff is
> "long", something the user would know.
>
> I could also make those threads appear "active" to OpenMP.
> Instead of alternating between sequential code and repeatedly
> spawning inner parallel regions, I could use a single,
> persistent inner parallel region.
>
> Personally, I don't see Allreduce as the best MPI op to
> illustrate this use case. I tend to think of Allreduces
> as using little data so parallelization within a node is
> not likely to be useful. A large send would be better...
>
> Bronis
>
>> Thanks,
>>
>> Jeff
>>
>> On Wed, Mar 20, 2013 at 2:26 PM, Bronis R. de Supinski <bronis at llnl.gov>
>> wrote:
>>>
>>>
>>> Jeff:
>>>
>>> I agree that those bullet points are inaccurate. However,
>>> they are not why I am advocating the approach.
>>>
>>> Yes, you can trivially determine in the OpenMP runtime if
>>> a thread is not currently involved in running user code.
>>> However, you cannot determine that the next cycle(s) will
>>> not lead to a thread becoming involved. The point is to
>>> provide the user a mechanism to do that for MPI.
>>>
>>>
>>> Bronis
>>>
>>>
>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>
>>>> Let's assume for the sake of argument that OpenMP is the only
>>>> threading model (we can generalize later)...
>>>>
>>>> Can you explain why an MPI implementation cannot use OpenMP internally
>>>> and let the existing mechanisms within OpenMP runtimes for not
>>>> oversubscribing take care of things?
>>>>
>>>> I looked back at
>>>>
>>>>
>>>> http://meetings.mpi-forum.org/secretary/2010/06/slides/mpi3_helperthreads.pdf
>>>> and see two fundamental errors in the assumptions made, which is why I
>>>> view this proposal with skepticism.
>>>>
>>>> "But the MPI implementation cannot spawn its own threads" - False.
>>>> Blue Gene/Q MPI spawns threads.
>>>> "Difficult to identify whether the application threads are “active” or
>>>> not" - False.  The operating system obviously knows whether threads
>>>> are active or not.  The motivating architecture for endpoints was an
>>>> obvious case where MPI-OS interactions could solve this trivially.
>>>>
>>>> I am certainly not an OpenMP expert like you are, but my limited
>>>> understanding of the standard suggests that OpenMP can manage its
>>>> thread pool in such a way that MPI is safe to use OpenMP.  In the
>>>> worst case, MPI ends up where MKL, etc. are, which is that they
>>>> have to use a single thread when it is unsafe to do otherwise.
>>>>
>>>> Jeff
>>>>
>>>> On Wed, Mar 20, 2013 at 6:56 AM, Bronis R. de Supinski <bronis at llnl.gov>
>>>> wrote:
>>>>>
>>>>> No, you are confusing absence of current use (which is
>>>>> at best nebulous) with programmer intent. The point is
>>>>> that the programmer is declaring the thread will not
>>>>> be used for user-level code so system software (including
>>>>> user-level middleware) can use the threads.
>>>>>
>>>>>
>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>
>>>>>> Sorry, but you're confusing portable, standardizable solutions with
>>>>>> what IBM can and should be doing in Blue Gene MPI.  CNK knows every
>>>>>> thread that exists and MPI can query that.  Problem solved.  Other
>>>>>> vendors can do the same.
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> On Wed, Mar 20, 2013 at 6:22 AM, Bronis R. de Supinski
>>>>>> <bronis at llnl.gov>
>>>>>> wrote:
>>>>>>>
>>>>>>> Jeff:
>>>>>>>
>>>>>>> Sorry, you are incorrect about helper threads. The point
>>>>>>> is to notify the MPI implementation that the threads are
>>>>>>> not currently in use and will not be in use for some time.
>>>>>>> No mechanism is currently available to do that in existing
>>>>>>> threading implementations.
>>>>>>>
>>>>>>> Bronis
>>>>>>>
>>>>>>> On Wed, 20 Mar 2013, Jeff Hammond wrote:
>>>>>>>
>>>>>>>> Hi Bronis,
>>>>>>>>
>>>>>>>>> Do you really need endpoints for this? Do you want to
>>>>>>>>> send from multiple threads or do you want multiple
>>>>>>>>> threads to participate in processing the messages?
>>>>>>>>> Might a better way to specify a set of (nonblocking?)
>>>>>>>>> messages and an underlying implementation that parallelizes
>>>>>>>>> their processing suffice?
>>>>>>>>
>>>>>>>> While it is not in the current proposal - which is to say, my
>>>>>>>> comments
>>>>>>>> that follow should not undermine the existing proposal - what I
>>>>>>>> really
>>>>>>>> need is lockless communication from multiple threads, which is
>>>>>>>> exactly
>>>>>>>> what PAMI endpoints provide already.
>>>>>>>>
>>>>>>>> It is certainly possible to post a bunch of nonblocking send/recv
>>>>>>>> and
>>>>>>>> let MPI parallelize in the waitall, but since I know exactly what
>>>>>>>> this
>>>>>>>> entails on Blue Gene/Q w.r.t. how the implementation funnels all
>>>>>>>> those
>>>>>>>> concurrent operations into shared state and then pulls them all back
>>>>>>>> out into multiple comm threads, I know that what you're proposing
>>>>>>>> here
>>>>>>>> is nowhere near as efficient as it could be.
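[Editorial aside: the "post everything nonblocking, then complete in one waitall" pattern under discussion can be sketched as follows. Only standard MPI calls are used; the request count and buffer sizes are made up for illustration.]

```c
#include <mpi.h>

#define NREQ  8
#define COUNT 1024

/* Post NREQ send/recv pairs to one peer, then complete them together.
   All the concurrency is funneled through the single MPI_Waitall: the
   implementation *may* progress the requests with multiple comm
   threads there, but nothing in the interface requires it to. */
void exchange_with_peer(MPI_Comm comm, int peer,
                        double send[NREQ][COUNT],
                        double recv[NREQ][COUNT])
{
    MPI_Request reqs[2 * NREQ];

    for (int i = 0; i < NREQ; i++) {
        /* Receives first, so matching resources are preposted. */
        MPI_Irecv(recv[i], COUNT, MPI_DOUBLE, peer, i, comm, &reqs[2*i]);
        MPI_Isend(send[i], COUNT, MPI_DOUBLE, peer, i, comm, &reqs[2*i + 1]);
    }

    MPI_Waitall(2 * NREQ, reqs, MPI_STATUSES_IGNORE);
}
```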
>>>>>>>>
>>>>>>>> And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
>>>>>>>> available, so the situation on any other platform is much, much
>>>>>>>> worse.
>>>>>>>> What happens if I want to do send/recv from 240 OpenMP threads on
>>>>>>>> Intel MIC (let's ignore the PCI connection for discussion purposes)?
>>>>>>>> What happens when all the message headers get pushed into a shared
>>>>>>>> queue (that certainly can't be friendly to the memory hierarchy) and
>>>>>>>> then we enter waitall?  If you have good ideas on how to make this
>>>>>>>> as
>>>>>>>> efficient as the way it would happen with PAMI-style endpoints,
>>>>>>>> please
>>>>>>>> let me know.
>>>>>>>>
>>>>>>>> I have considered how it looks when each thread uses a different
>>>>>>>> communicator and the MPI implementation can use a per-comm message
>>>>>>>> queue.  However, this precludes the possibility of inter-thread comm
>>>>>>>> via MPI on those threads, which means I now have to completely
>>>>>>>> reinvent
>>>>>>>> the wheel if I want to do e.g. an allreduce within my threads.
>>>>>>>> OpenMP
>>>>>>>> might make this possible but I write primarily Pthread and TBB apps.
>>>>>>>> I'd like to be able to have Pthreads = endpoints calling MPI
>>>>>>>> collectives just like I can with processes.
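[Editorial aside: the per-thread-communicator workaround Jeff mentions amounts to duplicating the parent communicator once per thread, so the implementation can keep independent matching queues. A sketch, with the caveat he notes that threads on different communicators cannot join a common collective:]

```c
#include <mpi.h>

/* Duplicate the parent communicator once per thread.  MPI_Comm_dup is
   collective on the parent, so the duplications must be performed
   sequentially (e.g. before entering the parallel region), not
   concurrently from the threads themselves. */
void make_per_thread_comms(MPI_Comm parent, MPI_Comm *comms, int nthreads)
{
    for (int t = 0; t < nthreads; t++)
        MPI_Comm_dup(parent, &comms[t]);
}

/* Inside a parallel region, thread t then sends and receives on
   comms[t] without sharing a matching queue with the other threads,
   but it also cannot reach them through an MPI collective. */
```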
>>>>>>>>
>>>>>>>> I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
>>>>>>>> demonstrate the utility of lockless endpoints.
>>>>>>>>
>>>>>>>>> My point is that merely because you send at the user
>>>>>>>>> level from multiple threads, you have no guarantee
>>>>>>>>> that the implementation does not serialize those
>>>>>>>>> messages using a single thread to do the processing.
>>>>>>>>
>>>>>>>> While I cannot guarantee that my implementation is good, if I have
>>>>>>>> to
>>>>>>>> use MPI_THREAD_MULTIPLE in its current form, I preclude the
>>>>>>>> possibility that the implementation can do concurrency properly.
>>>>>>>>
>>>>>>>> What I'm trying to achieve is a situation where the implementation
>>>>>>>> is
>>>>>>>> not _required_ to serialize those messages, which is what has to
>>>>>>>> happen today.  Every machine besides BGQ serializes all the way down
>>>>>>>> as far as I can tell.
>>>>>>>>
>>>>>>>>> I think what you really want is helper threads...
>>>>>>>>
>>>>>>>> Definitely not.  I thought about that proposal for a long time since
>>>>>>>> the motivation was clearly Blue Gene and I told IBM that they were
>>>>>>>> fully empowered already to implement all the things they claimed
>>>>>>>> they
>>>>>>>> needed helper threads for.  It just would have required them to talk
>>>>>>>> to their kernel and/or OpenMP people.  In fact, I think merely
>>>>>>>> intercepting pthread_create and pthread_"destroy" calls would have
>>>>>>>> been sufficient.  I'm trying to write a paper on how to do
>>>>>>>> everything
>>>>>>>> in the helper threads proposal without any new MPI functions to
>>>>>>>> demonstrate this point.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jeff
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond



