[Mpi3-hybridpm] Endpoints Proposal

Wed Mar 20 06:04:05 CDT 2013

Hi Bronis,

> Do you really need endpoints for this? Do you want to
> send from multiple threads or do you want multiple
> threads to participate in processing the messages?
> Might a better way to specify a set of (nonblocking?)
> messages and an underlying implementation that parallelizes
> their processing suffice?

While it is not in the current proposal - which is to say, my comments
that follow should not undermine the existing proposal - what I really
need is lockless communication from multiple threads, which is exactly
what PAMI endpoints provide already.

It is certainly possible to post a bunch of nonblocking send/recv and
let MPI parallelize in the waitall, but since I know exactly what this
entails on Blue Gene/Q w.r.t. how the implementation funnels all those
concurrent operations into shared state and then pulls them all back
out into multiple comm threads, I know that what you're proposing here
is nowhere near as efficient as it could be.

And Blue Gene/Q has by far the best support for MPI_THREAD_MULTIPLE
available, so the situation on any other platform is much, much worse.
What happens if I want to do send/recv from 240 OpenMP threads on
Intel MIC (let's ignore the PCI connection for discussion purposes)?
What happens when all the message headers get pushed into a shared
queue (that certainly can't be friendly to the memory hierarchy) and
then we enter waitall?  If you have good ideas on how to make this as
efficient as the way it would happen with PAMI-style endpoints, please
let me know.
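
To make the contrast concrete, here is a minimal sketch in plain
Pthreads. It is not PAMI or MPI code, and every name in it (header_t,
queue_t, post_all, and so on) is hypothetical. It assumes each "send"
amounts to appending a small header to a queue. With shared = 1 every
thread appends to one queue under a mutex, which is roughly the
funneling described above; with shared = 0 each thread owns its queue,
endpoint-style, and never takes the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical message header; real MPI/PAMI headers carry far more. */
typedef struct { int src_thread; int tag; } header_t;

typedef struct {
    header_t       *slots;
    size_t          count;
    size_t          capacity;
    pthread_mutex_t lock;      /* taken only when the queue is shared */
} queue_t;

typedef struct {
    queue_t *q;
    int      use_lock;
    int      tid;
    int      nmsgs;
} worker_arg_t;

/* Each thread "posts" nmsgs headers to the queue it was handed. */
static void *post_headers(void *p)
{
    worker_arg_t *a = p;
    for (int i = 0; i < a->nmsgs; i++) {
        if (a->use_lock) pthread_mutex_lock(&a->q->lock);
        assert(a->q->count < a->q->capacity);
        a->q->slots[a->q->count++] = (header_t){ a->tid, i };
        if (a->use_lock) pthread_mutex_unlock(&a->q->lock);
    }
    return NULL;
}

/* Posts nmsgs headers from each of nthreads threads.
 * shared != 0: one queue for everyone, guarded by a mutex.
 * shared == 0: one private queue per thread, no locking.
 * Returns the total number of headers queued, or -1 on failure. */
long post_all(int nthreads, int nmsgs, int shared)
{
    if (nthreads <= 0 || nmsgs <= 0) return -1;
    int    nqueues = shared ? 1 : nthreads;
    size_t cap     = shared ? (size_t)nthreads * (size_t)nmsgs
                            : (size_t)nmsgs;

    queue_t      *queues  = calloc((size_t)nqueues, sizeof *queues);
    worker_arg_t *args    = calloc((size_t)nthreads, sizeof *args);
    pthread_t    *threads = calloc((size_t)nthreads, sizeof *threads);
    if (!queues || !args || !threads) {
        free(queues); free(args); free(threads);
        return -1;
    }

    for (int q = 0; q < nqueues; q++) {
        queues[q].slots = malloc(cap * sizeof(header_t));
        assert(queues[q].slots != NULL);   /* sketch: no recovery path */
        queues[q].capacity = cap;
        pthread_mutex_init(&queues[q].lock, NULL);
    }

    int started = 0;
    for (int t = 0; t < nthreads; t++) {
        args[t] = (worker_arg_t){ &queues[shared ? 0 : t], shared, t, nmsgs };
        if (pthread_create(&threads[t], NULL, post_headers, &args[t]) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(threads[t], NULL);

    long total = 0;
    for (int q = 0; q < nqueues; q++) {
        total += (long)queues[q].count;
        pthread_mutex_destroy(&queues[q].lock);
        free(queues[q].slots);
    }
    free(queues); free(args); free(threads);
    return started == nthreads ? total : -1;
}
```

Both post_all(240, 1000, 1) and post_all(240, 1000, 0) queue the same
240000 headers. The difference is that the first serializes every
append on one lock and one set of cache lines, while the second lets
each thread touch only its own memory. The sketch says nothing about
how any real MPI implementation behaves; it only shows the shape of
the two designs.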

I have considered how it looks when each thread uses a different
communicator and the MPI implementation can use a per-comm message
queue.  However, this precludes the possibility of inter-thread comm
via MPI on those threads, which means I now have to completely reinvent
the wheel if I want to do e.g. an allreduce within my threads.  OpenMP
might make this possible but I write primarily Pthread and TBB apps.
I'd like to be able to have Pthreads = endpoints calling MPI
collectives just like I can with processes.

I will write up an example with PAMI+OpenMP or PAMI+Pthreads to
demonstrate the utility of lockless endpoints.

> My point is that merely because you send at the user
> level from multiple threads, you have no guarantee
> that the implementation does not serialize those
> messages using a single thread to do the processing.

While I cannot guarantee that my implementation is good, if I have to
use MPI_THREAD_MULTIPLE in its current form, I preclude the
possibility that the implementation can do concurrency properly.

What I'm trying to achieve is a situation where the implementation is
not _required_ to serialize those messages, which is what has to
happen today.  Every machine besides BGQ serializes all the way down
as far as I can tell.

> I think what you really want is helper threads...

Definitely not.  I thought about that proposal for a long time since
the motivation was clearly Blue Gene and I told IBM that they were
fully empowered already to implement all the things they claimed they
needed helper threads for.  It just would have required them to talk
to their kernel and/or OpenMP people.  In fact, I think merely
intercepting pthread_create and pthread_"destroy" calls would have
been sufficient.  I'm trying to write a paper on how to do everything
in the helper threads proposal without any new MPI functions to
demonstrate this point.

Best,

Jeff

-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond