[Mpi3-rma] RMA proposal 1 update
blocksom at us.ibm.com
Mon May 24 15:12:08 CDT 2010
Right. I think all of the "suggested" implementations do not need a
special allfenceall interface because they are essentially variations on
the "fence or allfence, then barrier" idea. To justify a new interface
there must be some way that some implementation can use hardware-specific
optimizations that are not exposed through the existing pt2pt and
collective primitives. I'm just saying that experience tells me we should
be able to do something interesting in this area on BG/P. I agree that is
not a very compelling argument, but I haven't the time to come up with a
full BG/P design. :) Maybe Jeff, Brian, and I can hash something out
that is a little more real.
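To make the "fenceall, then barrier" idea concrete, here is a minimal toy sketch in Python. It is not MPI, DCMF, or any proposed interface: `ToyWindow`, `put`, `fenceall`, `allfenceall`, and `run` are hypothetical names, and threads stand in for ranks. It only shows the semantics: each rank completes its own outstanding puts, then all ranks synchronize.

```python
import threading


class ToyWindow:
    """Toy stand-in for an RMA window shared by `nranks` ranks (not a real MPI API)."""

    def __init__(self, nranks):
        self.nranks = nranks
        self.memory = [dict() for _ in range(nranks)]  # target-side memory
        self.pending = [[] for _ in range(nranks)]     # puts issued, not yet completed
        self.lock = threading.Lock()
        self.barrier = threading.Barrier(nranks)

    def put(self, origin, target, key, value):
        # Nonblocking put: only recorded at the origin, not yet visible at the target.
        self.pending[origin].append((target, key, value))

    def fenceall(self, origin):
        # Complete every outstanding put issued by `origin` at its target.
        for target, key, value in self.pending[origin]:
            with self.lock:
                self.memory[target][key] = value
        self.pending[origin].clear()

    def allfenceall(self, origin):
        # "fenceall, then barrier": once every rank has left the barrier,
        # every rank's puts have completed at every target.
        self.fenceall(origin)
        self.barrier.wait()


def run(nranks=4):
    """Each rank puts its id into its right neighbour, then calls allfenceall."""
    win = ToyWindow(nranks)
    seen = [None] * nranks

    def rank_main(rank):
        win.put(rank, (rank + 1) % nranks, "from", rank)
        win.allfenceall(rank)
        # After allfenceall, the left neighbour's put must be visible locally.
        seen[rank] = win.memory[rank].get("from")

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Calling `run(4)` has every rank read the value written by its left neighbour. Nothing in this sketch needs more than a point-to-point completion plus a barrier, which is the reason it does not by itself justify a new interface.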
I probably didn't wordsmith this very well, but what I mean to say is that
some hardware features are only available to a platform at the very lowest
levels of the software stack. For example, on BG/P we have the raw DMA SPI
which represents 100% of the hardware features, but the public C interface
for DCMF doesn't provide all these details to its users (MPICH2 DCMFd
"glue" and ARMCI). For DCMF we have the luxury of adding interfaces as the
needs arise, and we would design a new interface so that the guts of DCMF
could use as many SPI tricks as possible to implement the semantics of the
interface for the user, as opposed to reusing existing DCMF C API
functions to implement the semantics of the interface.
Deterministic routing performance ..
Well, it depends on what you are trying to do. :) For bisection bandwidth
you'll definitely need dynamic routing to avoid network hotspots. However,
deterministic routing allows us to play games to eliminate, avoid, or
limit acks or active remote participation. The problem with counter arrays
or any data structure that grows with the size of the job is you just run
out of memory. This shows up on BG/P in virtual node mode at scale where
every additional 32 bits of a data structure will result in an additional
1MB of memory usage - and there is only 256 MB available per rank in vnm!
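The arithmetic behind that figure can be checked with a short Python sketch. The job size below is an assumption: 256 * 1024 ranks is simply the rank count at which one 32-bit entry per rank comes to exactly 1 MiB, and "MB" is read as MiB. The function names are hypothetical.

```python
MIB = 1024 * 1024


def per_rank_table_bytes(nranks, bytes_per_entry):
    """Memory one rank spends on a table with one entry for every rank in the job."""
    return nranks * bytes_per_entry


def fields_that_fit(nranks, bytes_per_entry, budget_bytes):
    """How many such per-rank fields fit if the whole budget went to them."""
    return budget_bytes // per_rank_table_bytes(nranks, bytes_per_entry)


# Assumption: a job of 256 * 1024 ranks (not a figure stated in this thread).
ASSUMED_RANKS = 256 * 1024

# One 32-bit (4-byte) counter per remote rank.
counter_table = per_rank_table_bytes(ASSUMED_RANKS, 4)

# 256 MB per rank in virtual node mode, treated here as 256 MiB.
max_fields = fields_that_fit(ASSUMED_RANKS, 4, 256 * MIB)

print(counter_table // MIB, "MiB per 32-bit field;", max_fields, "fields would fill memory")
```

Under these assumptions a single counter array costs 1 MiB on every rank, and 256 such fields would consume the entire per-rank memory before the application allocates anything.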
We use deterministic routing to provide a "remote completion" callback
event for DCMF_Put that only uses the DMA hardware - no cores - so that
the put is truly one-sided. The put packets are all deterministically
routed and then followed by a deterministically routed "remote get"
packet. The target DMA will process the remote get packet only after it
has processed all of the put packets. When the primitive get operation
completes on the origin we know that all of the previous put packets have
been pulled from the network and written to memory on the target node.
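The ordering argument can be illustrated with a toy Python model. This is not the DCMF API or the DMA SPI: `put_then_remote_get` and its packet tuples are hypothetical, a FIFO stands in for a deterministically routed path, and the `in_order=False` case is an artificial worst case for what could happen if the trailing packet were allowed to overtake the puts.

```python
from collections import deque


def put_then_remote_get(values, in_order=True):
    """Send one put packet per value, then a trailing "remote get" packet.

    Returns a snapshot of target memory taken at the moment the target
    processes the remote get, i.e. what is guaranteed to be written when
    the origin sees the get complete.
    """
    packets = [("put", offset, value) for offset, value in enumerate(values)]
    remote_get = ("remote_get", None, None)

    if in_order:
        # Deterministic routing modelled as a FIFO: the remote get stays last.
        to_target = deque(packets + [remote_get])
    else:
        # Illustrative worst case without ordering: the remote get overtakes.
        to_target = deque([remote_get] + packets)

    target_memory = {}
    snapshot = None
    while to_target:
        kind, offset, value = to_target.popleft()
        if kind == "put":
            target_memory[offset] = value      # target DMA writes the payload
        else:
            snapshot = dict(target_memory)     # reply heads back to the origin
    return snapshot


ordered = put_then_remote_get([10, 20, 30])
overtaken = put_then_remote_get([10, 20, 30], in_order=False)
print("in order:", ordered)
print("overtaken:", overtaken)
```

With in-order delivery the snapshot contains every put, so the origin needs no acks and no per-target counters. In the overtaking case the get completes before any put has landed, which is why the scheme depends on the ordering that deterministic routing provides.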
I could go on and on, but don't want to pollute this thread any further.
Blue Gene Messaging
Advanced Systems Software Development
blocksom at us.ibm.com
"Underwood, Keith D" <keith.d.underwood at intel.com>
"MPI 3.0 Remote Memory Access working group"
<mpi3-rma at lists.mpi-forum.org>
05/24/2010 02:09 PM
Re: [Mpi3-rma] RMA proposal 1 update
I think the discussion of the three possible implementations below misses
the mark a bit. Any low level messaging layer (DCMF, LAPI, even "MPI")
that can optimize the allreduce and barrier operations could implement the
allfenceall operation as described below. Even though DCMF - which is what
MPICH2 on BG/P is based on - uses the BG/P collective network, that is
still *only* a fast allreduce: it has benefits over other pt2pt allreduce
algorithms, but it is not very interesting when considering how an
allfenceall would be implemented.
If we look at any exascale goal(s), as Jeff mentioned earlier, we'll see
that any time an application or middleware needs to track counters we
won't get the scaling required (assuming the allfenceall is flexible and
not restricted to only "comm_world" operations). What is needed is an
interface that allows low level software to take advantage of any hardware
or network features that would enable a "counter-less" allfenceall
implementation. On BG/P, this would probably take advantage of features
such as the torus dimensions of the network, the (configurable)
deterministic packet routing, the "deposit bit" in the packet header, etc.
I'm not saying I have any specific design in mind, just that experience
tells me that we should be able to do something cool in this space
supported by the BG hardware which would eliminate the need for software
The forum was relatively clear in requiring an implementation for new
calls. In theory, that implementation kind of has to fit with the
motivation for the call. Saying “this would give performance advantages
if we did something special, but this implementation doesn’t give those
advantages” isn’t particularly compelling. The implementations you
mention from earlier in the thread could colloquially be called “the way
it is done now”, “the way IBM suggested”, and “the way Keith suggested
that may be completely ridiculous”. Better implementations would
certainly be interesting to discuss.
BTW, in many of the RMA discussions, it has been asserted that using the
deterministic packet routing is a terrible performance hit. Is that not
If there were an allfenceall interface, then low level messaging could do
whatever it takes to get performance and scaling. Without the allfenceall
interface, middleware is forced to use one of these platform independent
algorithms (allreduce, barrier, etc) and accept any inherent scaling and
I don’t understand this comment. Without the allfenceall, I’m not sure
how you would call a collective to do a fenceall at a given node (fenceall
would complete all outstanding requests from a given rank). I’m also
not sure why the difference between an allfenceall and a fenceall makes a
difference between platform specific and platform independent approaches.