[Mpi3-rma] RMA proposal 1 update

Michael Blocksome blocksom at us.ibm.com
Mon May 24 15:12:08 CDT 2010

Right. I think all of the "suggested" implementations do not need a 
special allfenceall interface because they are essentially variations on 
the "fence or allfence, then barrier" idea. To justify a new interface 
there must be some way that some implementation can use hardware-specific 
optimizations that are not exposed through the existing pt2pt and 
collective primitives.  I'm just saying that experience tells me we should be 
able to do something interesting in this area on BG/P.  I agree that is 
not a very compelling argument, but I haven't the time to come up with a 
full BG/P design. :)  Maybe Jeff, Brian, and I can hash something out 
that is a little more real.

I probably didn't wordsmith this very well, but what I mean to say is that 
some hardware features are only available to a platform at the very lowest 
levels of the software stack. For example, on BG/P we have the raw DMA SPI, 
which represents 100% of the hardware features, but the public C interface 
for DCMF doesn't provide all these details to its users (MPICH2 DCMFd 
"glue" and ARMCI). For DCMF we have the luxury of adding interfaces as the 
needs arise, and we would design a new interface so that the guts of DCMF 
could use as many SPI tricks as possible to implement the semantics of the 
interface for the user, as opposed to reusing existing DCMF C API 
functions to implement the semantics of the interface.

Deterministic routing performance .. 

Well, it depends on what you are trying to do.  :) For bisection bandwidth 
you'll definitely need dynamic routing to avoid network hotspots. However, 
deterministic routing allows us to play games to eliminate, avoid, or 
limit acks or active remote participation. The problem with counter arrays 
or any data structure that grows with the size of the job is that you just 
run out of memory. This shows up on BG/P in virtual node mode at scale, 
where every additional 32 bits per entry of such a data structure results 
in an additional 1 MB of memory usage - and there is only 256 MB available 
per rank in vnm! 
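
To make that arithmetic concrete, here is a minimal sketch in C. It assumes 
the structure holds one fixed-size entry per peer rank, and it assumes a job 
of 262,144 ranks; that rank count is not stated above, it is simply the size 
at which 32 bits per entry comes to exactly 1 MB. The function names are 
hypothetical and only illustrate the calculation.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes a single rank spends on a table with one entry per peer rank.
 * Assumption: the table grows linearly with the job size. */
uint64_t per_rank_table_bytes(uint64_t nranks, uint64_t bytes_per_entry)
{
    return nranks * bytes_per_entry;
}

/* Fraction of one rank's memory budget that such a table consumes. */
double budget_fraction(uint64_t table_bytes, uint64_t budget_bytes)
{
    assert(budget_bytes > 0);
    return (double)table_bytes / (double)budget_bytes;
}
```

Under those assumptions, per_rank_table_bytes(262144, 4) is 1,048,576 bytes, 
so each 32-bit per-peer field costs 1/256 of a 256 MB per-rank budget.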

We use deterministic routing to provide a "remote completion" callback 
event for DCMF_Put that only uses the DMA hardware - no cores - so that 
the put is truly one-sided. The put packets are all deterministically 
routed and then followed by a deterministically routed "remote get" 
packet. The target DMA will process the remote get packet only after it 
has processed all of the put packets. When the primitive get operation 
completes on the origin we know that all of the previous put packets have 
been pulled from the network and written to memory on the target node.
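
The ordering argument above can be sketched as a toy simulation in C. This 
is not DCMF code and uses no DCMF or SPI calls: the in-order FIFO stands in 
for a single deterministically routed path, and every type and function 
name here is hypothetical. It only shows why a get that trails the puts on 
the same ordered path cannot complete before those puts have been written.

```c
#include <assert.h>
#include <stddef.h>

enum pkt_kind { PKT_PUT, PKT_REMOTE_GET };

struct packet {
    enum pkt_kind kind;
    size_t offset;        /* destination offset in target memory (puts only) */
    unsigned char value;  /* one-byte payload (puts only) */
};

#define FIFO_CAP 16

/* In-order channel: models one deterministically routed path. */
struct fifo {
    struct packet slots[FIFO_CAP];
    size_t head, tail;
};

void fifo_push(struct fifo *f, struct packet p)
{
    assert(f->tail < FIFO_CAP);
    f->slots[f->tail++] = p;
}

/* Target-side "DMA": processes packets strictly in arrival order.
 * Returns 1 when it reaches a remote-get packet (the point at which the
 * origin's get would complete), or 0 if the channel held none. */
int target_drain(struct fifo *f, unsigned char *mem, size_t mem_len,
                 size_t *puts_written)
{
    while (f->head < f->tail) {
        struct packet p = f->slots[f->head++];
        if (p.kind == PKT_REMOTE_GET)
            return 1;
        assert(p.offset < mem_len);
        mem[p.offset] = p.value;
        (*puts_written)++;
    }
    return 0;
}

/* Origin side: inject n puts, then one remote get on the same path.
 * Returns 1 if all n puts were in memory when the get completed. */
int put_then_get_completes_all(size_t n)
{
    struct fifo f = {0};
    unsigned char mem[FIFO_CAP] = {0};
    size_t written = 0;

    assert(n < FIFO_CAP);
    for (size_t i = 0; i < n; i++)
        fifo_push(&f, (struct packet){ PKT_PUT, i, (unsigned char)(i + 1) });
    fifo_push(&f, (struct packet){ PKT_REMOTE_GET, 0, 0 });

    int get_done = target_drain(&f, mem, sizeof mem, &written);
    return get_done && written == n;
}
```

The guarantee in the sketch comes entirely from the FIFO. With dynamic 
routing the get could overtake a put, and the origin would need acks or 
counters to learn that the puts had landed.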

I could go on and on, but don't want to pollute this thread any further. 

Michael Blocksome
Blue Gene Messaging
Advanced Systems Software Development
blocksom at us.ibm.com

"Underwood, Keith D" <keith.d.underwood at intel.com>
"MPI 3.0 Remote Memory Access working group" 
<mpi3-rma at lists.mpi-forum.org>
05/24/2010 02:09 PM
Re: [Mpi3-rma] RMA proposal 1 update

I think the discussion of the three possible implementations below misses 
the mark a bit.  Any low level messaging layer (DCMF, LAPI, even "MPI") 
that can optimize the allreduce and barrier operations could implement the 
allfenceall operation as described below.  Even though DCMF - which is 
what MPICH2 on BG/P is based on - uses the BG/P collective network, it still is 
*only* a fast allreduce that has benefits over other pt2pt allreduce 
algorithms and not very interesting when considering how an allfenceall 
would be implemented. 

If we look at any exascale goal(s), as Jeff mentioned earlier, we'll see 
that any time an application or middleware needs to track counters we 
won't get the scaling required (assuming the allfenceall is flexible and 
not restricted to only "comm_world" operations). What is needed is an 
interface that allows low level software to take advantage of any hardware 
or network features that would enable a "counter-less" allfenceall 
implementation. On BG/P, this would probably take advantage of features 
such as the torus dimensions of the network, the (configurable) 
deterministic packet routing, the "deposit bit" in the packet header, etc. 
I'm not saying I have any specific design in mind, just that experience 
tells me that we should be able to do something cool in this space 
supported by the BG hardware which would eliminate the need for software 

The forum was relatively clear in requiring an implementation for new 
calls.  In theory, that implementation kind of has to fit with the 
motivation for the call.  Saying “this would give performance advantages 
if we did something special, but this implementation doesn’t give those 
advantages” isn’t particularly compelling.   The implementations you 
mention from earlier in the thread could colloquially be called “the way 
it is done now”, “the way IBM suggested”, and “the way Keith suggested 
that may be completely ridiculous”.  Better implementations would 
certainly be interesting to discuss.
BTW, in many of the RMA discussions, it has been asserted that using the 
deterministic packet routing is a terrible performance hit.  Is that not 
an issue?

If there was an allfenceall interface, then low level messaging could do 
whatever it takes to get performance and scaling.  Without the allfenceall 
interface middleware is forced to use one of these platform independent 
algorithms (allreduce, barrier, etc) and accept any inherent scaling and 
performance restrictions. 

I don’t understand this comment.  Without the allfenceall, I’m not sure 
how you would call a collective to do a fenceall at a given node (fenceall 
would complete all outstanding requests from a given rank).    I’m also 
not sure why the difference between an allfenceall and a fenceall makes a 
difference between platform specific and platform independent approaches.