[Mpi3-rma] RMA proposal 1 update

Underwood, Keith D keith.d.underwood at intel.com
Mon May 24 14:08:16 CDT 2010


I think the discussion of the three possible implementations below misses the mark a bit.  Any low level messaging layer (DCMF, LAPI, even "MPI") that can optimize the allreduce and barrier operations could implement the allfenceall operation as described below.  Even though DCMF - which is what MPICH2 on BG/P is based on - uses the BG/P collective network, it still is *only* a fast allreduce that has benefits over other pt2pt allreduce algorithms and is not very interesting when considering how an allfenceall would be implemented.

If we look at any exascale goal(s), as Jeff mentioned earlier, we'll see that any time an application or middleware needs to track counters we won't get the scaling required (assuming the allfenceall is flexible and not restricted to only "comm_world" operations). What is needed is an interface that allows low level software to take advantage of any hardware or network features that would enable a "counter-less" allfenceall implementation. On BG/P, this would probably take advantage of features such as the torus dimensions of the network, the (configurable) deterministic packet routing, the "deposit bit" in the packet header, etc. I'm not saying I have any specific design in mind, just that experience tells me that we should be able to do something cool in this space supported by the BG hardware which would eliminate the need for software accounting.
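
To make the "software accounting" concern concrete, here is a minimal sketch in Python of one counter-based scheme. It is an illustration only, not DCMF, LAPI, or any MPI implementation, and the names `expected_arrivals` and `allfenceall_done` are hypothetical. It assumes each rank keeps one counter per target and that a sum reduction over those counters (the role an allreduce or reduce-scatter would play) tells each target how many operations it must see before the allfenceall can return.

```python
def expected_arrivals(sent):
    """sent[src][dst] is the number of RMA operations rank src issued to rank dst.

    Returns, for each rank, the total number of operations targeting it.
    This column sum stands in for the collective sum reduction (assumption).
    """
    nranks = len(sent)
    return [sum(sent[src][dst] for src in range(nranks)) for dst in range(nranks)]


def allfenceall_done(sent, received):
    """True once every rank has seen as many operations arrive as were issued to it."""
    return received == expected_arrivals(sent)


# Three simulated ranks; each row is one rank's per-target counter array.
sent = [
    [0, 2, 1],
    [1, 0, 0],
    [0, 3, 0],
]
print(expected_arrivals(sent))              # [1, 5, 1]
print(allfenceall_done(sent, [1, 4, 1]))    # False: rank 1 is still waiting on one operation
```

In this sketch every rank holds a counter array whose length equals the number of ranks, so the accounting state grows with the size of the job, which is the scaling cost the paragraph above refers to.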

The forum was relatively clear in requiring an implementation for new calls.  In theory, that implementation kind of has to fit with the motivation for the call.  Saying "this would give performance advantages if we did something special, but this implementation doesn't give those advantages" isn't particularly compelling.  The implementations you mention from earlier in the thread could colloquially be called "the way it is done now", "the way IBM suggested", and "the way Keith suggested that may be completely ridiculous".  Better implementations would certainly be interesting to discuss.
BTW, in many of the RMA discussions, it has been asserted that using the deterministic packet routing is a terrible performance hit.  Is that not an issue?

If there was an allfenceall interface, then the low level messaging layer could do whatever it takes to get performance and scaling.  Without the allfenceall interface, middleware is forced to use one of these platform independent algorithms (allreduce, barrier, etc) and accept any inherent scaling and performance restrictions.
I don't understand this comment.  Without the allfenceall, I'm not sure how you would call a collective to do a fenceall at a given node (fenceall would complete all outstanding requests from a given rank).  I'm also not sure why the choice between an allfenceall and a fenceall determines whether the approach is platform specific or platform independent.
Keith