[Mpi3-rma] RMA proposal 1 update

Michael Blocksome blocksom at us.ibm.com
Mon May 24 13:24:40 CDT 2010

I am one of the DCMF developers that wrote the low-level implementation 
for the various one-sided operations on BG/P and I've also done some work 
with implementing ARMCI over DCMF on BG/P.  I have some general comments, 
but can't really find an appropriate email to 'reply to' .. so I'll just 
dive in here.

I think the discussion of the three possible implementations below misses 
the mark a bit.  Any low level messaging layer (DCMF, LAPI, even "MPI") 
that can optimize the allreduce and barrier operations could implement the 
allfenceall operation as described below.  Even though DCMF - which is what 
MPICH2 on BG/P is based on - uses the BG/P collective network, that network 
is *only* a fast allreduce that has benefits over other pt2pt allreduce 
algorithms, and is not very interesting when considering how an allfenceall 
would be implemented. 

If we look at any exascale goal(s), as Jeff mentioned earlier, we'll see 
that any time an application or middleware needs to track counters we 
won't get the scaling required (assuming the allfenceall is flexible and 
not restricted to only "comm_world" operations). What is needed is an 
interface that allows low level software to take advantage of any hardware 
or network features that would enable a "counter-less" allfenceall 
implementation. On BG/P, this would probably take advantage of features 
such as the torus dimensions of the network, the (configurable) 
deterministic packet routing, the "deposit bit" in the packet header, etc. 
I'm not saying I have any specific design in mind, just that experience 
tells me that we should be able to do something cool in this space 
supported by the BG hardware which would eliminate the need for software 

If there were an allfenceall interface, then the low level messaging layer 
could do whatever it takes to get performance and scaling.  Without the 
allfenceall interface, middleware is forced to use one of these 
platform-independent algorithms (allreduce, barrier, etc) and accept any 
inherent scaling and performance restrictions.
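
To make the counter problem concrete, here is a minimal single-process sketch of a counter-based allfenceall built on an allreduce, i.e. the kind of platform-independent algorithm middleware is left with. It is an illustration only and does not describe how DCMF, ARMCI, or MPICH2 work internally: `allreduce_sum`, `allfenceall_counters`, and `counter_storage` are hypothetical names, and the allreduce is simulated with plain Python lists rather than a real messaging layer.

```python
def allreduce_sum(vectors):
    """Stand-in for an allreduce: element-wise sum of one vector per rank.

    In a real run every rank would contribute its own vector and receive
    the same summed result; here all vectors live in one process.
    """
    return [sum(column) for column in zip(*vectors)]


def allfenceall_counters(sent, received):
    """Counter-based completion check (assumed scheme, for illustration).

    sent[i][j]  -- number of one-sided ops rank i issued to rank j
    received[j] -- number of ops that have landed at rank j so far
    Returns, per rank, whether everything targeting it has arrived.
    """
    expected = allreduce_sum(sent)  # expected[j] = total ops aimed at rank j
    return [received[j] == expected[j] for j in range(len(sent))]


def counter_storage(nranks):
    """Total counters when every rank keeps one counter per possible target."""
    return nranks * nranks


# Three ranks; rank 2 is still waiting for one operation to arrive.
sent = [[0, 2, 1],
        [1, 0, 0],
        [0, 3, 0]]
received = [1, 5, 0]
print(allfenceall_counters(sent, received))  # [True, True, False]
print(counter_storage(1024))                 # 1048576
```

The per-rank counter vector grows linearly with the number of ranks, so the total grows quadratically. That is the software bookkeeping a "counter-less" implementation would avoid.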

Michael Blocksome
Blue Gene Messaging
Advanced Systems Software Development
blocksom at us.ibm.com

From: Jeff Hammond <jeff.science at gmail.com>
To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
Date: 05/21/2010 05:13 PM
Subject: Re: [Mpi3-rma] RMA proposal 1 update

>> I've been talking about allflushall because ARMCI_Barrier is partially
>> redundant with MPI_Barrier, which is clearly already in MPI-3.
> Intuitively named as ARMCI_Barrier - I imagine why I missed that ;-)
> So, this offers some opportunity for experimentation...   ARMCI_Barrier
> should be what GA_Sync calls, right?  How is the implementation of
> ARMCI_Barrier done on BG?  It would seem that we could try 3
> implementations in some real world usage scenarios and get some data:

Yes, and not to be confused with armci_msg_barrier


> 1) ARMCI_Barrier the "really bad" way:  i.e. np^2 messages get generated
>    and then a barrier is done
> 2) ARMCI_Barrier as a collective:  allreduce is used to agree about
> 3) ARMCI_Barrier assuming a "small" number of targets between
>    barriers/fences:  you track who has outstanding messages and only sync
>    with them.

Yeah, I'm going to work on this as soon as I have some free time.  I
have (1) and (3) already but I need to test them outside of NWChem to
have any hope of making sense of the data.  I think I can at least
hack (2).  I think I can implement a fourth option that beats (2) but
it has the same complexity model (logarithmic).
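
As a rough way to compare the options quoted above before any real measurements exist, the following sketch counts messages for (1) and (3) and communication rounds for a logarithmic collective. It is a back-of-the-envelope model under stated assumptions, not ARMCI's implementation: the function names are hypothetical, the final barrier is not counted, (1) is modeled as np*(np-1) messages (the "np^2" above, minus self-messages), and the 1024-rank, 4-target workload is made up purely for illustration.

```python
def flush_all_pairs(nranks):
    """Option (1): every rank sends a flush message to every other rank."""
    return nranks * (nranks - 1)


def flush_tracked(targets_per_rank):
    """Option (3): each rank syncs only with the ranks it actually targeted."""
    return sum(len(targets) for targets in targets_per_rank)


def collective_rounds(nranks):
    """Rounds for a tree-style collective: ceil(log2(nranks)), computed exactly."""
    return (nranks - 1).bit_length()


nranks = 1024
# Hypothetical workload: each rank touched its next 4 neighbours since the last sync.
targets = [{(rank + k) % nranks for k in range(1, 5)} for rank in range(nranks)]

print(flush_all_pairs(nranks))    # 1047552
print(flush_tracked(targets))     # 4096
print(collective_rounds(nranks))  # 10
```

The model says nothing about latency or contention; it only shows why (1) is expected to fall behind as the rank count grows and why (3) depends entirely on how many targets each rank touches between syncs.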

> Anyway, just to clarify, GA regularly uses both one-sided and collective
> completion calls?  Or, is it dominated by one or the other?  I look at
> GA_Fence() and see the equivalent of flushall() and GA_Sync = GA_Fence() +
> MPI_Barrier().  If you call both, then you have this mixture of passive
> and active target, but... if GA_Sync is going to perform significantly
> better than GA_Fence(), couldn't you just switch to calling GA_Sync?  It
> would seem like users would rather have the barrier too (especially if it
> was cheaper than calling GA_Fence()).  Or, put another way, the online
> guide for GA essentially says "um, don't call GA_Sync() very often", so is
> allfenceall optimizing the infrequent case?

The GA manual can say that calling GA_Sync will bring on the
apocalypse, but that won't stop quantum chemistry software developers
from calling it at the top and bottom of every subroutine.  =O

While there are too many GA_Sync calls in NWChem, removing the
majority of them requires a complete rewrite of very complex
algorithms and probably redesigning the entire code from top to bottom.

I have never seen GA_Fence used.  All ~remote~ completion in NWChem is
collective.  All three modes of local completion - trivial (blocking),
individual (request-based) and bulk (fenced target) - are used
explicitly via GA or implicitly within GA but invisible to the GA

GA_Fence was probably added much later in response to NWChem users, so
it isn't surprising it is never called in NWChem except in a "test all
GA functionality" utility routine.

> UPC has a similar scenario:  barrier implies a strict access that causes
> a flushall, but everybody knows that you avoid calling barrier to the
> extent you possibly can.  So, the model you use is actually focused on
> minimizing either strict accesses or barriers, but you would typically
> rather do a strict access than a barrier.  If GA code is optimized the
> same way, how often is GA_Sync() called?

GA_Sync is called like crazy in NWChem.  There is an effort to fix
this pathological behavior in one particularly heinous abuser (TCE)
but it will be a long time before NWChem learns to be kind to network


Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
