[Mpi3-rma] RMA proposal 1 update

Fri May 21 17:12:43 CDT 2010

>> I've been talking about allflushall because ARMCI_Barrier is partially
>> redundant with MPI_Barrier, which is clearly already in MPI-3.
>
> Intuitively named as ARMCI_Barrier - I can't imagine why I missed that ;-)  So, this offers some opportunity for experimentation...  ARMCI_Barrier should be what GA_Sync calls, right?  How is the implementation of ARMCI_Barrier done on BG?  It would seem that we could try 3 implementations in some real world usage scenarios and get some data:

Yes, and not to be confused with armci_msg_barrier
(<http://www.emsl.pnl.gov/docs/parsoft/armci/documentation.htm#collect>)

:(

> 1) ARMCI_Barrier the "really bad" way:  i.e. np^2 messages get generated and then a barrier is done
> 2) ARMCI_Barrier as a collective:  allreduce is used to agree about completion
> 3) ARMCI_Barrier assuming a "small" number of targets between barriers/fences:  you track who has outstanding messages and only sync with them.

Yeah, I'm going to work on this as soon as I have some free time.  I
have (1) and (3) already but I need to test them outside of NWChem to
have any hope of making sense of the data.  I think I can at least
hack (2).  I think I can implement a fourth option that beats (2) but
it has the same complexity model (logarithmic).
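
As a rough illustration of why the three quoted options scale so
differently, here is a minimal cost-model sketch in Python.  It only
counts messages and rounds; it is not ARMCI code.  All function names
are hypothetical, and the recursive-doubling allreduce and the
sum-of-counters agreement test are assumptions about how (2) might be
done, not something stated in this thread.

```python
import math

def cost_flush_everyone(nproc):
    """Option (1): every rank flushes every other rank, then a barrier.
    Returns the number of flush messages (the barrier is not counted)."""
    return nproc * (nproc - 1)

def cost_tracked_targets(dirty):
    """Option (3): each rank flushes only the targets it actually touched.
    dirty[r] is the set of ranks that rank r has outstanding messages to."""
    return sum(len(targets - {rank}) for rank, targets in enumerate(dirty))

def allreduce_rounds(nproc):
    """Option (2): rounds of an assumed recursive-doubling allreduce."""
    return math.ceil(math.log2(nproc)) if nproc > 1 else 0

def globally_complete(issued, completed):
    """Option (2), agreement step: what a sum-allreduce over per-rank
    counters would conclude (assumed scheme: done when the totals match)."""
    return sum(issued) == sum(completed)

# Hypothetical scenario: 1024 ranks, each with messages to 4 neighbours.
nproc = 1024
dirty = [{(r + k) % nproc for k in range(1, 5)} for r in range(nproc)]
flush_all = cost_flush_everyone(nproc)
flush_tracked = cost_tracked_targets(dirty)
rounds = allreduce_rounds(nproc)
```

In this made-up scenario the gap between (1) and (3) comes entirely
from the "small number of targets" assumption; with an all-to-all
communication pattern the two counts would be the same.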

> Anyway, just to clarify, GA regularly uses both one-sided and collective completion calls?  Or, is it dominated by one or the other?  I look at GA_Fence() and see the equivalent of flushall() and GA_Sync = GA_Fence() + MPI_Barrier().  If you call both, then you have this mixture of passive and active target, but... if GA_Sync is going to perform significantly better than GA_Fence(), couldn't you just switch to calling GA_Sync?  It would seem like users would rather have the barrier too (especially if it was cheaper than calling GA_Fence()).  Or, put another way, the online guide for GA essentially says "um, don't call GA_Sync() very often", so is allfenceall optimizing the infrequent case?

The GA manual can say that calling GA_Sync will bring on the
apocalypse, but that won't stop quantum chemistry software developers
from calling it at the top and bottom of every subroutine.  =O

While there are too many GA_Sync calls in NWChem, removing the
majority of them would require a complete rewrite of very complex
algorithms and probably a redesign of the entire code from top to
bottom.

I have never seen GA_Fence used.  All ~remote~ completion in NWChem is
collective.  All three modes of local completion, trivial (blocking),
individual (request-based) and bulk (fenced target), are used, either
explicitly via GA or implicitly within GA, where they are invisible to
the GA user.
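
To make the three local-completion modes concrete, the following is a
toy single-process model in Python.  It is only a sketch of the
bookkeeping: the class and method names are hypothetical and do not
correspond to the GA or ARMCI API, and no data is actually moved.

```python
class ToyOneSided:
    """Toy model of outstanding one-sided operations (not GA/ARMCI)."""

    def __init__(self):
        self.next_handle = 0
        self.pending = {}  # handle -> target rank of an in-flight operation

    def put_nb(self, target):
        """Start a nonblocking put and return a request handle."""
        handle = self.next_handle
        self.next_handle += 1
        self.pending[handle] = target
        return handle

    def wait(self, handle):
        """Individual (request-based) completion of one operation."""
        self.pending.pop(handle, None)

    def put(self, target):
        """Trivial (blocking) completion: complete when the call returns."""
        self.wait(self.put_nb(target))

    def fence(self, target):
        """Bulk completion: complete everything outstanding to one target."""
        for handle in [h for h, t in self.pending.items() if t == target]:
            del self.pending[handle]

rma = ToyOneSided()
rma.put(3)             # trivial: nothing left pending afterwards
h = rma.put_nb(1)      # individual: tracked by its own handle
rma.put_nb(2)
rma.put_nb(2)
rma.wait(h)            # completes only the put to rank 1
left_before_fence = len(rma.pending)
rma.fence(2)           # bulk: completes both puts to rank 2
left_after_fence = len(rma.pending)
```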

GA_Fence was probably added much later in response to NWChem users, so
it isn't surprising it is never called in NWChem except in a "test all
GA functionality" utility routine.

> UPC has a similar scenario:  barrier implies a strict access that causes a flushall, but everybody knows that you avoid calling barrier to the extent you possibly can.  So, the model you use is actually focused on minimizing either strict accesses or barriers, but you would typically rather do a strict access than a barrier.  If GA code is optimized the same way, how often is GA_Sync() called?

GA_Sync is called like crazy in NWChem.  There is an effort to fix
this pathological behavior in one particularly heinous abuser (TCE)
but it will be a long time before NWChem learns to be kind to network
hardware.

Jeff

-- 
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond