[Mpi3-rma] RMA proposal 1 update

Fri May 21 16:33:29 CDT 2010

> > The struggle here is that GA has both one-sided and collective
> > completion.  Collective completion can obviously be emulated by one-
> > sided completion + a barrier, but you have indicated that that is a
> > performance issue.  Unfortunately, there is not an obvious place inside
> > of the existing MPI RMA interface where the mixture of collective and
> > one-sided completion fit.  The need for both one-sided and collective
> > completion was clearly not architected into MPI one-sided and may very
> > well break the architecture.  It certainly breaks the naming (there is
> > nothing "passive" about a target that calls allfenceall and there is
> > nothing "active" about a target when an initiator does one-sided
> > completion).
> 
> Clearly, allflushall is not passive, but it is a convenience operation
> to complete what would otherwise be a passive-target epoch.  That it
> would be inter-operable with all passive-target functionality, and
> perhaps not start/post/complete (which still makes no sense to me),
> makes it clearly part of the passive-target set.

Well, that is part of the issue... we are going to have to figure out the language, but I don't think it "completes the epoch" even for the flushall variant (that would be the unlock). 

> > Does anybody know of another model (other than GA) that calls for a
> > mixture of collective and one-sided completion?  CoArray Fortran uses
> > collective completion, UPC expects one-sided completion, SHMEM only
> > exposes one-sided completion, ARMCI only exposes one-sided
> > completion...  If we could look at a second model that needed a
> > mixture, it might help us formulate a better solution.
> 
> ARMCI has collective completion.  From
> <http://www.emsl.pnl.gov/docs/parsoft/armci/documentation.htm#compl>:
> 
> int ARMCI_Barrier()
> PURPOSE: Synchronize processors and memory. This operation combines
> functionality of MPI_Barrier and ARMCI_AllFence.
> 
> I've been talking about allflushall because ARMCI_Barrier is partially
> redundant with MPI_Barrier, which is clearly already in MPI-3.

Intuitively named ARMCI_Barrier - I can imagine why I missed that ;-)  So, this offers some opportunity for experimentation...  ARMCI_Barrier should be what GA_Sync calls, right?  How is ARMCI_Barrier implemented on BG?  It would seem that we could try 3 implementations in some real-world usage scenarios and get some data:

1) ARMCI_Barrier the "really bad" way:  np^2 messages get generated and then a barrier is done
2) ARMCI_Barrier as a collective:  an allreduce is used to agree about completion
3) ARMCI_Barrier assuming a "small" number of targets between barriers/fences:  you track which targets have outstanding messages and only sync with them
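
To make the difference between these three options a bit more concrete, here is a rough toy model in plain C.  It is only a sketch for counting messages, not ARMCI or MPI code: the function names and the `outstanding` matrix are made up for illustration, and the loop in `collective_complete` merely stands in for what a real implementation would presumably do with an MPI_Allreduce using MPI_SUM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: outstanding[i * np + j] != 0 means process i still has
 * unflushed one-sided operations targeting process j. */

/* Option 1: each of the np processes flushes each of the other np - 1
 * processes, whether or not anything is outstanding (the O(np^2) case). */
size_t flush_msgs_all_pairs(size_t np)
{
    return np * (np - 1);
}

/* Option 3: each process flushes only the targets it has marked as having
 * outstanding operations; self-targets are ignored. */
size_t flush_msgs_tracked(const unsigned char *outstanding, size_t np)
{
    assert(outstanding != NULL);
    size_t count = 0;
    for (size_t i = 0; i < np; i++) {
        for (size_t j = 0; j < np; j++) {
            if (i != j && outstanding[i * np + j]) {
                count++;
            }
        }
    }
    return count;
}

/* Option 2: stand-in for a sum-allreduce over per-process counters.
 * issued[i] is the number of operations process i has initiated;
 * completed[i] is the number of operations process i has seen complete
 * as a target.  Completion is agreed when the global sums match. */
int collective_complete(const long *issued, const long *completed, size_t np)
{
    assert(issued != NULL && completed != NULL);
    long diff = 0;
    for (size_t i = 0; i < np; i++) {
        diff += issued[i] - completed[i];
    }
    return diff == 0;
}
```

With 4 processes and only two (initiator, target) pairs active, option 1 in this model sends 12 flush messages where option 3 sends 2.  Whether option 2 beats either one depends on the cost of the allreduce, which this sketch does not attempt to model.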

Anyway, just to clarify: does GA regularly use both one-sided and collective completion calls?  Or is it dominated by one or the other?  I look at GA_Fence() and see the equivalent of flushall(), and GA_Sync() = GA_Fence() + MPI_Barrier().  If you call both, then you have this mixture of passive and active target, but... if GA_Sync() is going to perform significantly better than GA_Fence(), couldn't you just switch to calling GA_Sync()?  It would seem like users would rather have the barrier too (especially if it was cheaper than calling GA_Fence()).  Or, put another way, the online guide for GA essentially says "um, don't call GA_Sync() very often", so is allfenceall optimizing the infrequent case?

UPC has a similar scenario:  barrier implies a strict access that causes a flushall, but everybody knows that you avoid calling barrier to the extent you possibly can.  So, the model you use is actually focused on minimizing either strict accesses or barriers, but you would typically rather do a strict access than a barrier.  If GA code is optimized the same way, how often is GA_Sync() called?  

Keith