[Mpi3-rma] RMA proposal 1 update

Fri May 21 15:42:57 CDT 2010

> My contention was that the number of targets with outstanding requests from a given between one flush/flushall and the next one would frequently not be large enough to justify the cost of a collective.

(1) There exists a counterexample to your contention, and it is
NWChem.  The number of messages NWChem dumps onto the network in some
simulations is ridiculously large.  I stuck a profiler on DCMF to
measure the number of events that advance() processes during a
single call (it is called continuously with lock-cycling by the
communication helper thread).  The biggest number I've seen so far is
more than 60,000.  That number might not be equivalent to the number
of ARMCI messages due to the way non-contiguous transfers are handled
(or not handled), but I think it indicates the order of magnitude that
one has to deal with.

(2) In GA, allflushall is going to be called along with a barrier, so
the extra cost is negligible.

> Alternatively, if messages in that window are "large" (for some definition of "large" that is likely less than 4KB and is certainly no larger than the rendezvous threshold), I would contend that generating a software ack for each one would be essentially zero overhead and would allow source side tracking of remote completion such that a flushall could be a local operation.

There are going to be two contexts worth addressing:
(1) bigger messages, where flushall is local and allflushall is just a barrier.
(2) smaller messages, where allflushall matters.
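
To make the first context concrete, here is a minimal sketch of the
source-side tracking described in the quoted text: the origin counts
unacknowledged messages per target, so flushall only has to check
local counters.  The class and function names and the 32-byte ack
size are my assumptions for illustration, not anything from ARMCI or
DCMF.

```python
from collections import defaultdict


class AckTracker:
    """Origin-side bookkeeping: unacknowledged messages per target (hypothetical)."""

    def __init__(self):
        self.outstanding = defaultdict(int)

    def on_send(self, target):
        # Called when a message to `target` is injected.
        self.outstanding[target] += 1

    def on_ack(self, target):
        # Called when the software ack from `target` arrives.
        if self.outstanding[target] <= 0:
            raise ValueError("ack received with nothing outstanding")
        self.outstanding[target] -= 1

    def flush_done(self, target):
        # Remote completion at one target, decided from local state only.
        return self.outstanding[target] == 0

    def flushall_done(self):
        # Remote completion at every target, still a purely local check.
        return all(count == 0 for count in self.outstanding.values())


def ack_byte_overhead(payload_bytes, ack_bytes=32):
    """Ack bytes as a fraction of payload bytes (ack size is an assumed value)."""
    return ack_bytes / payload_bytes


def messages_for_strided_put(sections, with_acks):
    """One message per contiguous section; acks double the message count."""
    return 2 * sections if with_acks else sections
```

The byte ratio is small for a 4KB payload and large for a few dozen
bytes, but bytes are only part of the story: with one message per
contiguous section, acks double the message count, which is what
hurts in the small-message context.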

Smaller messages matter because on networks with a high injection rate,
no hardware support for strided messages, and/or good DMA
performance relative to memcopy (low-frequency CPU a la BGP), it makes
sense to do ARMCI non-contiguous operations by just sending one
message per contiguous section.  In this scenario, software acks are
not free and one would rather use a nonlocal flush, hence allflushall
is useful.  ARMCI used to do allflushall in GA_Sync by flushing all
links from every node, which deadlocked the network above 1K due to
the nproc^2 total injection.  I rewrote it so it flushes active
connections one at a time (added granularity helps prevent deadlock
but is not good for performance), but this is not going to come close
to the performance I could get with a collective on BGP.
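
As a rough illustration of why a collective helps here, the sketch
below simulates one possible counting scheme: every rank contributes
how many messages it sent to each target, a sum over origins (what a
reduce-scatter would deliver) tells each target how many arrivals to
expect, and each target waits until it has seen that many.  This is
an assumed design for illustration, not how ARMCI or DCMF actually
implements anything, and all function names are hypothetical.  The
last two functions just compare message counts for flushing every
pair against a dissemination barrier.

```python
def expected_incoming(sent):
    """sent[i][j] = messages rank i issued to rank j since the last sync.

    Returns, for each rank j, the number of arrivals it must wait for.
    This column sum is what a sum reduce-scatter would hand to rank j.
    """
    nproc = len(sent)
    return [sum(sent[i][j] for i in range(nproc)) for j in range(nproc)]


def allflushall_complete(sent, received):
    """True once every rank has received everything that was sent to it."""
    expected = expected_incoming(sent)
    return all(received[j] == expected[j] for j in range(len(sent)))


def pairwise_flush_messages(nproc):
    """Every rank flushes every other rank: nproc * (nproc - 1) messages."""
    return nproc * (nproc - 1)


def dissemination_barrier_messages(nproc):
    """Dissemination barrier: ceil(log2(nproc)) rounds, one send per rank per round."""
    rounds = (nproc - 1).bit_length()
    return nproc * rounds


if __name__ == "__main__":
    for nproc in (64, 1024, 4096):
        print(nproc,
              pairwise_flush_messages(nproc),
              dissemination_barrier_messages(nproc))
```

The counting scheme needs no per-message ack and no per-target flush
traffic; the only extra communication is the collective itself, which
GA_Sync is going to pay for anyway via its barrier.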

Jeff

-- 
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond