[Mpi3-rma] RMA proposal 1 update

Fri May 21 16:48:03 CDT 2010

> > My contention was that the number of targets with outstanding
> requests from a given origin between one flush/flushall and the next one would
> frequently not be large enough to justify the cost of a collective.
> 
> (1) There exists a counterexample to your contention, and it is
> NWChem.  The number of messages NWChem dumps onto the network in some
> simulations is ridiculous.  I stuck a profiler on DCMF to measure
> the number of events that a call to advance() processes during a
> single call (it is called continuously with lock-cycling by the
> communication helper thread).  The biggest number I've seen so far is
> more than 60,000.  That number might not be equivalent to the number
> of ARMCI messages due to the way non-contiguous is handled (or not
> handled), but I think it indicates the order of magnitude that one has
> to deal with.

The number of messages in flight is not the same as the number of targets with a message in flight.  Is there any chance of collecting that piece of data?  Two ratios are relevant:  1) the number of targets to the number of messages (1000 messages each to 1000000 targets between flushalls would make the 1 extra message to each target for a non-collective flushall irrelevant), and 2) the number of targets to the time to do the allreduce (e.g. 600 messages to each of 100 targets would probably mean that you could send those 100 flush messages faster than you could do the 2+ allreduces in the collective completion).  Since I don't have enough low-level detail on BGP (or, more importantly, BGQ, since BGP will be out of service by the time we have an optimized MPI3 RMA library), I don't know where the crossover is.
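
To make the second ratio concrete, here is a rough sketch of the comparison I have in mind.  It assumes the per-target flush cost is linear in the number of active targets and that the collective completion costs a fixed number of allreduces.  The function names and every number in it are placeholders for illustration, not measurements from BGP or any other machine.

```python
def per_target_flush_cost(num_targets, flush_us):
    """Cost (us) of flushing each active target individually.

    Assumes one flush message per target and a linear cost model.
    """
    return num_targets * flush_us


def collective_flush_cost(num_allreduces, allreduce_us):
    """Cost (us) of a collective completion built from allreduces."""
    return num_allreduces * allreduce_us


def crossover_targets(num_allreduces, allreduce_us, flush_us):
    """Number of active targets at which the two costs are equal."""
    return collective_flush_cost(num_allreduces, allreduce_us) / flush_us


def cheaper_strategy(num_targets, num_allreduces, allreduce_us, flush_us):
    """Pick the cheaper completion strategy under this toy model."""
    local = per_target_flush_cost(num_targets, flush_us)
    coll = collective_flush_cost(num_allreduces, allreduce_us)
    return "per-target" if local <= coll else "collective"


if __name__ == "__main__":
    # Placeholder values only: 2 allreduces at 4.0 us each, 0.5 us per flush.
    print(crossover_targets(2, 4.0, 0.5))
    print(cheaper_strategy(10, 2, 4.0, 0.5))
    print(cheaper_strategy(100, 2, 4.0, 0.5))
```

With real allreduce and flush latencies plugged in, the number of active targets per flushall interval is the remaining piece of data needed to decide.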

> (2) In GA allflushall is going to be called along with a barrier, so
> the extra cost is negligible.
> 
> > Alternatively, if messages in that window are "large" (for some
> definition of "large" that is likely less than 4KB and is certainly no
> larger than the rendezvous threshold), I would contend that generating
> a software ack for each one would be essentially zero overhead and
> would allow source side tracking of remote completion such that a
> flushall could be a local operation.
> 
> There are going to be two contexts worth addressing:
> (1) bigger messages, where flushall is local and allflushall is just a
> barrier.
> (2) smaller messages, where allflushall matters.
> 
> Smaller messages matter because on networks with a high injection rate,
> no hardware support for strided messages and/or relatively good DMA
> performance relative to memcopy (low-frequency CPU ala BGP), it makes
> sense to do ARMCI non-contiguous operations by just sending one
> message per contiguous section.  In this scenario, software acks are
> not free and one would rather use a nonlocal flush, hence allflushall

It would take some BGQ data to figure out where the crossover is between "small" and "big".  E.g. say that a link is 1 GB/s per direction (total speculation on what those numbers might be - 1's are convenient for some of this math).  So, let's say an ack needs 16 bytes of network traffic.  An ack on a 1KB message would be roughly 1.5% network traffic overhead - pretty tolerable.  Now, let's say you can process a header in 1 microsecond and that it adds 50 ns to generate an ack.  That is 5% processing overhead...  Not too bad.  Oh, and that 1 microsecond per header means that you want to send 1KB messages on a 1 GB/s network to keep the link busy, so, voila, you're good.  Now, the problem is, I don't know what ANY of those numbers actually are for BGQ ;-)  And, BG is the only platform that has been raised as a big issue for this.
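
For anyone who wants to redo that arithmetic with real numbers, a minimal sketch follows.  The helper name is made up for this example, and the inputs are the same speculative values as above (16-byte ack, 1 KB message taken as 1024 bytes, 1 microsecond header processing, 50 ns ack generation, 1 GB/s link), none of which are known BGQ figures.

```python
def ack_overheads(msg_bytes, ack_bytes, header_ns, ack_ns, link_bytes_per_s):
    """Return (bandwidth overhead, processing overhead, wire time in ns).

    bandwidth overhead  = ack bytes relative to payload bytes
    processing overhead = ack generation time relative to header processing
    wire time           = time to push one message through the link
    """
    bandwidth_overhead = ack_bytes / msg_bytes
    processing_overhead = ack_ns / header_ns
    wire_time_ns = msg_bytes * 1e9 / link_bytes_per_s
    return bandwidth_overhead, processing_overhead, wire_time_ns


if __name__ == "__main__":
    # Speculative inputs from the text above, not measured values.
    bw, proc, wire = ack_overheads(
        msg_bytes=1024,
        ack_bytes=16,
        header_ns=1000,
        ack_ns=50,
        link_bytes_per_s=1e9,
    )
    print(f"bandwidth overhead:  {bw:.2%}")
    print(f"processing overhead: {proc:.2%}")
    print(f"wire time per message: {wire:.0f} ns")
```

Shrinking msg_bytes in this sketch shows how quickly the ack overhead grows for the small per-contiguous-section messages described in the quoted text.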

> is useful.  ARMCI used to do allflushall in GA_Sync by flushing all
> links from every node, which deadlocked the network above 1K due to
> the nproc^2 total injection.  I rewrote it so it flushes active
> connections one at a time (added granularity helps prevent deadlock
> but is not good for performance), but this is not going to come close
> to the performance I could get with a collective on BGP.

Deadlock?  Eww.  How did you manage that?  An application (or library) should never be able to deadlock the network while using the lowest level network API... eww...

Keith