[Mpi3-rma] MPI-3 UNIFIED model clarification
balaji at mcs.anl.gov
Tue Jul 30 17:47:33 CDT 2013
While this might be true in the general case, I don't think this is true
on any real network today. Before the DMA is permitted, the cache is
On 07/30/2013 05:40 PM, Jed Brown wrote:
> Pavan Balaji <balaji at mcs.anl.gov> writes:
>> This is what I said is the disagreement in the WG. I can pull up the
>> old email chain if needed, but I think others can too. One side was
>> arguing that there's no such guarantee and you need to do a WIN_SYNC to
>> see the value. The other side was arguing that the WIN_SYNC should not
>> be needed; FLUSH + SEND on the origin should be enough.
> Hmm, linux/Documentation/memory-barriers.txt says:
> CACHE COHERENCY VS DMA
> Not all systems maintain cache coherency with respect to devices doing
> DMA. In such cases, a device attempting DMA may obtain stale data
> from RAM because dirty cache lines may be resident in the caches of
> various CPUs, and may not have been written back to RAM yet. To deal
> with this, the appropriate part of the kernel must flush the
> overlapping bits of cache on each CPU (and maybe invalidate them as well).
> In addition, the data DMA'd to RAM by a device may be overwritten by
> dirty cache lines being written back to RAM from a CPU's cache after
> the device has installed its own data, or cache lines present in the
> CPU's cache may simply obscure the fact that RAM has been updated,
> until at such time as the cacheline is discarded from the CPU's cache
> and reloaded. To deal with this, the appropriate part of the kernel
> must invalidate the overlapping bits of the cache on each CPU.
> I've taken this to mean that you can't guarantee that a DMA write will
> "eventually" be visible to the CPU (because the cache line could hang
> around arbitrarily long). Are implementations doing something here to
> ensure that cache on the target is invalidated after RDMA operations?
> I think (though not with confidence) that CPU cache invalidation will
> eventually (with a practical upper bound) propagate to other CPUs.
> Compare this to the Alpha with split cache lines, where one of the buses could be
> busy and thus not update despite proper memory ordering on the write
> end. The implication is that each cache has a (fair?) queue and it
> cannot be arbitrarily long, though I've never seen a statement providing
> an upper bound on how long the cache bank could be busy.