[Mpi3-rma] MPI-3 UNIFIED model clarification

Tue Jul 30 17:47:33 CDT 2013

While this might be true in the general case, I don't think it holds 
on any real network today: the cache is flushed/invalidated before the 
DMA is permitted.

  -- Pavan

On 07/30/2013 05:40 PM, Jed Brown wrote:
> Pavan Balaji <balaji at mcs.anl.gov> writes:
>> This is what I said is the disagreement in the WG.  I can pull up the
>> old email chain if needed, but I think others can too.  One side was
>> arguing that there's no such guarantee and you need to do a WIN_SYNC to
>> see the value.  The other side was arguing that the WIN_SYNC should not
>> be needed; FLUSH + SEND on the origin should be enough.
>
> Hmm, linux/Documentation/memory-barriers.txt says:
>
>    CACHE COHERENCY VS DMA
>    ----------------------
>
>    Not all systems maintain cache coherency with respect to devices doing
>    DMA.  In such cases, a device attempting DMA may obtain stale data
>    from RAM because dirty cache lines may be resident in the caches of
>    various CPUs, and may not have been written back to RAM yet.  To deal
>    with this, the appropriate part of the kernel must flush the
>    overlapping bits of cache on each CPU (and maybe invalidate them as
>    well).
>
>    In addition, the data DMA'd to RAM by a device may be overwritten by
>    dirty cache lines being written back to RAM from a CPU's cache after
>    the device has installed its own data, or cache lines present in the
>    CPU's cache may simply obscure the fact that RAM has been updated,
>    until at such time as the cacheline is discarded from the CPU's cache
>    and reloaded.  To deal with this, the appropriate part of the kernel
>    must invalidate the overlapping bits of the cache on each CPU.
>
>
> I've taken this to mean that you can't guarantee that a DMA write will
> "eventually" be visible to the CPU (because the cache line could hang
> around arbitrarily long).  Are implementations doing something here to
> ensure that cache on the target is invalidated after RDMA operations?
>
> I think (though not with confidence) that CPU cache invalidation will
> eventually (with a practical upper bound) propagate to other CPUs.
> Compare Alpha with its split data cache: one of the cache banks could
> be busy and thus not update despite proper memory ordering on the write
> end.  The implication is that each cache has a (fair?) queue and it
> cannot be arbitrarily long, though I've never seen a statement providing
> an upper bound on how long the cache bank could be busy.
>
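
To make the two hazards in the memory-barriers.txt excerpt above concrete,
here is a toy model in C.  It is only a sketch under loud assumptions: one
word of RAM, one CPU with a single write-back cache line, and a device that
touches RAM only (no coherency with the cache).  ToyNode and every function
name below are made up for illustration; none of this is an MPI or kernel
API, and it says nothing about what any real NIC or chipset does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model: one word of RAM plus one write-back cache line. */
typedef struct {
    int  ram;     /* value in main memory                 */
    int  cached;  /* value held in the CPU's cache line   */
    bool valid;   /* cache line currently holds a copy    */
    bool dirty;   /* cached copy is newer than RAM        */
} ToyNode;

/* CPU load: served from the cache if the line is valid, else filled from RAM. */
int cpu_load(ToyNode *n) {
    if (!n->valid) {
        n->cached = n->ram;
        n->valid = true;
        n->dirty = false;
    }
    return n->cached;
}

/* CPU store: lands in the cache only; RAM is updated on a later flush. */
void cpu_store(ToyNode *n, int v) {
    n->cached = v;
    n->valid = true;
    n->dirty = true;
}

/* Non-coherent device access (assumption): the device sees RAM only. */
void dma_write(ToyNode *n, int v) { n->ram = v; }
int  dma_read(const ToyNode *n)   { return n->ram; }

/* Flush: write a dirty line back to RAM, keeping the line valid. */
void cache_flush(ToyNode *n) {
    if (n->valid && n->dirty) {
        n->ram = n->cached;
        n->dirty = false;
    }
}

/* Invalidate: drop the line without writing it back. */
void cache_invalidate(ToyNode *n) {
    n->valid = false;
    n->dirty = false;
}

/* Walks through both hazards from the quoted text inside the toy model. */
void toy_demo(void) {
    ToyNode n = {0, 0, false, false};

    /* Hazard 1: device reads stale RAM while the new value sits dirty in cache. */
    cpu_store(&n, 1);
    assert(dma_read(&n) == 0);   /* stale */
    cache_flush(&n);
    assert(dma_read(&n) == 1);   /* visible after write-back */

    /* Hazard 2: a valid cache line obscures data the device wrote to RAM. */
    dma_write(&n, 2);
    assert(cpu_load(&n) == 1);   /* stale until the line is dropped */
    cache_invalidate(&n);
    assert(cpu_load(&n) == 2);   /* reloaded from RAM */
}
```

In this model the stale read after dma_write() persists until something
invalidates the line, which is the "could hang around arbitrarily long"
reading.  A platform that flushes/invalidates around every DMA would
correspond to calling cache_flush()/cache_invalidate() at those points.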

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji