[Mpi3-rma] MPI-3 UNIFIED model clarification

Jeff Hammond jeff.science at gmail.com
Tue Jul 30 18:01:51 CDT 2013

I would really prefer not to let the very narrow scope of today's
architectures influence fundamental decisions like the RMA memory
model.  Things are going to change...


On Jul 30, 2013, at 5:51 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:

> While this might be true in the general case, I don't think this is true on any real network today.  Before the DMA is permitted, the cache is flushed/invalidated.
> -- Pavan
> On 07/30/2013 05:40 PM, Jed Brown wrote:
>> Pavan Balaji <balaji at mcs.anl.gov> writes:
>>> This is what I said is the disagreement in the WG.  I can pull up the
>>> old email chain if needed, but I think others can too.  One side was
>>> arguing that there's no such guarantee and you need to do a WIN_SYNC to
>>> see the value.  The other side was arguing that the WIN_SYNC should not
>>> be needed; FLUSH + SEND on the origin should be enough.
>> Hmm, linux/Documentation/memory-barriers.txt says:
>>   ----------------------
>>   Not all systems maintain cache coherency with respect to devices doing
>>   DMA.  In such cases, a device attempting DMA may obtain stale data
>>   from RAM because dirty cache lines may be resident in the caches of
>>   various CPUs, and may not have been written back to RAM yet.  To deal
>>   with this, the appropriate part of the kernel must flush the
>>   overlapping bits of cache on each CPU (and maybe invalidate them as
>>   well).
>>   In addition, the data DMA'd to RAM by a device may be overwritten by
>>   dirty cache lines being written back to RAM from a CPU's cache after
>>   the device has installed its own data, or cache lines present in the
>>   CPU's cache may simply obscure the fact that RAM has been updated,
>>   until at such time as the cacheline is discarded from the CPU's cache
>>   and reloaded.  To deal with this, the appropriate part of the kernel
>>   must invalidate the overlapping bits of the cache on each CPU.
>> I've taken this to mean that you can't guarantee that a DMA write will
>> "eventually" be visible to the CPU (because the cache line could hang
>> around arbitrarily long).  Are implementations doing something here to
>> ensure that cache on the target is invalidated after RDMA operations?
>> I think (though not with confidence) that CPU cache invalidation will
>> eventually (with a practical upper bound) propagate to other CPUs.
>> Comparing to Alpha with split cache lines, one of the buses could be
>> busy and thus not update despite proper memory ordering on the write
>> end.  The implication is that each cache has a (fair?) queue and it
>> cannot be arbitrarily long, though I've never seen a statement providing
>> an upper bound on how long the cache bank could be busy.
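
The three hazards in the memory-barriers.txt excerpt quoted above can be sketched with a toy model. This is only an illustration under stated assumptions: a single cache line in front of a small RAM region, a device that touches RAM directly, and no hardware coherency between the two. Every name here (toy_mem, cpu_load, dma_write, cache_invalidate, and so on) is hypothetical and does not correspond to any kernel, NIC, or MPI implementation interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_WORDS 4  /* words per cache line in this toy model */

/* Hypothetical model: one RAM region with a one-line CPU cache in front. */
typedef struct {
    int  ram[LINE_WORDS];
    int  cache[LINE_WORDS];
    bool valid;  /* cache currently holds a copy of the line */
    bool dirty;  /* cached copy was modified and not yet written back */
} toy_mem;

static void toy_init(toy_mem *m) { memset(m, 0, sizeof *m); }

/* Load the line from RAM only if the cache does not already hold it. */
static void cache_fill(toy_mem *m) {
    if (!m->valid) {
        memcpy(m->cache, m->ram, sizeof m->ram);
        m->valid = true;
        m->dirty = false;
    }
}

/* CPU accesses always go through the cache. */
static int cpu_load(toy_mem *m, int i) {
    assert(i >= 0 && i < LINE_WORDS);
    cache_fill(m);
    return m->cache[i];
}

static void cpu_store(toy_mem *m, int i, int v) {
    assert(i >= 0 && i < LINE_WORDS);
    cache_fill(m);
    m->cache[i] = v;
    m->dirty = true;
}

/* Write a dirty line back to RAM; the line stays cached. */
static void cache_flush(toy_mem *m) {
    if (m->valid && m->dirty) {
        memcpy(m->ram, m->cache, sizeof m->ram);
        m->dirty = false;
    }
}

/* Drop the cached copy so the next CPU load refills from RAM. */
static void cache_invalidate(toy_mem *m) {
    m->valid = false;
    m->dirty = false;
}

/* Device accesses bypass the cache entirely (the non-coherent assumption). */
static void dma_write(toy_mem *m, int i, int v) {
    assert(i >= 0 && i < LINE_WORDS);
    m->ram[i] = v;
}

static int dma_read(const toy_mem *m, int i) {
    assert(i >= 0 && i < LINE_WORDS);
    return m->ram[i];
}

/* Hazard 1: the device reads RAM while the new value sits in a dirty line. */
int device_read_after_cpu_store(bool flush_first) {
    toy_mem m;
    toy_init(&m);
    cpu_store(&m, 0, 7);
    if (flush_first) cache_flush(&m);
    return dma_read(&m, 0);
}

/* Hazard 2: a cached line obscures the fact that RAM has been updated. */
int cpu_read_after_dma_write(bool invalidate_first) {
    toy_mem m;
    toy_init(&m);
    (void)cpu_load(&m, 0);  /* pull the old value (0) into the cache */
    dma_write(&m, 0, 42);
    if (invalidate_first) cache_invalidate(&m);
    return cpu_load(&m, 0);
}

/* Hazard 3: a late write-back of a dirty line overwrites the DMA'd data. */
int ram_after_dma_then_writeback(void) {
    toy_mem m;
    toy_init(&m);
    cpu_store(&m, 0, 7);
    dma_write(&m, 0, 42);
    cache_flush(&m);
    return dma_read(&m, 0);
}
```

In this model, cpu_read_after_dma_write(false) keeps returning the old value for as long as the line stays cached, which is the "could hang around arbitrarily long" concern. The explicit cache_invalidate step is loosely analogous to what one side of the WG discussion expects WIN_SYNC to provide. Whether real networks already do the equivalent before permitting the DMA is exactly the point under debate, and the model takes no position on it.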
