[Mpi3-rma] MPI-3 UNIFIED model clarification
Pavan Balaji
balaji at mcs.anl.gov
Tue Jul 30 19:49:20 CDT 2013
This is needed for UNIFIED. If the hardware doesn't support this, the
implementation can still provide SEPARATE.
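
To make that concrete, here is a minimal sketch (error handling
omitted, names illustrative) of how a program can ask which model a
given window actually provides: on hardware without the needed
coherence support, an implementation can simply report
MPI_WIN_SEPARATE here. A second sketch, of the disputed FLUSH + SEND
pattern itself, follows the quoted thread at the end of this message.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Win win;
        int buf = 0, *model, flag;

        MPI_Init(&argc, &argv);
        MPI_Win_create(&buf, sizeof(buf), sizeof(buf),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* MPI_WIN_MODEL is the predefined window attribute; *model is
         * either MPI_WIN_UNIFIED or MPI_WIN_SEPARATE. */
        MPI_Win_get_attr(win, MPI_WIN_MODEL, &model, &flag);
        if (flag)
            printf("window model: %s\n",
                   *model == MPI_WIN_UNIFIED ? "UNIFIED" : "SEPARATE");

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }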
On 07/30/2013 06:01 PM, Jeff Hammond wrote:
> I would really prefer not to let the very narrow scope of today's
> architectures drive fundamental decisions like the RMA memory model.
> Things are going to change...
>
> Jeff
>
> Sent from my iPhone
>
> On Jul 30, 2013, at 5:51 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
>>
>> While this might be true in the general case, I don't think it holds on any real network today. Before the DMA is permitted, the cache is flushed/invalidated.
>>
>> -- Pavan
>>
>> On 07/30/2013 05:40 PM, Jed Brown wrote:
>>> Pavan Balaji <balaji at mcs.anl.gov> writes:
>>>> This is the disagreement in the WG that I mentioned. I can pull up
>>>> the old email chain if needed, but I think others can too. One side
>>>> argued that there is no such guarantee and that you need a WIN_SYNC
>>>> to see the value. The other side argued that WIN_SYNC should not be
>>>> needed; FLUSH + SEND on the origin should be enough.
>>>
>>> Hmm, linux/Documentation/memory-barriers.txt says:
>>>
>>> CACHE COHERENCY VS DMA
>>> ----------------------
>>>
>>> Not all systems maintain cache coherency with respect to devices doing
>>> DMA. In such cases, a device attempting DMA may obtain stale data
>>> from RAM because dirty cache lines may be resident in the caches of
>>> various CPUs, and may not have been written back to RAM yet. To deal
>>> with this, the appropriate part of the kernel must flush the
>>> overlapping bits of cache on each CPU (and maybe invalidate them as
>>> well).
>>>
>>> In addition, the data DMA'd to RAM by a device may be overwritten by
>>> dirty cache lines being written back to RAM from a CPU's cache after
>>> the device has installed its own data, or cache lines present in the
>>> CPU's cache may simply obscure the fact that RAM has been updated,
>>> until at such time as the cacheline is discarded from the CPU's cache
>>> and reloaded. To deal with this, the appropriate part of the kernel
>>> must invalidate the overlapping bits of the cache on each CPU.
>>>
>>>
>>> I've taken this to mean that you can't guarantee that a DMA write will
>>> "eventually" be visible to the CPU (because the cache line could hang
>>> around arbitrarily long). Are implementations doing something here to
>>> ensure that cache on the target is invalidated after RDMA operations?
>>>
>>> I think (though not with confidence) that a CPU cache invalidation
>>> will eventually (with a practical upper bound) propagate to the
>>> other CPUs. Comparing to Alpha with its split caches, one of the
>>> cache banks could be busy and thus miss the update despite proper
>>> memory ordering on the write end. The implication is that each cache
>>> has a (fair?) queue that cannot be arbitrarily long, though I've
>>> never seen a statement providing an upper bound on how long a cache
>>> bank could stay busy.
>>
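
To make the disagreement quoted above concrete, here is a minimal
sketch of the contested pattern (passive-target synchronization, two
processes, names illustrative). The open question in the thread is
whether the target's MPI_Win_sync below is required before the local
read, or whether the origin's FLUSH + SEND is already enough under
UNIFIED:

    #include <mpi.h>
    #include <stdio.h>

    /* Run with two processes: rank 0 is the origin, rank 1 the target. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank exposes one int through the window. */
        MPI_Win_create(&value, sizeof(value), sizeof(value),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock_all(0, win);
        if (rank == 0) {
            int one = 1;
            /* Write to the target, complete the PUT remotely, then
             * notify the target with a zero-byte message. */
            MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Win_flush(1, win);
            MPI_Send(NULL, 0, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(NULL, 0, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* The disputed call: one side says this WIN_SYNC is needed
             * before reading; the other says FLUSH + SEND on the origin
             * already guarantees visibility under UNIFIED. */
            MPI_Win_sync(win);
            printf("value = %d\n", value);   /* expect 1 */
        }
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }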
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji