[Mpi3-rma] MPI-3 UNIFIED model clarification

Pavan Balaji balaji at mcs.anl.gov
Tue Jul 30 19:49:20 CDT 2013

This guarantee is needed for UNIFIED.  If the hardware doesn't provide
it, the implementation can still support SEPARATE.
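For reference, here is a minimal sketch of the pattern the WG disagreed about (quoted further down in this thread).  Only the put/flush/send/sync shape comes from the discussion; the window setup, tags, and variable names are illustrative, and the program assumes exactly two ranks.

```c
/* Sketch of the debated UNIFIED-model pattern.  Run with: mpiexec -n 2.
 * Window layout, tags, and names are illustrative, not from the thread. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    if (rank == 0) {
        int one = 1;
        /* Origin: remote write, then flush to complete it at the
         * target, then notify the target with a zero-byte send. */
        MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_flush(1, win);
        MPI_Send(NULL, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        /* The point of contention: one side argued this WIN_SYNC is
         * required before the load of `value` below; the other argued
         * FLUSH + SEND on the origin is already enough. */
        MPI_Win_sync(win);
        printf("value = %d\n", value);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

With WIN_SYNC present the program is correct under either reading of the standard; the question is only whether it may be omitted in the UNIFIED model.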

On 07/30/2013 06:01 PM, Jeff Hammond wrote:
> I would really not like to use the very narrow scope of today's
> architectures to influence fundamental decisions like the RMA memory
> model.  Things are going to change...
> Jeff
> Sent from my iPhone
> On Jul 30, 2013, at 5:51 PM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>> While this might be true in the general case, I don't think this is true on any real network today.  Before the DMA is permitted, the cache is flushed/invalidated.
>> -- Pavan
>> On 07/30/2013 05:40 PM, Jed Brown wrote:
>>> Pavan Balaji <balaji at mcs.anl.gov> writes:
>>>> This is what I said is the disagreement in the WG.  I can pull up the
>>>> old email chain if needed, but I think others can too.  One side was
>>>> arguing that there's no such guarantee and you need to do a WIN_SYNC to
>>>> see the value.  The other side was arguing that the WIN_SYNC should not
>>>> be needed; FLUSH + SEND on the origin should be enough.
>>> Hmm, linux/Documentation/memory-barriers.txt says:
>>>    ----------------------
>>>    Not all systems maintain cache coherency with respect to devices doing
>>>    DMA.  In such cases, a device attempting DMA may obtain stale data
>>>    from RAM because dirty cache lines may be resident in the caches of
>>>    various CPUs, and may not have been written back to RAM yet.  To deal
>>>    with this, the appropriate part of the kernel must flush the
>>>    overlapping bits of cache on each CPU (and maybe invalidate them as
>>>    well).
>>>    In addition, the data DMA'd to RAM by a device may be overwritten by
>>>    dirty cache lines being written back to RAM from a CPU's cache after
>>>    the device has installed its own data, or cache lines present in the
>>>    CPU's cache may simply obscure the fact that RAM has been updated,
>>>    until at such time as the cacheline is discarded from the CPU's cache
>>>    and reloaded.  To deal with this, the appropriate part of the kernel
>>>    must invalidate the overlapping bits of the cache on each CPU.
>>> I've taken this to mean that you can't guarantee that a DMA write will
>>> "eventually" be visible to the CPU (because the cache line could hang
>>> around arbitrarily long).  Are implementations doing something here to
>>> ensure that cache on the target is invalidated after RDMA operations?
>>> I think (though not with confidence) that CPU cache invalidation will
>>> eventually (with a practical upper bound) propagate to other CPUs.
>>> Comparing to Alpha with split cache lines, one of the buses could be
>>> busy and thus not update despite proper memory ordering on the write
>>> end.  The implication is that each cache has a (fair?) queue and it
>>> cannot be arbitrarily long, though I've never seen a statement providing
>>> an upper bound on how long the cache bank could be busy.
>> --
>> Pavan Balaji
>> http://www.mcs.anl.gov/~balaji
>> _______________________________________________
>> mpi3-rma mailing list
>> mpi3-rma at lists.mpi-forum.org
>> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma

Pavan Balaji
