[mpiwg-rma] MPI_Win_all_lock_all

Jeff Hammond jeff.science at gmail.com
Tue Feb 3 00:42:12 CST 2015


MPI_Win_fence is completely useless in this context since I cannot
call MPI_Win_flush, etc. within a FENCE epoch.
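
To make that concrete, here is a rough sketch (win already allocated;
buf and target are placeholders):

/* passive target: flush is permitted inside the lock_all epoch */
MPI_Win_lock_all(0, win);
MPI_Put(buf, 1, MPI_INT, target, 0, 1, MPI_INT, win);
MPI_Win_flush(target, win);   /* the Put is complete at the target here */
/* ... more one-sided communication ... */
MPI_Win_unlock_all(win);

/* active target: flush is not permitted between the fences */
MPI_Win_fence(0, win);
MPI_Put(buf, 1, MPI_INT, target, 0, 1, MPI_INT, win);
/* MPI_Win_flush(target, win);  <- erroneous inside a fence epoch */
MPI_Win_fence(0, win);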

So you are saying that MPI_MODE_NOCHECK turns MPI_Win_lock_all into a
local operation?  I'm fine with that if people agree it's a valid
optimization.
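
Concretely, I assume you mean something like the following (sketch; win
already allocated), where the lock acquisition can be a purely local
no-op:

/* everyone takes a shared lock on all ranks; NOCHECK asserts that no
   conflicting exclusive lock exists, so no handshake is required */
MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
/* ... one-sided communication, MPI_Win_flush / MPI_Win_flush_all ... */
MPI_Win_unlock_all(win);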

Jeff

On Mon, Feb 2, 2015 at 10:16 PM, Balaji, Pavan <balaji at anl.gov> wrote:
>
> The argument was that we don't want to add a collective function in passive-target RMA, and should instead use Fence.  Specifically, since all-lock-all forces a shared lock (it doesn't make sense for an exclusive lock), one would pass MPI_MODE_NOCHECK, and the MPI implementation can optimize the lock functionality to a no-op anyway.
>
> However, if there was a proposal for collective flush or unlock, I'd be supportive of that.  The reason being that I can decide dynamically if I want to do a single flush-all or a collective flush-all based on my program semantics, instead of having to decide that at epoch opening time.
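>
> For illustration only (these names are hypothetical, not in the standard), I mean prototypes along the lines of:
>
> /* hypothetical: collective over the window's group, completing all
>    outstanding operations at all targets */
> int MPIX_Win_flush_all_coll(MPI_Win win);
> int MPIX_Win_unlock_all_coll(MPI_Win win);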
>
>  -- Pavan
>
> Sent from my iPhone
>
>> On Feb 3, 2015, at 11:39 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
>>
>> I know we discussed a collective lock_all function at some point.  Why
>> didn't we do it?
>>
>> In every single MPI-3 RMA program I have ever written, I do this:
>>
>> ...
>> MPI_Win_allocate(..,&win);
>> MPI_Win_lock_all(0 /* assert */, win);
>> ...
>>
>> When everyone calls MPI_Win_lock_all, the traffic pattern is
>> consistent with MPI_Alltoall.  On the other hand, a collective
>> invocation would allow for setting the LOCK_ALL state locally and
>> merely achieving consensus on this effect, which means MPI_Allreduce
>> traffic.
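>>
>> As a sketch of what I mean (the MPIX_ name is hypothetical, and this
>> assumes NOCHECK really does make the lock acquisition local; comm is
>> the communicator the window was created over):
>>
>> int MPIX_Win_all_lock_all(MPI_Win win, MPI_Comm comm)
>> {
>>     /* set the LOCK_ALL state locally ... */
>>     int rc = MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
>>     /* ... and merely agree that everyone has done so: allreduce/barrier
>>        traffic instead of nproc lock requests per process */
>>     MPI_Barrier(comm);
>>     return rc;
>> }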
>>
>> In terms of the performance difference one will see with these
>> implementations, here is some data for Blue Gene/Q on 48 racks (49152
>> nodes with 16 ppn):
>>
>> MPI_Allreduce 1 integers in 0.000018 seconds
>> MPI_Alltoall 786432 integers in 1.52 seconds <- the count here means
>> one integer per process
>>
>> For those not keeping score at home, that's a performance difference
>> of approximately 36314x.  And this is quite generous to
>> MPI_Win_lock_all: not only is it not collective, so MPI cannot
>> possibly coalesce or otherwise optimize the packet storm, but instead
>> of a simple data copy, an MPICH ch3-derived implementation is going
>> to slam the target with nproc active-message requests, leading to an
>> enormous pile-up at scale.
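>>
>> In case it is not obvious what those numbers measure, a minimal
>> sketch of this sort of comparison (not the exact harness; no
>> repetition) looks like:
>>
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char **argv)
>> {
>>     MPI_Init(&argc, &argv);
>>     int np, me;
>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>>
>>     /* one allreduce of a single int */
>>     int x = 1, y;
>>     double t0 = MPI_Wtime();
>>     MPI_Allreduce(&x, &y, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
>>     double t_ar = MPI_Wtime() - t0;
>>
>>     /* one alltoall with one int per process */
>>     int *sbuf = calloc(np, sizeof(int));
>>     int *rbuf = calloc(np, sizeof(int));
>>     t0 = MPI_Wtime();
>>     MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD);
>>     double t_a2a = MPI_Wtime() - t0;
>>
>>     if (me == 0)
>>         printf("allreduce %e s, alltoall %e s\n", t_ar, t_a2a);
>>
>>     free(sbuf); free(rbuf);
>>     MPI_Finalize();
>>     return 0;
>> }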
>>
>> Here is more data (approximate, due to lack of repetition in the
>> measurement; the times do not include connection overhead, etc.):
>>
>> Cray XC30
>> np=8192
>> MPI_Allreduce 1 integers in 0.000033 seconds
>> MPI_Alltoall 8192 integers in 0.0051 seconds
>>
>> np=8192 (1024x8)
>> MPI_Allreduce 2 integers in 0.000015 seconds
>> MPI_Alltoall 8192 integers in 0.0034 seconds
>>
>> In the limit of small np, the differences are less noticeable, as one
>> might expect.
>>
>> Unless there's a compelling reason not to fix this, I'll propose
>> MPI_Win_all_lock_all.  I have no use case for collectively locking a
>> subset of the window group, so I do not intend to propose that.
>>
>> Thanks,
>>
>> Jeff
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/


