[mpiwg-rma] MPI_Win_all_lock_all

Balaji, Pavan balaji at anl.gov
Tue Feb 3 00:16:47 CST 2015


The argument was that we didn't want to add a collective function to passive-target RMA, and that Fence should be used instead.  Specifically, since an all-lock-all forces a shared lock (it doesn't make sense for an exclusive lock), one would pass MPI_MODE_NOCHECK, and the MPI implementation can optimize the lock functionality down to a no-op anyway.
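
For concreteness, the intended usage is roughly the following (a minimal sketch; "win" is assumed to be an already-created window):

MPI_Win_lock_all(MPI_MODE_NOCHECK, win);   /* shared lock on every target; the assert lets this be a no-op */
/* ... RMA operations, completed with MPI_Win_flush / MPI_Win_flush_all ... */
MPI_Win_unlock_all(win);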

However, if there were a proposal for a collective flush or unlock, I'd be supportive of that.  The reason is that I could then decide dynamically whether to do a single flush-all or a collective flush-all based on my program's semantics, instead of having to decide that at epoch-opening time.
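
A rough user-level sketch of the kind of run-time choice meant here (purely illustrative: "comm" is assumed to be the communicator the window was created over, "all_ranks_reach_this_point" is a placeholder for application knowledge, and the flush-all plus barrier pair is only a stand-in for a true collective flush, which does not exist in the standard today):

if (all_ranks_reach_this_point)          /* decided at run time from the program's own logic */
{
    MPI_Win_flush_all(win);              /* complete my outstanding RMA operations */
    MPI_Barrier(comm);                   /* the coordination step a real collective flush could optimize */
}
else
{
    MPI_Win_flush_all(win);              /* purely local decision, no coordination */
}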

 -- Pavan

Sent from my iPhone

> On Feb 3, 2015, at 11:39 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
> 
> I know we discussed a collective lock_all function at some point.  Why
> didn't we do it?
> 
> In every single MPI-3 RMA program I have ever written, I do this:
> 
> ...
> MPI_Win_allocate(..,&win);
> MPI_Win_lock_all(0, win);   /* lock_all takes an assert argument before the window */
> ...
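>
> Spelled out a bit more completely (a sketch of the usual lifecycle; "bytes",
> "comm", and "baseptr" are placeholders):
>
> MPI_Win win;
> void *baseptr;
> MPI_Win_allocate(bytes, sizeof(int), MPI_INFO_NULL, comm, &baseptr, &win);
> MPI_Win_lock_all(MPI_MODE_NOCHECK, win);   /* every rank opens the passive-target epoch */
> /* ... MPI_Put / MPI_Get / MPI_Accumulate, completed with MPI_Win_flush_all ... */
> MPI_Win_unlock_all(win);
> MPI_Win_free(&win);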
> 
> When everyone calls MPI_Win_lock_all, the traffic pattern is
> consistent with MPI_Alltoall.  On the other hand, a collective
> invocation would allow each rank to set the LOCK_ALL state locally and
> merely reach consensus that everyone has done so, which means
> MPI_Allreduce-class traffic.
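>
> A user-level approximation of that idea (a sketch only; "comm" is assumed to
> be the communicator the window was created over, and the barrier stands in
> for the Allreduce-class consensus a real collective lock could use):
>
> #include <mpi.h>
>
> static void collective_lock_all(MPI_Win win, MPI_Comm comm)
> {
>     /* set the LOCK_ALL state locally; with MPI_MODE_NOCHECK this can be (nearly) free */
>     MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
>     /* agree that every rank has done so: Barrier/Allreduce-class traffic, not Alltoall */
>     MPI_Barrier(comm);
> }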
> 
> In terms of the performance difference one will see with these
> implementations, here is some data for Blue Gene/Q on 48 racks (49152
> nodes with 16 ppn):
> 
> MPI_Allreduce 1 integers in 0.000018 seconds
> MPI_Alltoall 786432 integers in 1.52 seconds
>   (the count here means one integer per process)
> 
> For those not keeping score at home, that's a performance difference
> of approximately 36314x.  And this is quite generous to
> MPI_Win_lock_all: not only is it not collective, so MPI cannot
> possibly coalesce or otherwise optimize the packet storm, but instead
> of a simple data copy, an MPICH CH3-derived implementation is going to
> slam the target with nproc active-message requests, leading to an
> enormous pile-up at scale.
> 
> Here is more data (approximate, due to lack of repetition in the
> measurements; the times do not include connection overhead, etc.):
> 
> Cray XC30
> np=8192
> MPI_Allreduce 1 integers in 0.000033 seconds
> MPI_Alltoall 8192 integers in 0.0051 seconds
> 
> np=8192 (1024x8)
> MPI_Allreduce 2 integers in 0.000015 seconds
> MPI_Alltoall 8192 integers in 0.0034 seconds
> 
> In the limit of small np, the differences are less noticeable, as one
> might expect.
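>
> For reference, the comparison above can be reproduced with something along
> these lines (a minimal sketch of that kind of measurement, not necessarily
> the exact benchmark behind the numbers above; single iteration, no warm-up):
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     int np, me;
>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>
>     /* one int for the Allreduce, one int per process for the Alltoall */
>     int in = me, out = 0;
>     int *sbuf = malloc(np * sizeof(int));
>     int *rbuf = malloc(np * sizeof(int));
>     for (int i = 0; i < np; i++) sbuf[i] = me;
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     double t0 = MPI_Wtime();
>     MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
>     double t1 = MPI_Wtime();
>     MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT, MPI_COMM_WORLD);
>     double t2 = MPI_Wtime();
>
>     if (me == 0)
>         printf("MPI_Allreduce 1 int: %f s, MPI_Alltoall %d ints: %f s\n",
>                t1 - t0, np, t2 - t1);
>
>     free(sbuf); free(rbuf);
>     MPI_Finalize();
>     return 0;
> }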
> 
> Unless there's a compelling reason not to fix this, I'll propose
> MPI_Win_all_lock_all.  I have no use case for collectively locking a
> subset of the window group, so I do not intend to propose that.
> 
> Thanks,
> 
> Jeff
> 
> -- 
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/