<div dir="ltr">Does MPICH actually go out and acquire all of the shared locks immediately? This used to be deferred and it only acquired the locks on processes that you actually communicate with. Beyond that, MPICH should also piggyback the lock requests on the first RMA operation to a given target, when you give MPI_MODE_NOCHECK. So, if you use the interface in the way intended, there should not be any extra communication from a lock_all.<div><br></div><div> ~Jim.<br><div><div><br></div><div><br></div></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Feb 3, 2015 at 1:09 AM, Jeff Hammond <span dir="ltr"><<a href="mailto:jeff.science@gmail.com" target="_blank">jeff.science@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I know we discussed a collective lock_all function at some point. Why<br>
didn't we do it?<br>
<br>
In every single MPI-3 RMA program I have ever written, I do this:<br>
<br>
...<br>
MPI_Win_allocate(..,&win);<br>
MPI_Win_lock_all(0, win); /* signature is (assert, win) */<br>
...<br>
<br>
When everyone calls MPI_Win_lock_all, the traffic pattern is<br>
consistent with MPI_Alltoall. A collective invocation, on the other<br>
hand, would allow each process to set the LOCK_ALL state locally and<br>
merely achieve consensus on that effect, which means MPI_Allreduce<br>
traffic.<br>
<br>
In terms of the performance difference one will see with these<br>
implementations, here is some data for Blue Gene/Q on 48 racks (49152<br>
nodes with 16 ppn):<br>
<br>
MPI_Allreduce 1 integers in 0.000018 seconds<br>
MPI_Alltoall 786432 integers in 1.52 seconds <- the count here means<br>
one integer per process<br>
<br>
For those not keeping score at home, that's a performance difference<br>
of approximately 36314x. And this is quite generous to<br>
MPI_Win_lock_all: not only is it not collective, so MPI cannot<br>
possibly coalesce or otherwise optimize the packet storm, but instead<br>
of a simple data copy, an MPICH CH3-derived implementation is going<br>
to slam each target with nproc active-message requests, leading to an<br>
enormous pile-up at scale.<br>
<br>
Here is more data (approximate, due to lack of repetition in<br>
measurement, although the times do not include connection overhead,<br>
etc.):<br>
<br>
Cray XC30<br>
np=8192<br>
MPI_Allreduce 1 integers in 0.000033 seconds<br>
MPI_Alltoall 8192 integers in 0.0051 seconds<br>
<br>
np=8192 (1024x8)<br>
MPI_Allreduce 2 integers in 0.000015 seconds<br>
MPI_Alltoall 8192 integers in 0.0034 seconds<br>
<br>
In the limit of small np, the differences are less noticeable, as one<br>
might expect.<br>
<br>
Unless there's a compelling reason not to fix this, I'll propose<br>
MPI_Win_all_lock_all. I have no use case for collectively locking a<br>
subset of the window group, so I do not intend to propose that.<br>
<br>
Thanks,<br>
<br>
Jeff<br>
<span class="HOEnZb"><font color="#888888"><br>
--<br>
Jeff Hammond<br>
<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a><br>
<a href="http://jeffhammond.github.io/" target="_blank">http://jeffhammond.github.io/</a><br>
_______________________________________________<br>
mpiwg-rma mailing list<br>
<a href="mailto:mpiwg-rma@lists.mpi-forum.org">mpiwg-rma@lists.mpi-forum.org</a><br>
<a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma" target="_blank">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma</a><br>
</font></span></blockquote></div><br></div>