[mpiwg-rma] MPI_Win_all_lock_all

Wed Feb 4 09:31:50 CST 2015

It doesn't in the common case.  The only case in which locks are pre-acquired is when MPICH runs into resource exhaustion and doesn't have memory to store target-specific lock information.

MPI_MODE_NOCHECK is optimized regardless, so this is not a concern.

 -- Pavan


On Feb 4, 2015, at 8:50 AM, Jim Dinan <james.dinan at gmail.com> wrote:

Does MPICH actually go out and acquire all of the shared locks immediately?  This used to be deferred and it only acquired the locks on processes that you actually communicate with.  Beyond that, MPICH should also piggyback the lock requests on the first RMA operation to a given target, when you give MPI_MODE_NOCHECK.  So, if you use the interface in the way intended, there should not be any extra communication from a lock_all.
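The two optimizations described above (deferred acquisition, and piggybacking under MPI_MODE_NOCHECK) can be sketched with a toy origin-side model. This is not MPICH code: toy_win, toy_lock_all, toy_put, and MAX_TARGETS are hypothetical names, and the message accounting is an assumption based only on the description in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TARGETS 1024 /* hypothetical fixed capacity for this toy model */

/* Toy origin-side window state; an illustration, not MPICH's data structure. */
typedef struct {
    size_t ntargets;
    bool   touched[MAX_TARGETS]; /* targets we have already communicated with */
    size_t lock_msgs;            /* standalone lock requests sent             */
    size_t op_msgs;              /* RMA operation messages sent               */
    bool   nocheck;              /* models passing MPI_MODE_NOCHECK           */
} toy_win;

/* Deferred lock_all: records the epoch locally and sends nothing. */
void toy_lock_all(toy_win *w, size_t ntargets, bool nocheck) {
    assert(ntargets <= MAX_TARGETS);
    w->ntargets = ntargets;
    w->nocheck = nocheck;
    w->lock_msgs = 0;
    w->op_msgs = 0;
    for (size_t i = 0; i < ntargets; i++)
        w->touched[i] = false;
}

/* First operation to a target triggers lock acquisition for that target only.
   Assumption: without nocheck this costs one separate lock request; with
   nocheck the lock information rides along with the operation itself. */
void toy_put(toy_win *w, size_t target) {
    assert(target < w->ntargets);
    if (!w->touched[target]) {
        if (!w->nocheck)
            w->lock_msgs++;
        w->touched[target] = true;
    }
    w->op_msgs++;
}
```

In this model, a lock_all followed by operations to k distinct targets sends at most k lock requests, and none at all when nocheck is set, regardless of how many processes are in the window group.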

 ~Jim.

On Tue, Feb 3, 2015 at 1:09 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
I know we discussed a collective lock_all function at some point.  Why
didn't we do it?

In every single MPI-3 RMA program I have ever written, I do this:

...
MPI_Win_allocate(..,&win);
MPI_Win_lock_all(0,win);
...

When everyone calls MPI_Win_lock_all, the traffic pattern is
consistent with MPI_Alltoall.  On the other hand, a collective
invocation would allow for setting the LOCK_ALL state locally and
merely achieving consensus on this effect, which means MPI_Allreduce
traffic.
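To put rough numbers on the two traffic patterns, here is a minimal counting sketch. It assumes every process eagerly sends one lock request to every other process (the Alltoall-like worst case) and that the consensus step behaves like an idealized recursive-doubling allreduce; real implementations may do better or worse on both counts. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Point-to-point lock requests if each of p processes eagerly requests
   a shared lock from each of the other p - 1 processes. */
uint64_t eager_lock_all_requests(uint64_t p) {
    assert(p > 0);
    return p * (p - 1);
}

/* Communication rounds for an idealized recursive-doubling allreduce:
   the smallest r such that 2^r >= p. */
unsigned allreduce_rounds(uint64_t p) {
    assert(p > 0);
    unsigned rounds = 0;
    for (uint64_t span = 1; span < p; span <<= 1)
        rounds++;
    return rounds;
}
```

For p = 786432 this gives 618474504192 lock requests in total versus 20 allreduce rounds, which is the gap a collective call would be in a position to exploit.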

In terms of the performance difference one will see with these
implementations, here is some data for Blue Gene/Q on 48 racks (49152
nodes with 16 ppn):

MPI_Allreduce 1 integers in 0.000018 seconds
MPI_Alltoall 786432 integers in 1.52 seconds <- the count here means
one integer per process

For those not keeping score at home, that's a performance difference
of approximately 84,000x (1.52 s / 0.000018 s).  And this is quite generous to
MPI_Win_lock_all, because not only is it not collective and therefore
MPI cannot possibly coalesce or otherwise optimize the packet storm,
but instead of a simple data copy, an MPICH:Ch3-derived implementation
is going to slam the target with nproc active message requests,
leading to an enormous pile-up at scale.

Here is more data (approximate, due to lack of repetition in
measurement, although the times do not include connection overhead,
etc.):

Cray XC30
np=8192
MPI_Allreduce 1 integers in 0.000033 seconds
MPI_Alltoall 8192 integers in 0.0051 seconds

np=8192 (1024x8)
MPI_Allreduce 2 integers in 0.000015 seconds
MPI_Alltoall 8192 integers in 0.0034 seconds

In the limit of small np, the differences are less noticeable, as one
might expect.
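The ratios for each measurement pair can be recomputed directly from the timings quoted above. The sketch below hard-codes those timings; the struct and function names are illustrative only, and since the measurements were not repeated, the results are order-of-magnitude indicators at best.

```c
#include <assert.h>
#include <stdio.h>

/* One Allreduce/Alltoall timing pair, in seconds, as quoted in this message. */
typedef struct {
    const char *label;
    double allreduce_s;
    double alltoall_s;
} timing;

static const timing timings[] = {
    {"Blue Gene/Q, 786432 processes", 0.000018, 1.52},
    {"Cray XC30, np=8192",            0.000033, 0.0051},
    {"Cray XC30, np=8192 (1024x8)",   0.000015, 0.0034},
};

/* How many times slower the Alltoall was than the Allreduce. */
double slowdown(const timing *t) {
    assert(t->allreduce_s > 0.0);
    return t->alltoall_s / t->allreduce_s;
}

/* Print one line per measurement pair. */
void print_slowdowns(void) {
    for (size_t i = 0; i < sizeof timings / sizeof timings[0]; i++)
        printf("%-32s %10.0fx\n", timings[i].label, slowdown(&timings[i]));
}
```

Calling print_slowdowns() from main lists the three ratios side by side, which makes the dependence on process count easy to see.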

Unless there's a compelling reason not to fix this, I'll propose
MPI_Win_all_lock_all.  I have no use case for locking a subset
of the window group collectively, so I do not intend to propose that.

Thanks,

Jeff

--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
_______________________________________________
mpiwg-rma mailing list
mpiwg-rma at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-rma
