[mpiwg-rma] MPI_Win_all_lock_all

Tue Feb 3 00:09:23 CST 2015

I know we discussed a collective lock_all function at some point.  Why
didn't we do it?

In every single MPI-3 RMA program I have ever written, I do this:

...
MPI_Win_allocate(..,&win);
MPI_Win_lock_all(win);
...

When everyone calls MPI_Win_lock_all, the traffic pattern is
consistent with MPI_Alltoall.  On the other hand, a collective
invocation would allow for setting the LOCK_ALL state locally and
merely achieving consensus on this effect, which means MPI_Allreduce
traffic.

In terms of the performance difference one will see with these
implementations, here is some data for Blue Gene/Q on 48 racks (49152
nodes with 16 ppn):

MPI_Allreduce 1 integers in 0.000018 seconds
MPI_Alltoall 786432 integers in 1.52 seconds <- the count here means
one integer per process

For those not keeping score at home, that's a performance difference
of approximately 84,000x.  And this is quite generous to
MPI_Win_lock_all, because not only is it not collective and therefore
MPI cannot possibly coalesce or otherwise optimize the packet storm,
but instead of a simple data copy, an MPICH:Ch3-derived implementation
is going to slam the target with nproc active message requests,
leading to an enormous pile-up at scale.

Here is more data (approximate, due to lack of repetition in
measurement, although the times do not include connection overhead,
etc.):

Cray XC30
np=8192
MPI_Allreduce 1 integers in 0.000033 seconds
MPI_Alltoall 8192 integers in 0.0051 seconds

np=8192 (1024x8)
MPI_Allreduce 2 integers in 0.000015 seconds
MPI_Alltoall 8192 integers in 0.0034 seconds

In the limit of small np, the differences are less noticeable, as one
might expect.

Unless there's a compelling reason not to fix this, I'll propose
MPI_Win_all_lock_all.  I have no use case for locking a subset
of the window group collectively, so I do not intend to propose this.

Thanks,

Jeff

-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/