[mpiwg-rma] MPI_Win_all_lock_all
Underwood, Keith D
keith.d.underwood at intel.com
Tue Feb 3 00:44:12 CST 2015
The expectation was that people who really wanted to do that would set MPI_MODE_NOCHECK and SHARED for the lock and handle the locking themselves in between calls to MPI_WIN_FLUSH and MPI_WIN_SYNC. This turns MPI_WIN_LOCK_ALL() into a kind of syntactically required, start-of-program no-op. As such, further optimization of it was not performed.
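
Roughly, that pattern looks like the following minimal sketch (the window size, the RMA operations, and the final barrier are illustrative only):

#include <mpi.h>

void example(MPI_Comm comm)
{
    MPI_Win win;
    int *base;

    /* one-integer window per process, purely for illustration */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, comm,
                     &base, &win);

    /* the "syntactically required" epoch: NOCHECK asserts that no
       conflicting (exclusive) lock exists, so no lock protocol is needed */
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    /* ... MPI_Put / MPI_Get / MPI_Accumulate on win ... */

    MPI_Win_flush_all(win);  /* complete outstanding RMA at the origin */
    MPI_Win_sync(win);       /* reconcile public and private window copies */
    MPI_Barrier(comm);       /* the application's own synchronization */

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
}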
> -----Original Message-----
> From: mpiwg-rma [mailto:mpiwg-rma-bounces at lists.mpi-forum.org] On
> Behalf Of Jeff Hammond
> Sent: Tuesday, February 03, 2015 1:09 AM
> To: MPI WG Remote Memory Access working group
> Subject: [mpiwg-rma] MPI_Win_all_lock_all
>
> I know we discussed a collective lock_all function at some point. Why didn't
> we do it?
>
> In every single MPI-3 RMA program I have ever written, I do this:
>
> ...
> MPI_Win_allocate(..,&win);
> MPI_Win_lock_all(0, win);
> ...
>
> When everyone calls MPI_Win_lock_all, the traffic pattern is consistent with
> MPI_Alltoall. On the other hand, a collective invocation would allow each
> process to set the LOCK_ALL state locally and merely reach consensus that
> it has taken effect, which means MPI_Allreduce-like traffic.
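>
> As a rough sketch (hypothetical name MPIX_Win_all_lock_all; assumes no
> process ever takes an exclusive lock on the window), the collective could
> amount to setting the local state plus a single consensus step:
>
> int MPIX_Win_all_lock_all(MPI_Comm comm, MPI_Win win)
> {
>     /* set the LOCK_ALL state locally; NOCHECK avoids any lock protocol */
>     int rc = MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
>     if (rc != MPI_SUCCESS) return rc;
>     /* agree that every process has entered the epoch: tree-based traffic
>        like MPI_Allreduce, rather than the O(p^2) of MPI_Alltoall */
>     return MPI_Barrier(comm);
> }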
>
> In terms of the performance difference one will see with these
> implementations, here is some data for Blue Gene/Q on 48 racks (49152
> nodes with 16 ppn):
>
> MPI_Allreduce 1 integers in 0.000018 seconds
> MPI_Alltoall 786432 integers in 1.52 seconds  <- the count here means one
> integer per process
>
> For those not keeping score at home, that's a performance difference of
> approximately 36314x. And this is quite generous to MPI_Win_lock_all:
> not only is it not collective, so MPI cannot possibly coalesce or
> otherwise optimize the packet storm, but instead of a simple data copy,
> an MPICH CH3-derived implementation is going to slam each target with
> nproc active-message requests, leading to an enormous pile-up at scale.
>
> Here is more data (approximate, due to lack of repetition in measurement,
> although the times do not include connection overhead,
> etc.):
>
> Cray XC30
> np=8192
> MPI_Allreduce 1 integers in 0.000033 seconds
> MPI_Alltoall 8192 integers in 0.0051 seconds
>
> np=8192 (1024x8)
> MPI_Allreduce 2 integers in 0.000015 seconds
> MPI_Alltoall 8192 integers in 0.0034 seconds
>
> In the limit of small np, the differences are less noticeable, as one might
> expect.
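>
> A single-shot comparison of this sort (sketch only; one MPI_INT per
> process in both cases, no repetition) is enough to see the gap:
>
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     int np, rank, x = 1, y;
>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     int *in  = calloc(np, sizeof(int));   /* one int per process */
>     int *out = calloc(np, sizeof(int));
>
>     double t0 = MPI_Wtime();
>     MPI_Allreduce(&x, &y, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
>     double t_allreduce = MPI_Wtime() - t0;
>
>     t0 = MPI_Wtime();
>     MPI_Alltoall(in, 1, MPI_INT, out, 1, MPI_INT, MPI_COMM_WORLD);
>     double t_alltoall = MPI_Wtime() - t0;  /* grows with np */
>
>     if (rank == 0)
>         printf("allreduce %g s, alltoall %g s\n", t_allreduce, t_alltoall);
>
>     free(in); free(out);
>     MPI_Finalize();
>     return 0;
> }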
>
> Unless there's a compelling reason not to fix this, I'll propose
> MPI_Win_all_lock_all. I have no use case for locking a subset of the window
> group collectively, so I do not intend to propose that.
>
> Thanks,
>
> Jeff
>
> --
> Jeff Hammond
> jeff.science at gmail.com
> http://jeffhammond.github.io/