[Mpi3-rma] MPI 3 RMA Examples needed

Jeff Hammond jeff.science at gmail.com
Sat Feb 6 15:05:49 CST 2010


Sorry for the delay.  Maybe it is still Friday somewhere.

Many quantum chemistry algorithms use shared-memory semantics where
arrays are either (1) get/read-only or (2) acc/rmw-only.  I don't see
many calls to put.  Many collectives could be non-blocking, but I
don't know if the performance impact would be significant.
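
To make that last point concrete, here is a minimal C sketch of overlapping a
global reduction with local work.  It assumes an MPI_Iallreduce-style
interface (one of the nonblocking-collective proposals, as later standardized
in MPI-3); the buffer names and the "local work" step are placeholders, not
code from any of our applications:
=============================================================
/* Sketch only: overlap a global reduction with independent local work
 * using a nonblocking allreduce.  Buffer names are placeholders. */
#include <mpi.h>

void reduce_with_overlap(double *f_partial, double *f_global, int n,
                         MPI_Comm comm)
{
    MPI_Request req;

    /* Start reducing this process's partial contribution. */
    MPI_Iallreduce(f_partial, f_global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* ... independent local work (e.g. the next batch of integrals)
     * proceeds here while the reduction makes progress ... */

    /* Complete the reduction before f_global is used. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
=============================================================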

In the Fock build procedure of distributed-data SCF/DFT, the code
looks approximately like this [Comp. Phys. Comm. 143 (2002) 69–82]:
=============================================================
matrices D and F distributed over processors (partial replication
would be good but we can ignore this)
R > ~2000 // if not, then use replicated-data (local D, F) algorithm instead
N > ~400 // proportional to R
D(R,R) is read-only
F(R,R) is rmw-only

while not converged do
{
zero F
forall i in range(1:R) by chunks // chunks can be 1-14 for atomic shells or 20-100 for whole atoms at a time
forall j in range(1:R) by chunks
// can dynamic load balance here OR
forall k in range(1:R) by chunks
forall l in range(1:R) by chunks
// can dynamic load balance here
  get D(i,j), D(i,k), D(i,l), D(j,k), D(j,l), D(k,l) from global D
  compute V(i,j,k,l) locally
  compute F(i,j), F(i,k), F(i,l), F(j,k), F(j,l), F(k,l) from D chunks and V
  accumulate local F(i,j), F(i,k), F(i,l), F(j,k), F(j,l), F(k,l) to global F
end forall
end forall
end forall
end forall
fence
// do some dense linear algebra - using Matlab notation
[U,E] = eig(F) // columns of U are the eigenvectors of F
D = U(1:R,1:N)*transpose(U(1:R,1:N)) // density from the N occupied eigenvectors
fence
}
end
=============================================================
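
For what it's worth, here is a minimal C sketch of what one task of the
inner-loop body looks like in passive-target RMA terms.  Only one D patch and
one F patch are shown, the patches are assumed contiguous for brevity, and
win_D, win_F, owner, disp and blk are placeholders for the windows over the
distributed matrices and for the data distribution:
=============================================================
/* Sketch only: one get/compute/accumulate task against distributed
 * D (read-only) and F (accumulate-only).  Window names and the
 * owner/disp/blk arguments are placeholders for the real distribution. */
#include <mpi.h>

void fock_task(MPI_Win win_D, MPI_Win win_F,
               int owner, MPI_Aint disp, int blk,
               double *d_local, double *f_local)
{
    /* Fetch one patch of the read-only density matrix D. */
    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win_D);
    MPI_Get(d_local, blk, MPI_DOUBLE, owner, disp, blk, MPI_DOUBLE, win_D);
    MPI_Win_unlock(owner, win_D);   /* the get completes here */

    /* ... compute V and the local Fock contribution f_local ... */

    /* Add the contribution into the rmw-only global F. */
    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win_F);
    MPI_Accumulate(f_local, blk, MPI_DOUBLE, owner, disp, blk, MPI_DOUBLE,
                   MPI_SUM, win_F);
    MPI_Win_unlock(owner, win_F);
}
=============================================================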

It is not impossible to implement this code with one-sided.  For
comparison, MPQC uses only MPI two-sided calls and its SCF scales as
well as NWChem's on BlueGene/P, but (1) that is not an apples-to-apples
comparison because of many factors, such as the routines used to
compute V, and (2) BGP is biased in favor of an MPI+Pthreads
programming model (which is what MPQC uses) relative to GA/ARMCI
running a process per core.

With the amount of memory per node on most machines now, the
replicated-to-distributed cutoff moves steadily upwards, especially if
one uses shmem on the node.  Hence, if it were scalable, using
collectives would be fine.  However, dynamic load balancing (DLB) is a
huge issue due to the varying granularity of the work associated with
the calculation of V.

In summary, I am not certain that changes to the existing RMA
semantics are necessary to implement the aforementioned algorithm
efficiently in MPI-2, except for the need for RMW to implement DLB.
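
For reference, the DLB amounts to a fetch-and-increment on a shared task
counter.  Below is a minimal C sketch written against the kind of atomic
fetch-and-op operation being discussed for MPI-3 (in MPI-2 this has to be
emulated, as in the Using MPI-2 fetch-and-add example); rank 0 hosting the
counter and the window name are assumptions:
=============================================================
/* Sketch only: shared task counter for dynamic load balancing,
 * assuming an MPI-3-style atomic fetch-and-op.  Rank 0 hosts the
 * counter; win_counter is a window over a single long there. */
#include <mpi.h>

long next_task(MPI_Win win_counter)
{
    const long one = 1;
    long task;

    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win_counter);
    /* Atomically read the current task id and add one to it. */
    MPI_Fetch_and_op(&one, &task, MPI_LONG, /* rank = */ 0,
                     /* disp = */ 0, MPI_SUM, win_counter);
    MPI_Win_unlock(0, win_counter);

    return task;   /* the caller executes only the tasks it is handed */
}
=============================================================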

A second class of quantum chemistry methods is far more memory
intensive but has a more favorable task-size distribution.  The
formation of a coupled-cluster intermediate or residual looks
approximately like one of the following:

=============================================================

N > 200
R > 1000
J = 4D array of size O(N^4) which is antisymmetric in index pairs over N and R
T = 4D array of size O(N^2 * R^2) which is antisymmetric in index pairs over N and R
V = 4D array of size O(N^2 * R^2) which is antisymmetric in index pairs over N and R

// trying to compute J(i,j,k,l) += sum(a,b) T(a,b,k,l) * V(i,j,a,b)

forall i in range(1:R) by chunks of ~20
forall j in range(1:R) by chunks of ~20
if i<j
forall k in range(1:R) by chunks of ~20
forall l in range(1:R) by chunks of ~20
if k<l
forall a in range(1:R) by chunks of ~50
forall b in range(1:R) by chunks of ~50
if a<b
  call dynamic_load_balance
  get T(a,b,k,l) // this is a 20x20x50x50 patch
  get V(i,j,a,b) // this is a 20x20x50x50 patch
  dgemm T(a,b,k,l) and V(i,j,a,b) to form J(i,j,k,l)
  accumulate J(i,j,k,l) to global J // this is a 20x20x20x20 patch
end forall
end forall
end forall
end forall
end forall
end forall

=============================================================

N > 200
R > 1000
J = 2D array of size O(N * R)
T = 2D array of size O(N * R)
V = 4D array of size O(N^2 * R^2) which is antisymmetric in index pairs over N and R

// trying to compute J(j,a) += sum(b,i) T(b,i) * V(i,j,a,b)

broadcast T
forall j in range(1:R) by chunks of ~20
forall i in range(1:R) by chunks of ~20
if i<j
forall a in range(1:R) by chunks of ~50
forall b in range(1:R) by chunks of ~50
if a<b
  call dynamic_load_balance
  get V(i,j,a,b) // this is a 20x20x50x50 patch
  do J(j,a) += sum(b,i) T(b,i) * V(i,j,a,b) for local patches
end forall
end forall
end forall
end forall
end forall
end forall
accumulate-reduce J
broadcast J // this might be over a different communicator than the reduction

=============================================================

N > 200
R > 1000
J = 2D array of size O(N^2)
T = 4D array of size O(N^2 * R^2) which is antisymmetric in index pairs over N and R
V = 4D array of size O(N^2 * R^2) which is antisymmetric in index pairs over N and R

// trying to compute J(j,k) += sum(a,b,i) T(a,b,k,i) * V(i,j,a,b)

forall j in range(1:R) by chunks of ~20
forall k in range(1:R) by chunks of ~20
forall a in range(1:R) by chunks of ~50
forall b in range(1:R) by chunks of ~50
if a<b
forall i in range(1:R) by chunks of ~20
if k<i and i<j
  call dynamic_load_balance
  get T(a,b,k,i) // this is a 20x20x50x50 patch
  get V(i,j,a,b) // this is a 20x20x50x50 patch
  do J(j,k) += sum(a,b,i) T(a,b,k,i) * V(i,j,a,b) for local patches
end forall
end forall
end forall
end forall
end forall
end forall
accumulate-reduce J
broadcast J // this might be over a different communicator than the reduction

=============================================================
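
The three procedures above share the same task shape: get one or two
non-contiguous 4D patches, dgemm them, and accumulate the result.  Here is a
minimal C sketch of the patch get using a subarray datatype on the target
side; the window, owner and all of the shapes/offsets are placeholders for
the real data distribution:
=============================================================
/* Sketch only: fetch a non-contiguous 4D patch (e.g. a block of V)
 * into a contiguous local buffer.  All shapes, offsets and the window
 * are placeholders; the dgemm and accumulate steps are elided. */
#include <mpi.h>

void get_patch(MPI_Win win, int owner,
               int gsizes[4],   /* shape of the owner's local block      */
               int psizes[4],   /* patch shape, e.g. {20,20,50,50}       */
               int starts[4],   /* patch offset within the owner's block */
               double *buf)     /* contiguous local buffer               */
{
    MPI_Datatype patch;
    MPI_Type_create_subarray(4, gsizes, psizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &patch);
    MPI_Type_commit(&patch);

    int count = psizes[0] * psizes[1] * psizes[2] * psizes[3];

    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
    /* The target side uses the subarray type at displacement 0; the
     * origin receives the patch packed contiguously in buf. */
    MPI_Get(buf, count, MPI_DOUBLE, owner, 0, 1, patch, win);
    MPI_Win_unlock(owner, win);

    MPI_Type_free(&patch);
}
=============================================================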

I can say with certainty that passive progress is critical to the
performance of the above three procedures.  The local work inevitably
varies, so all the processes fall out of sync almost immediately.
Since the local work can take up to a minute in the worst case, if an
incoming remote get or acc is blocked until that call finishes, the
overhead is huge.  On BGP, lack of passive progress decreased
performance by 40-50% over the entire job.

I'm brushing much of the complexity of quantum many-body theory under
the rug, but if you look at the NWChem source, the above pseudo-code
is fairly well represented there.  Of course, there could be better
algorithms, but coding them up with all of the index-permutation
symmetry is a pain, hence the TCE project.

Best,

Jeff


On Sun, Jan 31, 2010 at 8:16 AM, William Gropp <wgropp at illinois.edu> wrote:
> Dear MPI RMA Group,
>
> We have several partial MPI RMA proposals.  To move forward, we need
> to have a better understanding of the real needs by users, and we will
> probably need to make some tough decisions about what we will support
> and what we won't (as Marc has noted, some fairly obvious shared
> memory operations are very tough in OpenMP, so it's clear that being
> universal isn't required).
>
> What we'd like by this *Friday* are some specific examples of
> operations that are hard to achieve in MPI RMA *and* that have a clear
> application need.  What we *don't* want is simply "we should have a
> better Put", "we need active messages", or "the implementations I've
> used are too slow".  What we do want is something like the following:
>
> We've implemented a halo exchange with MPI-RMA, and the construction
> of the Memory windows is awkward and limiting, particularly if the
> domains are created dynamically, making it hard to create the memory
> windows collectively.  We need either a method that lets us export a
> local window or a way to allow all processes to refer to one single
> window (something like the MPI_WIN_WORLD proposal).  Example code can
> be found at <url here>  (or post on wiki).
>
> or
>
> We need a fetch and increment (or something similar) to implement a
> remote lock that will allow us to make a complex series of remote
> updates (and accesses) atomically that are needed for <specific
> application description here>.  As shown in Using MPI-2, while a fetch
> and increment is possible in MPI-RMA, it is extremely complex and
> awkward.
>
> We'll take these examples and compare them to the current proposals
> and the original MPI RMA in order to evaluate where we are.
>
> Again, please send us your concrete requirements by Friday, Feb 5th.
> Thanks!
>
> Bill and Rajeev
>
> William Gropp
> Deputy Director for Research
> Institute for Advanced Computing Applications and Technologies
> Paul and Cynthia Saylor Professor of Computer Science
> University of Illinois Urbana-Champaign
>
>
>
>
> _______________________________________________
> mpi3-rma mailing list
> mpi3-rma at lists.mpi-forum.org
> http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond



