[Mpi3-rma] RMA proposal 1 update
Michael Blocksome
blocksom at us.ibm.com
Mon May 24 13:24:40 CDT 2010
I am one of the DCMF developers who wrote the low-level implementation
for the various one-sided operations on BG/P, and I've also done some work
implementing ARMCI over DCMF on BG/P. I have some general comments,
but can't really find an appropriate email to 'reply to' ... so I'll just
dive in here.
I think the discussion of the three possible implementations below misses
the mark a bit. Any low-level messaging layer (DCMF, LAPI, even "MPI")
that can optimize the allreduce and barrier operations could implement the
allfenceall operation as described below. Even though DCMF - which is what
MPICH2 on BG/P is based on - uses the BG/P collective network, that network
still provides *only* a fast allreduce: faster than the pt2pt allreduce
algorithms, but not very interesting when considering how an allfenceall
would be implemented.
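To make that concrete, the generic construction is something like the
sketch below, where flush_target() stands in for whatever per-target
remote-completion primitive the runtime provides (none of these names are
real DCMF or MPI calls):

  #include <mpi.h>

  /* Placeholder for the runtime's per-target remote-completion call. */
  extern void flush_target(int target);

  /* Generic allfenceall: remotely complete everything this rank has
   * outstanding, then use the (possibly hardware-accelerated) barrier
   * to agree that every rank has done the same. */
  void generic_allfenceall(MPI_Comm comm)
  {
      int nproc;
      MPI_Comm_size(comm, &nproc);
      for (int target = 0; target < nproc; target++)
          flush_target(target);  /* software accounting, per target */
      MPI_Barrier(comm);         /* the fast collective only helps here */
  }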
If we look at any exascale goal(s), as Jeff mentioned earlier, we'll see
that any time an application or middleware needs to track counters, we
won't get the scaling required (assuming the allfenceall is flexible and
not restricted to only "comm_world" operations). What is needed is an
interface that allows low-level software to take advantage of any hardware
or network features that would enable a "counter-less" allfenceall
implementation. On BG/P, this would probably take advantage of features
such as the torus dimensions of the network, the (configurable)
deterministic packet routing, the "deposit bit" in the packet header, etc.
I'm not saying I have any specific design in mind, just that experience
tells me we should be able to do something cool in this space, supported
by the BG hardware, that would eliminate the need for software accounting.
If there were an allfenceall interface, the low-level messaging layer could
do whatever it takes to get performance and scaling. Without the
allfenceall interface, middleware is forced to use one of these
platform-independent algorithms (allreduce, barrier, etc.) and accept their
inherent scaling and performance restrictions.
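To be clear about what I mean by "an interface", something as simple as the
following would do - the name MPIX_Win_allfenceall is made up purely for
illustration and is not part of any proposal text:

  #include <mpi.h>

  /* Hypothetical collective, illustration only: on return, all RMA
   * operations issued by all processes on 'win' before the call are
   * complete at their targets.  How that is achieved (counters, torus
   * routes, deposit-bit tricks, ...) is entirely up to the low-level
   * messaging layer. */
  int MPIX_Win_allfenceall(MPI_Win win);

Middleware such as ARMCI or GA could then map its collective completion
calls onto this one routine and leave the clever, hardware-specific part to
the messaging layer.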
Michael Blocksome
Blue Gene Messaging
Advanced Systems Software Development
blocksom at us.ibm.com
From: Jeff Hammond <jeff.science at gmail.com>
To: "MPI 3.0 Remote Memory Access working group"
    <mpi3-rma at lists.mpi-forum.org>
Date: 05/21/2010 05:13 PM
Subject: Re: [Mpi3-rma] RMA proposal 1 update
>> I've been talking about allflushall because ARMCI_Barrier is partially
>> redundant with MPI_Barrier, which is clearly already in MPI-3.
>
> Intuitively named as ARMCI_Barrier - imagine why I missed that ;-)
> So, this offers some opportunity for experimentation... ARMCI_Barrier
> should be what GA_Sync calls, right? How is the implementation of
> ARMCI_Barrier done on BG? It would seem that we could try 3
> implementations in some real world usage scenarios and get some data:
Yes, and not to be confused with armci_msg_barrier
(<http://www.emsl.pnl.gov/docs/parsoft/armci/documentation.htm#collect>)
:(
> 1) ARMCI_Barrier the "really bad" way: i.e. np^2 messages get generated
> and then a barrier is done
> 2) ARMCI_Barrier as a collective: allreduce is used to agree about
> completion
> 3) ARMCI_Barrier assuming a "small" number of targets between
> barriers/fences: you track who has outstanding messages and only sync
> with them.
Yeah, I'm going to work on this as soon as I have some free time. I
have (1) and (3) already but I need to test them outside of NWChem to
have any hope of making sense of the data. I think I can at least
hack (2). I think I can implement a fourth option that beats (2) but
it has the same complexity model (logarithmic).
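For reference, a bare-bones sketch of (2) - here sent_to[] and received are
stand-ins for counters the runtime would have to maintain, not real ARMCI
or DCMF state:

  #include <mpi.h>
  #include <stdlib.h>

  /* Option (2): an allreduce over per-destination message counts tells
   * every rank how many one-sided messages it should expect; each rank
   * then waits until its handler has absorbed that many.  Note that
   * expected[] is O(nproc) per rank, so the accounting grows with scale. */
  void fence_via_allreduce(MPI_Comm comm,
                           long *sent_to,        /* [nproc] sends per target */
                           const volatile long *received) /* msgs absorbed */
  {
      int rank, nproc;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &nproc);

      long *expected = malloc(nproc * sizeof *expected);
      MPI_Allreduce(sent_to, expected, nproc, MPI_LONG, MPI_SUM, comm);

      while (*received < expected[rank])
          ;  /* a real runtime would advance its progress engine here */

      free(expected);
  }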
> Anyway, just to clarify, GA regularly uses both one-sided and collective
> completion calls? Or, is it dominated by one or the other? I look at
> GA_Fence() and see the equivalent of flushall() and GA_Sync = GA_Fence() +
> MPI_Barrier(). If you call both, then you have this mixture of passive
> and active target, but... if GA_Sync is going to perform significantly
> better than GA_Fence(), couldn't you just switch to calling GA_Sync? It
> would seem like users would rather have the barrier too (especially if it
> was cheaper than calling GA_Fence()). Or, put another way, the online
> guide for GA essentially says "um, don't call GA_Sync() very often", so is
> allfenceall optimizing the infrequent case?
The GA manual can say that calling GA_Sync will bring on the
apocalypse, but that won't stop quantum chemistry software developers
from calling it at the top and bottom of every subroutine. =O
While there are too many GA_Sync calls in NWChem, removing the majority of
them requires a complete rewrite of very complex algorithms and probably
redesigning the entire code from top to bottom.
I have never seen GA_Fence used. All ~remote~ completion in NWChem is
collective. All three modes of local completion - trivial (blocking),
individual (request-based), and bulk (fenced target) - are used, either
explicitly via GA or implicitly within GA, invisible to the GA user.
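In ARMCI terms, the three local-completion modes map roughly onto the calls
below (buffer setup omitted; src, dst, bytes, and proc are placeholders):

  #include "armci.h"

  void local_completion_modes(void *src, void *dst, int bytes, int proc)
  {
      /* trivial (blocking): locally complete when the call returns */
      ARMCI_Put(src, dst, bytes, proc);

      /* individual (request-based): locally complete after ARMCI_Wait */
      armci_hdl_t hdl;
      ARMCI_INIT_HANDLE(&hdl);
      ARMCI_NbPut(src, dst, bytes, proc, &hdl);
      ARMCI_Wait(&hdl);

      /* bulk (fenced target): ARMCI_Fence blocks until everything
       * outstanding to 'proc' is complete (remotely, hence locally) */
      ARMCI_NbPut(src, dst, bytes, proc, NULL);  /* implicit handle */
      ARMCI_Fence(proc);
  }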
GA_Fence was probably added much later in response to NWChem users, so
it isn't surprising it is never called in NWChem except in a "test all
GA functionality" utility routine.
> UPC has a similar scenario: barrier implies a strict access that causes
> a flushall, but everybody knows that you avoid calling barrier to the
> extent you possibly can. So, the model you use is actually focused on
> minimizing either strict accesses or barriers, but you would typically
> rather do a strict access than a barrier. If GA code is optimized the
> same way, how often is GA_Sync() called?
GA_Sync is called like crazy in NWChem. There is an effort to fix
this pathological behavior in one particularly heinous abuser (TCE)
but it will be a long time before NWChem learns to be kind to network
hardware.
Jeff
--
Jeff Hammond
Argonne Leadership Computing Facility
jhammond at mcs.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond