[Mpi3-rma] RMA proposal 1 update

Fri May 21 08:36:52 CDT 2010

A few general comments - I think you and I are mixing the existing BGP MPI 
one-sided implementation and the ARMCI implementation. I'm not as familiar 
with how Doug is doing stuff in one-sided. I'm more familiar with our 
ARMCI implementation.

So, Doug's comments about how we would do things in MPI one-sided are 
probably valid; I'll have him reply to this thread though to confirm. I'm 
mostly offering comments based on our existing ARMCI implementation.

For ARMCI, I'm pretty sure I would do some sort of allreduce-and-iterate, 
yes. We could maybe do some tricks to minimize how many times we have to 
iterate, but I haven't thought about this much yet. And, yes, we can 
easily separate 1-sided vs 2-sided completion counting requirements in 
ARMCI. In MPI, it would still be fairly trivial; there is enough "glue" 
from 1-sided or 2-sided into DCMF that we could do the completion counting 
only on 1-sided. 

There would be some crossover point for collective vs our existing pt2pt 
fence scheme (or, some better scheme we haven't implemented), but I think 
the crossover would be very small in BG terms - I bet on the order of a 
midplane (512 nodes) or less, at least for how Jeff has described his comm 
patterns. 

We have plenty of benchmarks that can generate significant contention even 
with small(er) messages, especially as node count grows. Also, keep in 
mind we have no real flow control... Adaptive routing can help, but we 
don't usually use that for small messages.

Brian Smith (smithbr at us.ibm.com)
BlueGene MPI Development/
Communications Team Lead
IBM Rochester
Phone: 507 253 4717

From: "Underwood, Keith D" <keith.d.underwood at intel.com>
To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
Date: 05/20/2010 04:57 PM
Subject: Re: [Mpi3-rma] RMA proposal 1 update
Sent by: mpi3-rma-bounces at lists.mpi-forum.org

My point was, the way Jeff is doing synchronization in NWChem is via a 
fenceall(); barrier(); on the equivalent of MPI_COMM_WORLD. If I knew he 
was going to be primarily doing this (i.e., that he wanted to know that all 
nodes were synched), I would do something like maintain counts of sent and 
received messages on each node. I could then do something like an 
allreduce of those 2 ints over the tree to determine if everyone is 
synched. There are probably some technical details that would have to be 
worked out to ensure this works, but it seems good from 10,000 feet. 
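
A minimal sketch of the counting idea described above, written as a single-process simulation rather than real MPI or DCMF code. The function name, its parameters, and the delivery model are all hypothetical; Python's sum() stands in for the tree allreduce, and the sketch assumes no rank issues new one-sided operations once the fence has started.

```python
import random


def simulate_allfenceall(num_ranks=8, msgs_per_rank=5, deliver_per_round=7, seed=0):
    """Iterate "allreduce the two counters" until sent == received.

    Returns (rounds, sent, received), where rounds is the number of
    allreduce rounds needed before the global counts matched.
    """
    rng = random.Random(seed)
    sent = [0] * num_ranks       # per-rank count of one-sided messages sent
    received = [0] * num_ranks   # per-rank count of one-sided messages received
    in_flight = []               # target ranks of messages still in the network

    # Every rank sends msgs_per_rank messages to randomly chosen targets.
    for src in range(num_ranks):
        for _ in range(msgs_per_rank):
            in_flight.append(rng.randrange(num_ranks))
            sent[src] += 1

    rounds = 0
    while True:
        rounds += 1
        # Stand-in for the allreduce of 2 ints over the tree.
        total_sent = sum(sent)
        total_received = sum(received)
        if total_sent == total_received:
            return rounds, sent, received
        # Model network progress between rounds: a few messages land.
        for _ in range(min(deliver_per_round, len(in_flight))):
            received[in_flight.pop()] += 1


rounds, sent, received = simulate_allfenceall()
print(rounds, sum(sent), sum(received))
```

The number of rounds here depends entirely on the made-up delivery rate, which is the "how many times we have to iterate" question raised elsewhere in this thread.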

Right now we do numprocs 0-byte get operations to make sure the torus is 
flushed on each node. A torus operation is ~3us on a 512-way. It grows 
slowly with number of midplanes. I'm sure a 72 rack longest Manhattan 
distance noncongested pingpong is <10us, but I don't have the data in 
front of me. 

Based on Doug’s email, I had assumed you would know who you have sent 
messages to…  If you knew that in a given fence interval the node had only 
sent distinct messages to 1K other cores, you would only have 1K gets to 
issue.  Suck?  Yes.  Worse than the tree messages?  Maybe, maybe not.  
There is definitely a cross-over between 1 and np outstanding messages 
between fences where on the 1 side of things the tree messages are worse 
and on the np side of things the tree messages are better.  There is 
another spectrum based on request size where getting a response for every 
request becomes an inconsequential overhead.  I would have to know the 
cost of processing a message, the size of a response, and the cost of 
generating that response to create a proper graph of that.  
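
One rough way to frame that crossover, as a sketch only. The helper names are hypothetical, and the per-get cost is left as a parameter because, as noted above, it is not known here; the 0.5 us value and the 16-peer case below are made-up placeholders, not measurements. The ~5 us tree figure is the one quoted in this thread for a 512-way.

```python
def p2p_fence_gets(num_ranks, peers_per_rank):
    """Total 0-byte gets issued machine-wide when every rank flushes
    only the peers it actually sent to in this fence interval."""
    return num_ranks * peers_per_rank


def crossover_peers(tree_ops, tree_op_us, per_get_us):
    """Peers per rank at which issuing gets costs as much as the tree
    operations, i.e. peers * per_get_us == tree_ops * tree_op_us."""
    return tree_ops * tree_op_us / per_get_us


num_ranks = 512

# Dense case: every rank issues a get to every rank (np * np messages).
dense = p2p_fence_gets(num_ranks, num_ranks)

# Sparse case: each rank only talked to 16 peers (hypothetical number).
sparse = p2p_fence_gets(num_ranks, 16)

# Two tree allreduces at ~5 us each; 0.5 us per get is a placeholder.
breakeven = crossover_peers(tree_ops=2, tree_op_us=5.0, per_get_us=0.5)

print(dense, sparse, breakeven)
```

This ignores contention, locality, and overlap between outstanding gets, all of which would move the break-even point.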

A tree int/sum is roughly 5us on a 512-way and grows similarly. I would 
postulate that a 72 rack MPI allreduce int/sum is on the order of 10us.

So you generate np*np messages vs 1 tree message. Contention and all the 
overhead of that many messages will be significantly worse than even 
several tree messages.

Oh, wait, so, you would sum all sent and sum all received and then check 
if they were equal?  And then (presumably) iterate until the answer was 
yes?  Hrm.  That is more interesting.  Can you easily separate one-sided 
and two sided messages in your counting while maintaining the performance 
of one-sided messages?

Doug’s earlier answer implied you were going to allreduce a vector of 
counts (one per rank) and that would have been ugly.   I am assuming you 
would do at least 2 tree messages in what I believe you are describing, so 
there is still a crossover between n*np messages and m tree messages 
(where n is the number of outstanding requests between fencealls and 2 <= 
m <= 10), and the locality of communications impacts that crossover…  
BTW, can you actually generate messages fast enough to cause contention 
with tiny messages?

Anytime I know that an operation is collective, I can almost guarantee I 
can do it better than even a good pt2pt algorithm if I am utilizing our 
collective network. I think on machines that have remote completion 
notification an allfenceall() is just a barrier(), and since fenceall(); 
barrier(); is going to be replaced by allfenceall(), it doesn't seem to me 
like it is any extra overhead if allfenceall() is just a barrier() for 
you. 

My concerns are twofold:  1) we are talking about adding collective 
completion to passive target when active target was the one designed to 
have collective completion.  That is semantically and API-wise a bit 
ugly.  2) I think the allfenceall() as a collective will optimize to the 
case where you have outstanding requests to everybody and I believe that 
will be slower than the typical  case of having outstanding requests to 
some people.  I think that users would typically call allfenceall() rather 
than fenceall() + barrier() and then they would see a performance 
paradox:  the fenceall() + barrier() could be substantially faster when 
you have a “small” number of peers you are communicating with in this 
iteration.  I am not at all worried about the overhead of allfenceall() 
for networks with remote completion.  

Keith

From: "Underwood, Keith D" <keith.d.underwood at intel.com>
To: "MPI 3.0 Remote Memory Access working group" <mpi3-rma at lists.mpi-forum.org>
Date: 05/20/2010 09:19 AM
Subject: Re: [Mpi3-rma] RMA proposal 1 update
Sent by: mpi3-rma-bounces at lists.mpi-forum.org

> What is available in GA itself isn't really relevant to the Forum.  We
> need the functionality that enables someone to implement GA
> *efficiently* on current and future platforms.  We know ARMCI is
> *necessary* to implement GA efficiently on some platforms, but
> Vinod and I can provide very important cases where it is *not
> sufficient*.

Then let's enumerate those and work on a solution.

> The reason I want allfenceall is because a GA sync requires every
> process to fence all remote targets.  This is combined with a barrier,
> hence it might as well be a collective operation for everyone to fence
> all remote targets.  On BGP, implementing GA sync with fenceall from
> every node is hideous compared to what I can imagine can be done with
> active-message collectives.  I would bet a kidney it is hideous on
> Jaguar.  Vinod can sell my kidney in Singapore if I'm wrong.
> 
> The argument for allfenceall is the same as for sparse collectives.
> If there is an operation which could be done with multiple p2p calls,
> but has a collective character, it is guaranteed to be no worse to
> allow an MPI runtime to do it collectively.  I know that many
> applications will generate a sufficiently dense one-sided
> communication matrix to justify allfenceall.

So far, the argument I have heard for allflushall is:  BGP does not give 
remote completion information to the source.  Surely making it collective 
would be better. 

When I challenged that and asked for an implementation sketch, the 
implementation sketch provided is demonstrably worse for many scenarios 
than calling flushall and a barrier.  It would be a lot easier for the IBM 
people to do the math to show where the crossover point is, but so far, 
they haven't. 

> If you reject allfenceall, then I expect, and for intellectual
> consistency demand, that you vigorously protest against sparse
> collectives when they are proposed on the basis that they can
> obviously be done with p2p efficiently already.  Heck, why not also
> deprecate all MPI_Bcast etc. since on some networks it might not
> be faster than p2p?

MPI_Bcast can ALWAYS be made faster than a naïve implementation over p2p. 
That is the point of a collective. 

Ask Torsten how much flak I gave him over some of the things he has 
proposed for this reason.  Torsten made a rational argument for sparse 
collectives that they convey information that the system can use 
successfully for optimization.  I'm not 100% convinced, but he had to make 
that argument. 

> It is really annoying that you are such an obstructionist.  It is
> extremely counter-productive to the Forum and I know of no one

I am attempting to hold all things to the standards set for MPI-3:

1) you need a use case.
2) you need an implementation

Now, I tend to think that means you need an implementation that helps your 
use case.  In this particular case, you are asking to add collective 
completion to a one-sided completion model.  This is fundamentally 
inconsistent with the design of MPI RMA, which separates active target 
(collective completion) from passive target (one-sided completion).  This 
maps well to much of the known world of PGAS-like models:  CoArray Fortran 
uses collective completion and UPC uses one-sided completion (admittedly, 
a call to barrier will give collective completion in UPC, but that is 
because a barrier without completion is meaningless).  This mixture of the 
two models puts us at risk of always getting poor one-sided completion 
implementations, since there is the "out" of telling people to call the 
collective completion routine.  This would effectively gut the advantages 
of passive target. 

So far, we have proposed adding:

1) Completion independent of synchronization
2) Some key remote operations
3) an ability to operate on the full window in one epoch

In my opinion, adding collective communication to passive target is a much 
bigger deal.

> deriving intellectual benefit from the endless stream of protests and
> demands for OpenSHMEM-like behavior.  As the ability to implement GA
> on top of MPI-3 RMA is a stated goal of the working group, I feel no
> shame in proposing function calls which are motivated entirely by this
> purpose.

Endless stream of demands for OpenSHMEM-like behavior?  I have asked (at 
times vigorously) for a memory model that would support the UPC memory 
model.  The ability to support UPC is also in that stated goal along with 
implementing GA.  I have used SHMEM as an example of that memory model 
being done in an API and having hardware support from vendors.  I have 
also argued that the memory model that supports UPC would be attractive to 
SHMEM users and that OpenSHMEM is likely to be a competitor for mind share 
for RMA-like programming models.  I have lost that argument to the 
relatively vague "that might make performance worse in some cases".  I 
find that frustrating, but I don't think I have raised it since the last 
meeting.

Keith

_______________________________________________
mpi3-rma mailing list
mpi3-rma at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma
