<br><tt><font size=2>>So far, the argument I have heard for allflushall

is:  BGP does not give remote completion information to the source.

 Surely making it collective would be better. <br>

<br>

>When I challenged that and asked for an implementation sketch, the

implementation sketch provided is demonstrably worse for many scenarios

than calling flushall and a barrier.  It would be a lot easier

for the IBM people to do the math to show where the crossover point is,

but so far, they haven't. <br>

</font></tt>

<br><tt><font size=2>My point was, the way Jeff is doing synchronization

in NWChem is via a fenceall(); barrier(); on the equivalent of MPI_COMM_WORLD.

If I knew he was going to be primarily doing this (i.e., that he wanted to

know that all nodes were synched), I would do something like maintain counts

of sent and received messages on each node. I could then do something like

an allreduce of those 2 ints over the tree to determine if everyone is

synched. There are probably some technical details that would have to be

worked out to ensure this works but it seems good from 10000 feet.</font></tt>
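<br>

<br><tt><font size=2>A minimal sketch of the counting idea above, written as a plain Python simulation rather than real MPI or BGP messaging code. All names here (Node, put, deliver, allreduce_sum, all_synched, allfenceall) are hypothetical, the allreduce is only a stand-in for the tree int/sum, and the sketch assumes nobody issues new one-sided operations while the collective is in progress.</font></tt>

```python
# Simulation only: no real MPI or messaging-layer calls.
# Every name below is hypothetical and exists just for illustration.

class Node:
    """One rank with local counters for one-sided traffic."""
    def __init__(self, rank):
        self.rank = rank
        self.sent = 0      # one-sided messages this node has injected
        self.received = 0  # one-sided messages that have landed here


def put(src, dst, network):
    """Inject a message; it stays in flight until deliver() lands it."""
    src.sent += 1
    network.append(dst)


def deliver(network, count):
    """Land up to `count` in-flight messages at their targets."""
    for _ in range(min(count, len(network))):
        target = network.pop(0)
        target.received += 1


def allreduce_sum(nodes):
    """Stand-in for a tree allreduce of the two ints (sent, received)."""
    return (sum(n.sent for n in nodes), sum(n.received for n in nodes))


def all_synched(nodes):
    """Everyone is synched once global sent equals global received."""
    total_sent, total_received = allreduce_sum(nodes)
    return total_sent == total_received


def allfenceall(nodes, network):
    """Repeat the reduction until the counts agree; return rounds used."""
    rounds = 1
    while not all_synched(nodes):
        deliver(network, 2)  # assumption: the network drains between rounds
        rounds += 1
    return rounds


if __name__ == "__main__":
    nodes = [Node(r) for r in range(4)]
    network = []
    for src in nodes:
        for dst in nodes:
            if src is not dst:
                put(src, dst, network)
    print("rounds:", allfenceall(nodes, network))
    print("totals:", allreduce_sum(nodes))
```

<br><tt><font size=2>Each extra round costs one more tree reduction, so the scheme only pays off if the counts converge in a small number of rounds. That is one of the details that would have to be worked out.</font></tt>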

<br>

<br><tt><font size=2>Right now we do numprocs 0-byte get operations to

make sure the torus is flushed on each node. A torus operation is ~3us

on a 512-way. It grows slowly with number of midplanes. I'm sure a 72 rack

longest Manhattan distance noncongested pingpong is &lt;10us, but I don't

have the data in front of me.</font></tt>

<br>

<br><tt><font size=2>A tree int/sum is roughly 5us on a 512-way and grows

similarly. I would postulate that a 72 rack MPI allreduce int/sum is on

the order of 10us. </font></tt>

<br>

<br><tt><font size=2>So you generate np*np messages vs 1 tree message.

Contention and all the overhead of that many messages will be significantly

worse than even several tree messages.</font></tt>
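<br>

<br><tt><font size=2>To put rough numbers on that comparison, here is a back-of-envelope sketch in Python. The two latencies are the 512-way estimates quoted above, not measurements. The function names and the best/worst overlap bounds are illustrative assumptions, and contention is not modeled at all.</font></tt>

```python
# Back-of-envelope model. The latencies are the rough 512-way figures
# quoted in the text, not measurements; the overlap bounds are assumptions.

TORUS_OP_US = 3.0  # ~3us per torus operation (quoted estimate)
TREE_SUM_US = 5.0  # ~5us per tree int/sum (quoted estimate)


def flush_messages(numprocs):
    """Total 0-byte gets when every node flushes every node."""
    return numprocs * numprocs


def flush_time_bounds_us(numprocs, op_us=TORUS_OP_US):
    """(best, worst) per-node time: fully overlapped vs fully serialized gets.

    Contention is ignored, so the true cost may exceed the best case by a lot.
    """
    return (op_us, numprocs * op_us)


def counting_time_us(rounds=1, tree_us=TREE_SUM_US):
    """Time for `rounds` tree allreduces of the sent/received counters."""
    return rounds * tree_us


if __name__ == "__main__":
    np_ = 512
    print("flush messages:", flush_messages(np_))
    print("flush time bounds (us):", flush_time_bounds_us(np_))
    print("three tree rounds (us):", counting_time_us(3))
```

<br><tt><font size=2>Even if the counting scheme needs several tree rounds, under these assumptions it stays in the tens of microseconds, while the message count for the flush approach grows as np*np.</font></tt>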

<br>

<br><tt><font size=2>I think you really summarized it for me on BGP at

least:</font></tt>

<br><tt><font size=2>>MPI_Bcast/(insert: "collective synchronization")

can ALWAYS be made faster than a naïve implementation over p2p.  That

is the point of a collective.  </font></tt>

<br>

<br><tt><font size=2>Anytime I know that an operation is collective, I

can almost guarantee I can do it better than even a good pt2pt algorithm

if I am utilizing our collective network. I think on machines that have

remote completion notification an allfenceall() is just a barrier(), and

since fenceall(); barrier(); is going to be replaced by allfenceall(),

it doesn't seem to me like it is any extra overhead if allfenceall() is

just a barrier() for you.</font></tt>

<br>

<br><tt><font size=2>Just my $.02.</font></tt>

<br>

<br>

<br><font size=2 face="sans-serif"><br>

Brian Smith (smithbr@us.ibm.com)<br>

BlueGene MPI Development/<br>

Communications Team Lead<br>

IBM Rochester<br>

Phone: 507 253 4717</font>

<br>

<br>

<br>

<table width=100%>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">From:</font>

<td><font size=1 face="sans-serif">"Underwood, Keith D" &lt;keith.d.underwood@intel.com&gt;</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">To:</font>

<td><font size=1 face="sans-serif">"MPI 3.0 Remote Memory Access working

group" &lt;mpi3-rma@lists.mpi-forum.org&gt;</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Date:</font>

<td><font size=1 face="sans-serif">05/20/2010 09:19 AM</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Subject:</font>

<td><font size=1 face="sans-serif">Re: [Mpi3-rma] RMA proposal 1 update</font>

<tr valign=top>

<td><font size=1 color=#5f5f5f face="sans-serif">Sent by:</font>

<td><font size=1 face="sans-serif">mpi3-rma-bounces@lists.mpi-forum.org</font></table>

<br>

<hr noshade>

<br>

<br>

<br><tt><font size=2>> What is available in GA itself isn't really relevant

to the Forum.  We<br>

> need the functionality that enables someone to implement GA<br>

> ~~~efficiently~~~ on current and future platforms.  We know ARMCI

is<br>

> ~~~necessary~~~ to implement GA efficiently on some platforms, but<br>

> Vinod and I can provide very important cases where it is ~~~not<br>

> sufficient~~~.<br>

<br>

Then let's enumerate those and work on a solution.<br>

<br>

> The reason I want allfenceall is because a GA sync requires every<br>

> process to fence all remote targets.  This is combined with a

barrier,<br>

> hence it might as well be a collective operation for everyone to fence<br>

> all remote targets.  On BGP, implementing GA sync with fenceall

from<br>

> every node is hideous compared to what I can imagine can be done with<br>

> active-message collectives.  I would bet a kidney it is hideous

on<br>

> Jaguar.  Vinod can sell my kidney in Singapore if I'm wrong.<br>

> <br>

> The argument for allfenceall is the same as for sparse collectives.<br>

> If there is an operation which could be done with multiple p2p calls,<br>

> but has a collective character, it is guaranteed to be no worse to<br>

> allow an MPI runtime to do it collectively.  I know that many<br>

> applications will generate a sufficiently dense one-sided<br>

> communication matrix to justify allfenceall.<br>

<br>

So far, the argument I have heard for allflushall is:  BGP does not

give remote completion information to the source.  Surely making it

collective would be better. <br>

<br>

When I challenged that and asked for an implementation sketch, the implementation

sketch provided is demonstrably worse for many scenarios than calling flushall

and a barrier.  It would be a lot easier for the IBM people to do

the math to show where the crossover point is, but so far, they haven't.

<br>

<br>

> If you reject allfenceall, then I expect, and for intellectual<br>

> consistency demand, that you vigorously protest against sparse<br>

> collectives when they are proposed on the basis that they can<br>

> obviously be done with p2p efficiently already.  Heck, why not

also<br>

> deprecate all MPI_Bcast etc. since on some networks it might

not<br>

> be faster than p2p?<br>

<br>

MPI_Bcast can ALWAYS be made faster than a naïve implementation over p2p.

 That is the point of a collective.  <br>

<br>

Ask Torsten how much flak I gave him over some of the things he has proposed

for this reason.  Torsten made a rational argument for sparse collectives

that they convey information that the system can use successfully for optimization.

 I'm not 100% convinced, but he had to make that argument.  <br>

<br>

> It is really annoying that you are such an obstructionist.  It

is<br>

> extremely counter-productive to the Forum and I know of no one<br>

<br>

I am attempting to hold all things to the standards set for MPI-3:<br>

<br>

1) you need a use case.<br>

2) you need an implementation.<br>

<br>

Now, I tend to think that means you need an implementation that helps your

use case.  In this particular case, you are asking to add collective

completion to a one-sided completion model.  This is fundamentally

inconsistent with the design of MPI RMA, which separates active target

(collective completion) from passive target (one-sided completion).  This

maps well to much of the known world of PGAS-like models:  CoArray

Fortran uses collective completion and UPC uses one-sided completion (admittedly,

a call to barrier will give collective completion in UPC, but that is because

a barrier without completion is meaningless).  This mixture of the

two models puts us at risk of always getting poor one-sided completion

implementations, since there is the "out" of telling people to

call the collective completion routine.  This would effectively gut

the advantages of passive target.  <br>

<br>

So far, we have proposed adding:<br>

<br>

1) Completion independent of synchronization<br>

2) Some key remote operations<br>

3) an ability to operate on the full window in one epoch<br>

<br>

In my opinion, adding collective communication to passive target is a much

bigger deal.<br>

<br>

> deriving intellectual benefit from the endless stream of protests

and<br>

> demands for OpenSHMEM-like behavior.  As the ability to implement

GA<br>

> on top of MPI-3 RMA is a stated goal of the working group, I feel

no<br>

> shame in proposing function calls which are motivated entirely by

this<br>

> purpose.<br>

<br>

Endless stream of demands for OpenSHMEM-like behavior?  I have asked

(at times vigorously) for a memory model that would support the UPC memory

model.  The ability to support UPC is also in that stated goal along

with implementing GA.  I have used SHMEM as an example of that memory

model being done in an API and having hardware support from vendors.  I

have also argued that the memory model that supports UPC would be attractive

to SHMEM users and that OpenSHMEM is likely to be a competitor for mind

share for RMA-like programming models.  I have lost that argument

to the relatively vague "that might make performance worse in some

cases".  I find that frustrating, but I don't think I have raised

it since the last meeting.<br>

<br>

Keith<br>

<br>

</font></tt>

<br>

<br>