<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:p="urn:schemas-microsoft-com:office:powerpoint" xmlns:a="urn:schemas-microsoft-com:office:access" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:z="#RowsetSchema" xmlns:b="urn:schemas-microsoft-com:office:publisher" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:c="urn:schemas-microsoft-com:office:component:spreadsheet" xmlns:odc="urn:schemas-microsoft-com:office:odc" xmlns:oa="urn:schemas-microsoft-com:office:activation" xmlns:html="http://www.w3.org/TR/REC-html40" xmlns:q="http://schemas.xmlsoap.org/soap/envelope/" xmlns:rtc="http://microsoft.com/officenet/conferencing" xmlns:D="DAV:" xmlns:Repl="http://schemas.microsoft.com/repl/" xmlns:mt="http://schemas.microsoft.com/sharepoint/soap/meetings/" xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:ppda="http://www.passport.com/NameSpace.xsd" xmlns:ois="http://schemas.microsoft.com/sharepoint/soap/ois/" xmlns:dir="http://schemas.microsoft.com/sharepoint/soap/directory/" xmlns:ds="http://www.w3.org/2000/09/xmldsig#" xmlns:dsp="http://schemas.microsoft.com/sharepoint/dsp" xmlns:udc="http://schemas.microsoft.com/data/udc" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:sub="http://schemas.microsoft.com/sharepoint/soap/2002/1/alerts/" xmlns:ec="http://www.w3.org/2001/04/xmlenc#" xmlns:sp="http://schemas.microsoft.com/sharepoint/" xmlns:sps="http://schemas.microsoft.com/sharepoint/soap/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:udcs="http://schemas.microsoft.com/data/udc/soap" xmlns:udcxf="http://schemas.microsoft.com/data/udc/xmlfile" xmlns:udcp2p="http://schemas.microsoft.com/data/udc/parttopart" xmlns:wf="http://schemas.microsoft.com/sharepoint/soap/workflow/" 
xmlns:dsss="http://schemas.microsoft.com/office/2006/digsig-setup" xmlns:dssi="http://schemas.microsoft.com/office/2006/digsig" xmlns:mdssi="http://schemas.openxmlformats.org/package/2006/digital-signature" xmlns:mver="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns:mrels="http://schemas.openxmlformats.org/package/2006/relationships" xmlns:spwp="http://microsoft.com/sharepoint/webpartpages" xmlns:ex12t="http://schemas.microsoft.com/exchange/services/2006/types" xmlns:ex12m="http://schemas.microsoft.com/exchange/services/2006/messages" xmlns:pptsl="http://schemas.microsoft.com/sharepoint/soap/SlideLibrary/" xmlns:spsl="http://microsoft.com/webservices/SharePointPortalServer/PublishedLinksService" xmlns:Z="urn:schemas-microsoft-com:" xmlns:st="" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
tt
{mso-style-priority:99;
font-family:"Courier New";}
span.EmailStyle18
{mso-style-type:personal-reply;
font-family:"Calibri","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.Section1
{page:Section1;}
-->
</style>
<!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<div style='border:none;border-left:solid blue 1.5pt;padding:0in 0in 0in 4.0pt'>
<p class=MsoNormal style='margin-bottom:12.0pt'><tt><span style='font-size:
10.0pt'>My point was, the way Jeff is doing synchronization in NWChem is via a
fenceall(); barrier(); on the equivalent of MPI_COMM_WORLD. If I knew he was
primarily going to be doing this (i.e., that he wanted to know that all nodes
were synched), I would do something like maintain counts of sent and received
messages on each node. I could then do something like an allreduce of those 2
ints over the tree to determine if everyone is synched. There are probably some
technical details that would have to be worked out to ensure this works but it
seems good from 10000 feet.</span></tt> <br>
<br>
<tt><span style='font-size:10.0pt'>Right now we do numprocs 0-byte get
operations to make sure the torus is flushed on each node. A torus operation is
~3us on a 512-way. It grows slowly with the number of midplanes. I'm sure a 72-rack
longest-Manhattan-distance noncongested pingpong is &lt;10us, but I don't have
the data in front of me.</span></tt> <br>
<br>
<span style='color:#1F497D'><o:p></o:p></span></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><b><span style='font-size:11.0pt;
font-family:"Calibri","sans-serif";color:#F79646'>Based on Doug’s email, I
had assumed you would know who you have sent messages to… If you knew
that in a given fence interval the node had only sent distinct messages to 1K
other cores, you would only have 1K gets to issue. Suck? Yes. Worse than the
tree messages? Maybe, maybe not. There is definitely a crossover between 1
and np outstanding messages between fences where on the 1 side of things the
tree messages are worse and on the np side of things the tree messages are
better. There is another spectrum based on request size where getting a
response for every request becomes an inconsequential overhead. I would have
to know the cost of processing a message, the size of a response, and the cost
of generating that response to create a proper graph of that. </span></b><span
style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><o:p></o:p></span></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><br>
<tt><span style='font-size:10.0pt'>A tree int/sum is roughly 5us on a 512-way
and grows similarly. I would postulate that a 72-rack MPI allreduce int/sum is
on the order of 10us.</span></tt><br>
<br>
<tt><span style='font-size:10.0pt'>So you generate np*np messages vs 1 tree
message. Contention and all the overhead of that many messages will be
significantly worse than even several tree messages.<span style='color:#1F497D'><o:p></o:p></span></span></tt></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><b><span style='font-size:11.0pt;
font-family:"Calibri","sans-serif";color:#F79646'>Oh, wait, so, you would sum
all sent and sum all received and then check if they were equal? And then
(presumably) iterate until the answer was yes? Hrm. That is more
interesting. Can you easily separate one-sided and two sided messages in your
counting while maintaining the performance of one-sided messages?<o:p></o:p></span></b></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><b><span style='font-size:11.0pt;
font-family:"Calibri","sans-serif";color:#F79646'>Doug’s earlier answer implied
you were going to allreduce a vector of counts (one per rank) and that would
have been ugly. I am assuming you would do at least 2 tree messages in what I
believe you are describing, so there is still a crossover between n*np messages
and m tree messages (where n is the number of outstanding requests between
fencealls and 2 &lt;= m &lt;= 10), and the locality of communications impacts
that crossover… <o:p></o:p></span></b></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><b><span style='font-size:11.0pt;
font-family:"Calibri","sans-serif";color:#F79646'>BTW, can you actually generate
messages fast enough to cause contention with tiny messages?<o:p></o:p></span></b></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><tt><span style='font-size:
10.0pt'>Anytime I know that an operation is collective, I can almost guarantee
I can do it better than even a good pt2pt algorithm if I am utilizing our
collective network. I think on machines that have remote completion
notification an allfenceall() is just a barrier(), and since fenceall();
barrier(); is going to be replaced by allfenceall(), it doesn't seem to me like
it is any extra overhead if allfenceall() is just a barrier() for you.</span></tt>
<br>
<br>
<tt><span style='font-size:10.0pt;color:#F79646'><o:p></o:p></span></tt></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><b><span style='font-size:11.0pt;
font-family:"Calibri","sans-serif";color:#F79646'>My concerns are twofold: 1)
we are talking about adding collective completion to passive target when active
target was the one designed to have collective completion. That is
semantically and API-wise a bit ugly. 2) I think the allfenceall() as a
collective will optimize to the case where you have outstanding requests to
everybody and I believe that will be slower than the typical case of having
outstanding requests to some people. I think that users would typically call
allfenceall() rather than fenceall() + barrier() and then they would see a
performance paradox: the fenceall() + barrier() could be substantially faster when
you have a “small” number of peers you are communicating with in
this iteration. I am not at all worried about the overhead of allfenceall()
for networks with remote completion. <o:p></o:p></span></b></p>
<p class=MsoNormal style='margin-bottom:12.0pt'><b><span style='font-size:11.0pt;
font-family:"Calibri","sans-serif";color:#F79646'>Keith<o:p></o:p></span></b></p>
<table class=MsoNormalTable border=0 cellpadding=0 width="100%"
style='width:100.0%'>
<tr>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif";
color:#5F5F5F'>From:</span> <o:p></o:p></p>
</td>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif"'>"Underwood,
Keith D" <keith.d.underwood@intel.com></span> <o:p></o:p></p>
</td>
</tr>
<tr>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif";
color:#5F5F5F'>To:</span> <o:p></o:p></p>
</td>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif"'>"MPI
3.0 Remote Memory Access working group"
<mpi3-rma@lists.mpi-forum.org></span> <o:p></o:p></p>
</td>
</tr>
<tr>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif";
color:#5F5F5F'>Date:</span> <o:p></o:p></p>
</td>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif"'>05/20/2010
09:19 AM</span> <o:p></o:p></p>
</td>
</tr>
<tr>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif";
color:#5F5F5F'>Subject:</span> <o:p></o:p></p>
</td>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif"'>Re:
[Mpi3-rma] RMA proposal 1 update</span> <o:p></o:p></p>
</td>
</tr>
<tr>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif";
color:#5F5F5F'>Sent by:</span> <o:p></o:p></p>
</td>
<td valign=top style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal><span style='font-size:7.5pt;font-family:"Arial","sans-serif"'><a
href="mailto:mpi3-rma-bounces@lists.mpi-forum.org">mpi3-rma-bounces@lists.mpi-forum.org</a></span><o:p></o:p></p>
</td>
</tr>
</table>
<p class=MsoNormal><o:p> </o:p></p>
<div class=MsoNormal align=center style='text-align:center'>
<hr size=2 width="100%" noshade style='color:#ACA899' align=center>
</div>
<p class=MsoNormal style='margin-bottom:12.0pt'><br>
<br>
<br>
<tt><span style='font-size:10.0pt'>> What is available in GA itself isn't
really relevant to the Forum. We</span></tt><span style='font-size:10.0pt;
font-family:"Courier New"'><br>
<tt>> need the functionality that enables someone to implement GA</tt><br>
<tt>> ~~~efficiently~~~ on current and future platforms. We know ARMCI
is</tt><br>
<tt>> ~~~necessary~~~ to implement GA efficiently on some platforms, but</tt><br>
<tt>> Vinod and I can provide very important cases where it is ~~~not</tt><br>
<tt>> sufficient~~~.</tt><br>
<br>
<tt>Then let's enumerate those and work on a solution.</tt><br>
<br>
<tt>> The reason I want allfenceall is because a GA sync requires every</tt><br>
<tt>> process to fence all remote targets. This is combined with a
barrier,</tt><br>
<tt>> hence it might as well be a collective operation for everyone to fence</tt><br>
<tt>> all remote targets. On BGP, implementing GA sync with fenceall
from</tt><br>
<tt>> every node is hideous compared to what I can imagine can be done with</tt><br>
<tt>> active-message collectives. I would bet a kidney it is hideous
on</tt><br>
<tt>> Jaguar. Vinod can sell my kidney in Singapore if I'm wrong.</tt><br>
<tt>> </tt><br>
<tt>> The argument for allfenceall is the same as for sparse collectives.</tt><br>
<tt>> If there is an operation which could be done with multiple p2p calls,</tt><br>
<tt>> but has a collective character, it is guaranteed to be no worse to</tt><br>
<tt>> allow an MPI runtime to do it collectively. I know that many</tt><br>
<tt>> applications will generate a sufficiently dense one-sided</tt><br>
<tt>> communication matrix to justify allfenceall.</tt><br>
<br>
<tt>So far, the argument I have heard for allflushall is: BGP does not
give remote completion information to the source. Surely making it
collective would be better. </tt><br>
<br>
<tt>When I challenged that and asked for an implementation sketch, the
implementation sketch provided is demonstrably worse for many scenarios than
calling flushall and a barrier. It would be a lot easier for the IBM
people to do the math to show where the crossover point is, but so far, they
haven't. </tt><br>
<br>
<tt>> If you reject allfenceall, then I expect, and for intellectual</tt><br>
<tt>> consistency demand, that you vigorously protest against sparse</tt><br>
<tt>> collectives when they are proposed on the basis that they can</tt><br>
<tt>> obviously be done with p2p efficiently already. Heck, why not
also</tt><br>
<tt>> deprecate MPI_Bcast etc., since on some networks it might not</tt><br>
<tt>> be faster than p2p?</tt><br>
<br>
<tt>MPI_Bcast can ALWAYS be made faster than a naïve implementation over p2p.
That is the point of a collective. </tt><br>
<br>
<tt>Ask Torsten how much flak I gave him over some of the things he has
proposed for this reason. Torsten made a rational argument for sparse
collectives that they convey information that the system can use successfully
for optimization. I'm not 100% convinced, but he had to make that
argument. </tt><br>
<br>
<tt>> It is really annoying that you are such an obstructionist. It is</tt><br>
<tt>> extremely counter-productive to the Forum and I know of no one</tt><br>
<br>
<tt>I am attempting to hold all things to the standards set for MPI-3:</tt><br>
<br>
<tt>1) you need a use case.</tt><br>
<tt>2) you need an implementation.</tt><br>
<br>
<tt>Now, I tend to think that means you need an implementation that helps your
use case. In this particular case, you are asking to add collective
completion to a one-sided completion model. This is fundamentally
inconsistent with the design of MPI RMA, which separates active target
(collective completion) from passive target (one-sided completion). This
maps well to much of the known world of PGAS-like models: CoArray Fortran
uses collective completion and UPC uses one-sided completion (admittedly, a
call to barrier will give collective completion in UPC, but that is because a
barrier without completion is meaningless). This mixture of the two
models puts us at risk of always getting poor one-sided completion
implementations, since there is the "out" of telling people to call
the collective completion routine. This would effectively gut the
advantages of passive target. </tt><br>
<br>
<tt>So far, we have proposed adding:</tt><br>
<br>
<tt>1) Completion independent of synchronization</tt><br>
<tt>2) Some key remote operations</tt><br>
<tt>3) an ability to operate on the full window in one epoch</tt><br>
<br>
<tt>In my opinion, adding collective communication to passive target is a much
bigger deal.</tt><br>
<br>
<tt>> deriving intellectual benefit from the endless stream of protests and</tt><br>
<tt>> demands for OpenSHMEM-like behavior. As the ability to implement
GA</tt><br>
<tt>> on top of MPI-3 RMA is a stated goal of the working group, I feel no</tt><br>
<tt>> shame in proposing function calls which are motivated entirely by this</tt><br>
<tt>> purpose.</tt><br>
<br>
<tt>Endless stream of demands for OpenSHMEM-like behavior? I have asked
(at times vigorously) for a memory model that would support the UPC memory
model. The ability to support UPC is also in that stated goal along with
implementing GA. I have used SHMEM as an example of that memory model
being done in an API and having hardware support from vendors. I have
also argued that the memory model that supports UPC would be attractive to
SHMEM users and that OpenSHMEM is likely to be a competitor for mind share for
RMA-like programming models. I have lost that argument to the relatively
vague "that might make performance worse in some cases". I find
that frustrating, but I don't think I have raised it since the last meeting.</tt><br>
<br>
<tt>Keith</tt><br>
<br>
<tt>_______________________________________________</tt><br>
<tt>mpi3-rma mailing list</tt><br>
<tt>mpi3-rma@lists.mpi-forum.org</tt><br>
</span><a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma"><tt><span
style='font-size:10.0pt'>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-rma</span></tt></a><span
style='font-size:10.0pt;font-family:"Courier New"'><br>
<br>
</span><o:p></o:p></p>
</div>
</div>
</body>
</html>