<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<style type="text/css" style="display:none"><!-- P { margin-top: 0px; margin-bottom: 0px; }--></style>
</head>
<body dir="ltr" style="font-size:12pt;color:#000000;background-color:#FFFFFF;font-family:Calibri,Arial,Helvetica,sans-serif;">
<p>We have data from our first persistent collective paper showing no significant advantage for blocking collectives over nonblocking or persistent ones, even though we haven't optimized the persistent path much yet.<br>
</p>
<p><br>
</p>
<p>MPIs with strong progress can give you more benefit for long transfers, provided there is a good implementation and sufficient memory bandwidth, and you have something to do between Start and Wait... <br>
</p>
<p>We had success with point-to-point-based strong progress and overlap over 15 years ago... only for really short-message applications did we want polling progress or progress only at wait....<br>
</p>
<p><br>
</p>
<p>Tony<br>
</p>
<p><br>
</p>
<p><br>
</p>
<div id="Signature">
<div name="divtagdefaultwrapper" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:; margin:0">
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div style="margin-top:0px; margin-bottom:0px"><font face="Times New Roman, Times, serif" size="2">Anthony Skjellum, PhD</font></div>
<div id="Signature">
<div style="margin:0px"><font face="Times New Roman, Times, serif" size="2">
<div style="margin:0px; background-color:rgb(255,255,255)">Professor of Computer Science and Software Engineering and</div>
<div style="margin:0px; background-color:rgb(255,255,255)"> Charles D. McCrary Eminent Scholar Endowed Chair</div>
<div style="margin:0px; background-color:rgb(255,255,255)">Director of the Charles D. McCrary Institute</div>
<div style="margin:0px; background-color:rgb(255,255,255)">Samuel Ginn College of Engineering</div>
</font></div>
</div>
<p class="p1" style="background-color:rgb(255,255,255)"><font face="Times New Roman, Times, serif" size="2">Auburn University<br>
e-mail: skjellum@auburn.edu or skjellum@gmail.com</font></p>
<p class="p1" style="background-color:rgb(255,255,255)"><span style="font-family:"Times New Roman",Times,serif; font-size:small">web sites:
</span><a tabindex="0" href="http://cyber.auburn.edu" target="_blank" id="NoLP">http://cyber.auburn.edu</a> <a tabindex="0" href="http://mccrary.auburn.edu" target="_blank" id="NoLP">http://mccrary.auburn.edu</a> </p>
<p class="p1" style="background-color:rgb(255,255,255)"><font face="Times New Roman, Times, serif" size="2">cell: +1-205-807-4968 ; office: +1-334-844-6360</font></p>
<p class="p1" style="font-family:Tahoma"><font size="2"><br>
</font></p>
<p class="p1" style="font-family:Tahoma"><font size="2">CONFIDENTIALITY: This e-mail and any attachments are confidential and <br>
may be privileged. If you are not a named recipient, please notify the <br>
sender immediately and do not disclose the contents to another person, <br>
use it for any purpose or store or copy the information in any medium.</font></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div style="word-wrap:break-word; color:rgb(0,0,0); font-size:14px; font-family:Calibri,sans-serif">
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Langer, Akhil <akhil.langer@intel.com><br>
<b>Sent:</b> Wednesday, May 17, 2017 4:29 PM<br>
<b>To:</b> Dan Holmes; mpiwg-coll@lists.mpi-forum.org<br>
<b>Cc:</b> mpiwg-persistence@lists.mpi-forum.org; Anthony Skjellum; htor@inf.ethz.ch; Balaji, Pavan<br>
<b>Subject:</b> Re: persistent blocking collectives</font>
<div> </div>
</div>
<div>
<div>
<div>Hi Dan,</div>
<div><br>
</div>
<div>Thanks a lot for your reply. As you suggested, we could add an MPI_Start_and_wait() call that is a blocking version of the MPI_Start call. It could be used for both pt2pt and collective operations, without any additional changes.</div>
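<div><br>
</div>
<div>A minimal sketch of how this might look in use (purely illustrative: MPI_Start_and_wait is the call suggested here and MPI_Bcast_init stands in for the working group's proposed persistent-collective initialization; neither is in the current standard):</div>
<pre>
MPI_Request req;

/* persistent point-to-point */
MPI_Send_init(buf, count, MPI_DOUBLE, dest, tag, comm, &req);
for (int i = 0; i < iters; i++)
    MPI_Start_and_wait(&req, MPI_STATUS_IGNORE);  /* proposed blocking start */
MPI_Request_free(&req);

/* persistent collective (proposed initialization call) */
MPI_Bcast_init(buf, count, MPI_DOUBLE, root, comm, info, &req);
for (int i = 0; i < iters; i++)
    MPI_Start_and_wait(&req, MPI_STATUS_IGNORE);
MPI_Request_free(&req);
</pre>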
<div><br>
</div>
<div>I have noticed a tangible performance difference between the two broadcast implementations that I provided in my original email. Most real HPC applications still use only blocking collectives, so having a blocking MPI_Start (that is, MPI_Start_and_wait) call for collectives is natural. The user can simply replace the blocking collective call with an MPI_Start_and_wait call.</div>
<div>We have also seen that blocking sends/recvs are faster than the corresponding non-blocking calls. </div>
<div><br>
</div>
<div>Please let me know what kind of information would be useful to make this succeed. I can work on this.</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Akhil </div>
</div>
<div><br>
</div>
<span id="OLK_SRC_BODY_SECTION">
<div style="font-family:Calibri; font-size:12pt; text-align:left; color:black; border-bottom:medium none; border-left:medium none; padding-bottom:0in; padding-left:0in; padding-right:0in; border-top:#b5c4df 1pt solid; border-right:medium none; padding-top:3pt">
<span style="font-weight:bold">From: </span>Dan Holmes <<a href="mailto:d.holmes@epcc.ed.ac.uk">d.holmes@epcc.ed.ac.uk</a>><br>
<span style="font-weight:bold">Date: </span>Wednesday, May 17, 2017 at 5:10 AM<br>
<span style="font-weight:bold">To: </span>Akhil Langer <<a href="mailto:akhil.langer@intel.com">akhil.langer@intel.com</a>><br>
<span style="font-weight:bold">Cc: </span>"<a href="mailto:mpiwg-persistence@lists.mpi-forum.org">mpiwg-persistence@lists.mpi-forum.org</a>" <<a href="mailto:mpiwg-persistence@lists.mpi-forum.org">mpiwg-persistence@lists.mpi-forum.org</a>>, Anthony Skjellum
<<a href="mailto:skjellum@auburn.edu">skjellum@auburn.edu</a>><br>
<span style="font-weight:bold">Subject: </span>Re: persistent blocking collectives<br>
</div>
<div><br>
</div>
<div>
<div class="" style="word-wrap:break-word">Hi Akhil,
<div class=""><br class="">
</div>
<div class="">Thank you for your suggestion. This is an interesting area of API design for MPI. Let me jot down some notes in response to your points.</div>
<div class=""><br class="">
</div>
<div class="">The MPI_Start function is used by both our proposed persistent collective communications and the existing persistent point-to-point communications. For consistency in the MPI Standard, any change to MPI_Start must be applied to point-to-point
as well.</div>
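<div class=""><br class="">
</div>
<div class="">For reference, the existing persistent point-to-point pattern that any such change must stay consistent with is (standard MPI, one request shown):</div>
<pre>
MPI_Request req;
MPI_Send_init(buf, count, MPI_INT, dest, tag, comm, &req);  /* plan once */
for (int i = 0; i < iters; i++) {
    MPI_Start(&req);                    /* nonblocking start of the planned send */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete this use of the plan         */
}
MPI_Request_free(&req);                 /* release the plan */
</pre>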
<div class=""><br class="">
</div>
<div class="">Our implementation work for persistent collective communication currently leverages point-to-point communication in a similar manner to your description of the tree broadcast. However, this is not required by the MPI Standard and is known to be
a sub-optimal implementation choice. The interface design should not be determined by the needs of a poor implementation method.</div>
<div class=""><br class="">
</div>
<div class="">All schedules for persistent collective communication operations involve multiple “rounds”. Each round concludes with a dependency on one or more remote MPI processes, i.e. a “wait”. This is not the case with point-to-point, where lower latency
can be achieved with a fire-and-forget approach in some situations (ready mode or small eager protocol messages). Even for small buffer sizes, there is no ready mode or eager protocol for collective communications.</div>
<div class=""><br class="">
</div>
<div class="">There is ongoing debate about the best method for implementing “wait”, e.g. active polling (spin wait) or signals (idle wait), etc. For collective operations, the inter-round “wait” could be avoided in many cases by using triggered operations
- an incoming network packet is processed by the network hardware and triggers one or more response packets. Your “wait for receive, send to children” steps would then be “trigger store-and-forward on receive” programmed into the NIC itself. Having the CPU
blocked would be a waste of resources for this implementation. This strongly argues that nonblocking should exist in the API, even if blocking is also added. Nonblocking already exists - MPI_Start.</div>
<div class=""><br class="">
</div>
<div class="">With regards to interface naming, I would suggest MPI_Start_and_wait, and MPI_Start_and_test. You would also need to consider MPI_Startall_and_waitall and MPI_Startall_and_testall. I would avoid adding additional variants based on MPI_[Wait|Test][any|some].</div>
<div class=""><br class="">
</div>
<div class="">There has been a lengthy debate about whether the persistent collective initialisation functions could/should be blocking or nonblocking. This issue is similar. One could envisage:</div>
<div class=""><br class="">
</div>
<div class="">// fully non-blocking route - maximum opportunity for overlap - assumes normally slow network</div>
<div class="">MPI_Ireduce_init // begin optimisation of a reduction</div>
<div class="">MPI_Test // repeatedly test for completion of the optimisation of the reduction</div>
<div class=""><loop begin></div>
<div class="">MPI_Istart // begin the reduction communication</div>
<div class="">MPI_Test // repeatedly test for completion of the reduction communication</div>
<div class=""><loop end></div>
<div class="">MPI_Request_free // recover resources</div>
<div class=""><br class="">
</div>
<div class="">
<div class="">// fully blocking route - minimum opportunity for overlap - assumes infinitely fast network</div>
<div class="">MPI_Reduce_init // optimise a reduction, blocking</div>
<div class=""><loop begin></div>
<div class="">MPI_Start // do the reduction communication, blocking</div>
<div class=""><loop end></div>
<div class="">MPI_Request_free // recover resources</div>
</div>
<div class=""><br class="">
</div>
<div class="">Some proposed optimisations take a long time and require collective communication, so we have chosen nonblocking initialisation. The current persistent communication workflow is initialise -> (start -> complete)* -> free, so we are not proposing
to have the first MPI_Test in the example above. The existing MPI_Start is nonblocking so our proposal is basically the first of the examples above. It is a minimum change to the MPI Standard to achieve our main goal, i.e. permit a planning step for collective
communications. It does not exclude or prevent additional proposals that extend the API in the manner that you suggest. However, such an extension would need a strong justification to succeed.</div>
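<div class=""><br class="">
</div>
<div class="">In concrete terms, the proposed workflow would look like the following sketch (assuming, for illustration, a draft MPI_Reduce_init signature that mirrors MPI_Ireduce with an added info argument; the exact signature is still under discussion):</div>
<pre>
MPI_Request req;
MPI_Reduce_init(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, root,
                comm, info, &req);        /* planning step (initialisation) */
for (int i = 0; i < iters; i++) {
    MPI_Start(&req);                      /* existing nonblocking start */
    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* or MPI_Test in a loop      */
}
MPI_Request_free(&req);                   /* recover resources */
</pre>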
<div class=""><br class="">
</div>
<div class="">Cheers,</div>
<div class="">Dan.</div>
<div class=""><br class="">
<div>
<blockquote type="cite" class="">
<div class="">On 16 May 2017, at 22:33, Langer, Akhil <<a href="mailto:akhil.langer@intel.com" class="">akhil.langer@intel.com</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="" style="word-wrap:break-word; font-size:14px; font-family:Calibri,sans-serif">
<div class="">Hello, </div>
<div class=""><br class="">
</div>
<div class="">I want to propose an extension to persistent API to allow a blocking MPI_Start call. Currently, MPI_Start calls are non-blocking. So, proposal is something like MPI_Start (for blocking) and MPI_Istart (for non-blocking). Of course, to maintain
backward compatibility we may have to think of an alternative API. I am not proposing the exact API here. </div>
<div class=""><br class="">
</div>
<div class="">The motivation behind the proposal is that having the knowledge whether the corresponding MPI call is blocking or not can give better performance. For example, MPI_Isend followed by MPI_Wait is slower than the MPI_Send because internally MPI_Isend->MPI_Wait
has to allocate additional data structures (for example, request pointer) and do more work. Similarly, lets look at an example of a bcast collective operation. </div>
<div class=""><br class="">
</div>
<div class="">Tree based broadcast can be implemented in two ways:</div>
<ol class="">
<li class="">MPI_Recv (recv data from parent) -> FOREACHCHILD – MPI_Send (send data to children)</li><li class="">MPI_Irecv (recv data from parent) -> MPI_Wait(wait for recv to complete) -> FOREACHCHILD – MPI_Isend (send data to childrent) -> MPI_WaitAll (wait for sends to complete)</li></ol>
<div class="">Having only a non-blocking MPI_Start call forces only implementation 2 as implementation 1 has blocking MPI calls. However, implementation 1 can be significantly faster that implementation 2 for small message sizes.</div>
<div class=""><br class="">
</div>
<div class="">Looking forward to hear your feedback.</div>
<div class=""><br class="">
</div>
<div class="">Thanks,</div>
<div class="">Akhil </div>
<div class=""><br class="">
</div>
<div class="">
<div id="" class=""></div>
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</div>
</span></div>
</div>
</body>
</html>