[mpiwg-coll] persistent blocking collectives

Thu May 18 01:13:44 CDT 2017

All,

I agree; this seems odd. Furthermore, consider that one can implement
persistent collectives today, without changes to the standard, using info
objects that change communicators (similar to the use of info in MPI IO).
In fact, this approach allows combining nonblocking and blocking invocation
as well as different persistence levels (topology only, or topology and
sizes). So it should cover the whole space.

It's documented in a paper at SC12, which also shows some performance
advantages even though it could be tuned more:
http://htor.inf.ethz.ch/publications/img/hoefler-schneider-neighbor-colls.pdf
(Section II).

Best,
	Torsten

On 05/18/2017 01:12 AM, Jeff Hammond wrote:
> MPI_Start_and_wait makes only slightly more sense than
> MPI_Init_and_finalize.  Can someone please show me data that justifies
> breaking the orthogonality of these functions that has existed in MPI
> for many years?  How many cycles are saved by implementing this function
> instead of Dan's following proposal?
>
> int MPI_Start(…) { /* no-op */ }
> int MPI_Wait(…) { MPIX_Start_and_wait (…); }
>
> Jeff
>
> On Wed, May 17, 2017 at 3:35 PM, Dan Holmes <d.holmes at epcc.ed.ac.uk
> <mailto:d.holmes at epcc.ed.ac.uk>> wrote:
>
>     Hi Akhil,
>
>     A legal implementation of MPIX_Start_and_wait would be (pseudo-code):
>     int MPIX_Start_and_wait(…) { MPI_Start(…); MPI_Wait(…); }
>     Adding the interface is not sufficient to force a good
>     implementation of that interface.
>
>     On the other hand, a legal implementation of MPI_Start -> MPI_Wait
>     would be (pseudo-code):
>     int MPI_Start(…) { /* no-op */ }
>     int MPI_Wait(…) { MPIX_Start_and_wait (…); }
>     If a good implementation of the new interface existed (better than
>     nonblocking start), then it could be used to implement the existing
>     API and there would be (almost) zero performance gain from using the
>     new API.
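
To make the two layerings in Dan's pseudo-code concrete, here is a toy C model. It does not use MPI at all: `toy_request`, `fused_start_and_wait`, and the `l1_*`/`l2_*` functions are hypothetical names invented for illustration, and the "communication" is just a callback. The sketch only shows that doing the work at start (layering 1) or deferring it all to wait (layering 2) leaves the same observable state once the wait returns.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a persistent request; this is NOT MPI_Request. */
typedef struct {
    void (*op)(void *ctx); /* the "communication", modelled as a callback */
    void *ctx;
    bool started;          /* true between start and the matching wait */
} toy_request;

/* Stand-in for an optimised blocking path (hypothetical). */
void fused_start_and_wait(toy_request *r) { r->op(r->ctx); }

/* Layering 1: the fused call is built from start followed by wait. */
void l1_start(toy_request *r) {
    assert(!r->started);
    r->started = true;
    r->op(r->ctx);         /* in this toy, start does the work eagerly */
}
void l1_wait(toy_request *r) {
    assert(r->started);
    r->started = false;
}
void l1_start_and_wait(toy_request *r) { l1_start(r); l1_wait(r); }

/* Layering 2: start does no communication work; wait calls the fused path. */
void l2_start(toy_request *r) {
    assert(!r->started);
    r->started = true;     /* bookkeeping only */
}
void l2_wait(toy_request *r) {
    assert(r->started);
    fused_start_and_wait(r);
    r->started = false;
}

/* Example operation: count how many times the request was executed. */
void count_call(void *ctx) { ++*(int *)ctx; }
```

A caller that only observes state after the wait cannot distinguish the two layerings, which is the sense in which the extra entry point is neither necessary nor sufficient for performance. Any difference would have to come from semantics or from overlap between start and wait, which this toy does not model.
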
>
>     It could be argued that the additional API change is neither
>     necessary nor sufficient for the performance improvement.
>     Justification for this new extension would have to rely on semantics
>     - is there something that can be done with the new interface that
>     cannot be done with the old one?
>
>     Cheers,
>     Dan.
>
>>     On 17 May 2017, at 23:07, Anthony Skjellum <skjellum at auburn.edu
>>     <mailto:skjellum at auburn.edu>> wrote:
>>
>>     We succeeded with 15-20 year old cores :-) in overlapping :-)
>>
>>     We will share the paper when done.
>>
>>
>>     Anthony Skjellum, PhD
>>     Professor of Computer Science and Software Engineering and
>>         Charles D. McCrary Eminent Scholar Endowed Chair
>>     Director of the Charles D. McCrary Institute
>>     Samuel Ginn College of Engineering
>>     Auburn University
>>     e-mail: skjellum at auburn.edu
>>     <mailto:skjellum at auburn.edu> or skjellum at gmail.com
>>     <mailto:skjellum at gmail.com>
>>     web sites: http://cyber.auburn.edu <http://cyber.auburn.edu/>
>>      http://mccrary.auburn.edu <http://mccrary.auburn.edu/>
>>     cell: +1-205-807-4968 ; office: +1-334-844-6360
>>
>>     CONFIDENTIALITY: This e-mail and any attachments are confidential and
>>     may be privileged. If you are not a named recipient, please notify
>>     the
>>     sender immediately and do not disclose the contents to another
>>     person,
>>     use it for any purpose or store or copy the information in any medium.
>>     ------------------------------------------------------------------------
>>     *From:* Langer, Akhil <akhil.langer at intel.com
>>     <mailto:akhil.langer at intel.com>>
>>     *Sent:* Wednesday, May 17, 2017 5:06 PM
>>     *To:* Anthony Skjellum; Dan Holmes; mpiwg-coll at lists.mpi-forum.org
>>     <mailto:mpiwg-coll at lists.mpi-forum.org>
>>     *Cc:* mpiwg-persistence at lists.mpi-forum.org
>>     <mailto:mpiwg-persistence at lists.mpi-forum.org>; htor at inf.ethz.ch
>>     <mailto:htor at inf.ethz.ch>; Balaji, Pavan
>>
>>     *Subject:* Re: persistent blocking collectives
>>
>>     Hi Tony,
>>
>>     I agree that non-blocking MPI_Start is required.
>>     If possible, can you please point me to the paper? With many-core
>>     architectures that have slower cores, the difference between
>>     blocking and non-blocking send/recv calls can be more tangible
>>     than it might be on architectures that have faster cores.
>>
>>     Thanks,
>>     Akhil
>>
>>     From: Anthony Skjellum <skjellum at auburn.edu
>>     <mailto:skjellum at auburn.edu>>
>>     Date: Wednesday, May 17, 2017 at 4:40 PM
>>     To: Akhil Langer <akhil.langer at intel.com
>>     <mailto:akhil.langer at intel.com>>, Dan Holmes
>>     <d.holmes at epcc.ed.ac.uk <mailto:d.holmes at epcc.ed.ac.uk>>,
>>     "mpiwg-coll at lists.mpi-forum.org
>>     <mailto:mpiwg-coll at lists.mpi-forum.org>"
>>     <mpiwg-coll at lists.mpi-forum.org
>>     <mailto:mpiwg-coll at lists.mpi-forum.org>>
>>     Cc: "mpiwg-persistence at lists.mpi-forum.org
>>     <mailto:mpiwg-persistence at lists.mpi-forum.org>"
>>     <mpiwg-persistence at lists.mpi-forum.org
>>     <mailto:mpiwg-persistence at lists.mpi-forum.org>>, "htor at inf.ethz.ch
>>     <mailto:htor at inf.ethz.ch>" <htor at inf.ethz.ch
>>     <mailto:htor at inf.ethz.ch>>, "Balaji, Pavan" <balaji at anl.gov
>>     <mailto:balaji at anl.gov>>
>>     Subject: Re: persistent blocking collectives
>>
>>     We have data associated with our first persistent collective paper
>>     that show no significant advantage of blocking collectives over
>>     nonblocking or persistent ones, even though we haven't optimized
>>     persistent a lot yet.
>>
>>     MPIs with strong progress can give you more benefits for long
>>     transfers, provided there is a good implementation and sufficient
>>     memory bandwidth, and you have something to do between Start and
>>     Wait...
>>     We had success with point-to-point-based strong progress and
>>     overlap over 15 years ago... only for really short message
>>     applications did we want polling progress or progress only at wait...
>>
>>     Tony
>>
>>
>>     ------------------------------------------------------------------------
>>     *From:* Langer, Akhil <akhil.langer at intel.com
>>     <mailto:akhil.langer at intel.com>>
>>     *Sent:* Wednesday, May 17, 2017 4:29 PM
>>     *To:* Dan Holmes; mpiwg-coll at lists.mpi-forum.org
>>     <mailto:mpiwg-coll at lists.mpi-forum.org>
>>     *Cc:* mpiwg-persistence at lists.mpi-forum.org
>>     <mailto:mpiwg-persistence at lists.mpi-forum.org>; Anthony
>>     Skjellum; htor at inf.ethz.ch <mailto:htor at inf.ethz.ch>; Balaji, Pavan
>>     *Subject:* Re: persistent blocking collectives
>>
>>     Hi Dan,
>>
>>     Thanks a lot for your reply. As you suggested, we could add an
>>     MPI_Start_and_wait() call that is a blocking version of the
>>     MPI_Start call. It could be used for both pt2pt and collective
>>     operations, without any additional changes.
>>
>>     I have noticed a tangible difference in broadcast performance
>>     between the two implementations that I provided in my original
>>     email. Most real HPC applications still use only blocking
>>     collectives, so having a blocking MPI_Start (that is,
>>     MPI_Start_and_wait) call for collectives is natural. The
>>     user can simply replace the blocking collective call with
>>     an MPI_Start_and_wait call.
>>     We have also seen that blocking sends/recvs are faster than the
>>     corresponding non-blocking calls.
>>
>>     Please let me know what kind of information would be useful to
>>     make this succeed. I can work on this.
>>
>>     Thanks,
>>     Akhil
>>
>>     From: Dan Holmes <d.holmes at epcc.ed.ac.uk
>>     <mailto:d.holmes at epcc.ed.ac.uk>>
>>     Date: Wednesday, May 17, 2017 at 5:10 AM
>>     To: Akhil Langer <akhil.langer at intel.com
>>     <mailto:akhil.langer at intel.com>>
>>     Cc: "mpiwg-persistence at lists.mpi-forum.org
>>     <mailto:mpiwg-persistence at lists.mpi-forum.org>"
>>     <mpiwg-persistence at lists.mpi-forum.org
>>     <mailto:mpiwg-persistence at lists.mpi-forum.org>>, Anthony Skjellum
>>     <skjellum at auburn.edu <mailto:skjellum at auburn.edu>>
>>     Subject: Re: persistent blocking collectives
>>
>>     Hi Akhil,
>>
>>     Thank you for your suggestion. This is an interesting area of API
>>     design for MPI. Let me jot down some notes in response to your points.
>>
>>     The MPI_Start function is used by both our proposed persistent
>>     collective communications and the existing persistent
>>     point-to-point communications. For consistency in the MPI
>>     Standard, any change to MPI_Start must be applied to
>>     point-to-point as well.
>>
>>     Our implementation work for persistent collective communication
>>     currently leverages point-to-point communication in a similar
>>     manner to your description of the tree broadcast. However, this is
>>     not required by the MPI Standard and is known to be a sub-optimal
>>     implementation choice. The interface design should not be
>>     determined by the needs of a poor implementation method.
>>
>>     All schedules for persistent collective communication operations
>>     involve multiple “rounds”. Each round concludes with a dependency
>>     on one or more remote MPI processes, i.e. a “wait”. This is not
>>     the case with point-to-point, where lower latency can be achieved
>>     with a fire-and-forget approach in some situations (ready mode or
>>     small eager protocol messages). Even for small buffer sizes, there
>>     is no ready mode or eager protocol for collective communications.
>>
>>     There is ongoing debate about the best method for implementing
>>     “wait”, e.g. active polling (spin wait) or signals (idle wait),
>>     etc. For collective operations, the inter-round “wait” could be
>>     avoided in many cases by using triggered operations - an incoming
>>     network packet is processed by the network hardware and triggers
>>     one or more response packets. Your “wait for receive, send to
>>     children” steps would then be “trigger store-and-forward on
>>     receive” programmed into the NIC itself. Having the CPU blocked
>>     would be a waste of resources for this implementation. This
>>     strongly argues that nonblocking should exist in the API, even if
>>     blocking is also added. Nonblocking already exists - MPI_Start.
>>
>>     With regard to interface naming, I would suggest
>>     MPI_Start_and_wait, and MPI_Start_and_test. You would also need to
>>     consider MPI_Startall_and_waitall and MPI_Startall_and_testall. I
>>     would avoid adding additional variants based on
>>     MPI_[Wait|Test][any|some].
>>
>>     There has been a lengthy debate about whether the persistent
>>     collective initialisation functions could/should be blocking or
>>     nonblocking. This issue is similar. One could envisage:
>>
>>     // fully non-blocking route - maximum opportunity for overlap -
>>     // assumes normally slow network
>>     MPI_Ireduce_init // begin optimisation of a reduction
>>     MPI_Test // repeatedly test for completion of the optimisation of
>>              // the reduction
>>     <loop begin>
>>     MPI_Istart // begin the reduction communication
>>     MPI_Test // repeatedly test for completion of the reduction
>>              // communication
>>     <loop end>
>>     MPI_Request_free // recover resources
>>
>>     // fully blocking route - minimum opportunity for overlap -
>>     // assumes infinitely fast network
>>     MPI_Reduce_init // optimise a reduction, blocking
>>     <loop begin>
>>     MPI_Start // do the reduction communication, blocking
>>     <loop end>
>>     MPI_Request_free // recover resources
>>
>>     Some proposed optimisations take a long time and require
>>     collective communication, so we have chosen nonblocking
>>     initialisation. The current persistent communication workflow is
>>     initialise -> (start -> complete)* -> free, so we are not
>>     proposing to have the first MPI_Test in the example above. The
>>     existing MPI_Start is nonblocking so our proposal is basically the
>>     first of the examples above. It is a minimum change to the MPI
>>     Standard to achieve our main goal, i.e. permit a planning step for
>>     collective communications. It does not exclude or prevent
>>     additional proposals that extend the API in the manner that you
>>     suggest. However, such an extension would need a strong
>>     justification to succeed.
>>
>>     Cheers,
>>     Dan.
>>
>>>     On 16 May 2017, at 22:33, Langer, Akhil <akhil.langer at intel.com
>>>     <mailto:akhil.langer at intel.com>> wrote:
>>>
>>>     Hello,
>>>
>>>     I want to propose an extension to the persistent API to allow a
>>>     blocking MPI_Start call. Currently, MPI_Start calls are
>>>     non-blocking. So, the proposal is something like MPI_Start (for
>>>     blocking) and MPI_Istart (for non-blocking). Of course, to
>>>     maintain backward compatibility we may have to think of an
>>>     alternative API. I am not proposing the exact API here.
>>>
>>>     The motivation behind the proposal is that knowing whether the
>>>     corresponding MPI call is blocking or not can give better
>>>     performance. For example, MPI_Isend followed by MPI_Wait
>>>     is slower than MPI_Send because internally
>>>     MPI_Isend->MPI_Wait has to allocate additional data structures
>>>     (for example, a request pointer) and do more work. Similarly, let's
>>>     look at an example of a bcast collective operation.
>>>
>>>     Tree-based broadcast can be implemented in two ways:
>>>
>>>      1. MPI_Recv (recv data from parent) -> FOREACHCHILD: MPI_Send
>>>         (send data to children)
>>>      2. MPI_Irecv (recv data from parent) -> MPI_Wait (wait for recv
>>>         to complete) -> FOREACHCHILD: MPI_Isend (send data to
>>>         children) -> MPI_Waitall (wait for sends to complete)
>>>
>>>     Having only a non-blocking MPI_Start call forces
>>>     implementation 2, as implementation 1 uses blocking MPI calls.
>>>     However, implementation 1 can be significantly faster than
>>>     implementation 2 for small message sizes.
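
The two broadcast variants Akhil lists can be sketched in C as follows. The tree shape is not specified in the thread, so this assumes a binary tree rooted at rank 0; `tree_parent`, `tree_children`, `bcast_blocking`, `bcast_nonblocking`, and the `USE_MPI` build switch are hypothetical names chosen for illustration, and tag 0 is arbitrary. The MPI calls themselves are the standard point-to-point routines. Error handling is omitted.

```c
#include <assert.h>
#ifdef USE_MPI /* hypothetical switch: build with e.g. mpicc -DUSE_MPI */
#include <mpi.h>
#endif

/* Assumed layout: binary tree rooted at rank 0; the root has no parent. */
int tree_parent(int rank) {
    assert(rank >= 0);
    return rank == 0 ? -1 : (rank - 1) / 2;
}

/* Writes up to two child ranks into out[] and returns how many exist. */
int tree_children(int rank, int size, int out[2]) {
    assert(rank >= 0 && rank < size);
    int n = 0;
    for (int k = 1; k <= 2; ++k) {
        int child = 2 * rank + k;
        if (child < size) out[n++] = child;
    }
    return n;
}

#ifdef USE_MPI
/* Implementation 1: blocking receive from parent, blocking sends to children. */
void bcast_blocking(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, size, kids[2];
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int parent = tree_parent(rank);
    if (parent >= 0)
        MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
    int n = tree_children(rank, size, kids);
    for (int i = 0; i < n; ++i)
        MPI_Send(buf, count, type, kids[i], 0, comm);
}

/* Implementation 2: same data flow, using nonblocking calls plus waits. */
void bcast_nonblocking(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, size, kids[2];
    MPI_Request recv_req, send_reqs[2];
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int parent = tree_parent(rank);
    if (parent >= 0) {
        MPI_Irecv(buf, count, type, parent, 0, comm, &recv_req);
        MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    }
    int n = tree_children(rank, size, kids);
    for (int i = 0; i < n; ++i)
        MPI_Isend(buf, count, type, kids[i], 0, comm, &send_reqs[i]);
    MPI_Waitall(n, send_reqs, MPI_STATUSES_IGNORE);
}
#endif
```

Both variants move the same data along the same edges; they differ only in whether request objects are created and waited on. Whether that difference is measurable depends on the MPI implementation and message size, so it would need to be benchmarked rather than assumed.
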
>>>
>>>     Looking forward to hearing your feedback.
>>>
>>>     Thanks,
>>>     Akhil
>>>
>>
>>
>
>
>     The University of Edinburgh is a charitable body, registered in
>     Scotland, with registration number SC005336.
>
>
> --
> Jeff Hammond
> jeff.science at gmail.com <mailto:jeff.science at gmail.com>
> http://jeffhammond.github.io/

-- 
### qreharg rug ebs fv crryF ---------- http://htor.inf.ethz.ch/ -----
Torsten Hoefler           | Assistant Professor
Dept. of Computer Science | ETH Zürich
Universitätsstrasse 6     | Zurich-8092, Switzerland
CAB F 75                  | Phone: +41 44 632 63 44