[mpiwg-persistence] [mpiwg-coll] persistent blocking collectives

Jeff Hammond jeff.science at gmail.com
Wed May 17 18:12:04 CDT 2017


MPI_Start_and_wait makes only slightly more sense than
MPI_Init_and_finalize.  Can someone please show me data that justifies
breaking the orthogonality of these functions that has existed in MPI for
many years?  How many cycles are saved by implementing this function
instead of Dan's following proposal?

int MPI_Start(…) { /* no-op */ }
int MPI_Wait(…) { MPIX_Start_and_wait(…); }

Jeff

On Wed, May 17, 2017 at 3:35 PM, Dan Holmes <d.holmes at epcc.ed.ac.uk> wrote:

> Hi Akhil,
>
> A legal implementation of MPIX_Start_and_wait would be (pseudo-code):
> int MPIX_Start_and_wait(…) { MPI_Start(…); MPI_Wait(…); }
> Adding the interface is not sufficient to force a good implementation of
> that interface.
>
>
> On the other hand, a legal implementation of MPI_Start -> MPI_Wait would
> be (pseudo-code):
> int MPI_Start(…) { /* no-op */ }
> int MPI_Wait(…) { MPIX_Start_and_wait(…); }
> If a good implementation of the new interface existed (better than
> nonblocking start), then it could be used to implement the existing API and
> there would be (almost) zero performance gain from using the new API.
>
> It could be argued that the additional API change is neither necessary nor
> sufficient for the performance improvement. Justification for this new
> extension would have to rely on semantics - is there something that can be
> done with the new interface that cannot be done with the old one?
>
> Cheers,
> Dan.
>
> On 17 May 2017, at 23:07, Anthony Skjellum <skjellum at auburn.edu> wrote:
>
> We succeeded with 15-20 year old cores :-) in overlapping :-)
>
> We will share the paper when done.
>
>
> Anthony Skjellum, PhD
> Professor of Computer Science and Software Engineering and
>     Charles D. McCrary Eminent Scholar Endowed Chair
> Director of the Charles D. McCrary Institute
> Samuel Ginn College of Engineering
> Auburn University
> e-mail: skjellum at auburn.edu or skjellum at gmail.com
> web sites: http://cyber.auburn.edu     http://mccrary.auburn.edu
> cell: +1-205-807-4968 ; office: +1-334-844-6360
>
> CONFIDENTIALITY: This e-mail and any attachments are confidential and
> may be privileged. If you are not a named recipient, please notify the
> sender immediately and do not disclose the contents to another person,
> use it for any purpose or store or copy the information in any medium.
> ------------------------------
> *From:* Langer, Akhil <akhil.langer at intel.com>
> *Sent:* Wednesday, May 17, 2017 5:06 PM
> *To:* Anthony Skjellum; Dan Holmes; mpiwg-coll at lists.mpi-forum.org
> *Cc:* mpiwg-persistence at lists.mpi-forum.org; htor at inf.ethz.ch; Balaji,
> Pavan
>
> *Subject:* Re: persistent blocking collectives
>
> Hi Tony,
>
> I agree that non-blocking MPI_Start is required.
> If possible, can you please point me to the paper? On many-core
> architectures that have slower cores, the difference between blocking and
> non-blocking send/recv calls can be more tangible than it may be on
> architectures that have faster cores.
>
> Thanks,
> Akhil
>
> From: Anthony Skjellum <skjellum at auburn.edu>
> Date: Wednesday, May 17, 2017 at 4:40 PM
> To: Akhil Langer <akhil.langer at intel.com>, Dan Holmes
> <d.holmes at epcc.ed.ac.uk>, "mpiwg-coll at lists.mpi-forum.org"
> <mpiwg-coll at lists.mpi-forum.org>
> Cc: "mpiwg-persistence at lists.mpi-forum.org"
> <mpiwg-persistence at lists.mpi-forum.org>, "htor at inf.ethz.ch"
> <htor at inf.ethz.ch>, "Balaji, Pavan" <balaji at anl.gov>
> Subject: Re: persistent blocking collectives
>
> We have data associated with our first persistent collective paper that
> show no significant advantage of blocking collectives over nonblocking or
> persistent ones, even though we haven't optimized persistent a lot yet.
>
> MPIs with strong progress can give you more benefits for long transfers,
> provided there is a good implementation and sufficient memory bandwidth,
> and you have something to do between Start and Wait...
> We had success with point-to-point-based strong progress and overlap over
> 15 years ago... only for really short message applications did we want
> polling progress or progress only at wait....
>
> Tony
>
>
> ------------------------------
> *From:* Langer, Akhil <akhil.langer at intel.com>
> *Sent:* Wednesday, May 17, 2017 4:29 PM
> *To:* Dan Holmes; mpiwg-coll at lists.mpi-forum.org
> *Cc:* mpiwg-persistence at lists.mpi-forum.org; Anthony Skjellum;
> htor at inf.ethz.ch; Balaji, Pavan
> *Subject:* Re: persistent blocking collectives
>
> Hi Dan,
>
> Thanks a lot for your reply. As you suggested, we could add an
> MPI_Start_and_wait() call that is a blocking version of the MPI_Start call.
> It could be used for both pt2pt and collective operations, without any
> additional changes.
>
> I have noticed a tangible difference in broadcast performance between the
> two implementations that I provided in my original email. Most real HPC
> applications still use only blocking collectives, so having a blocking
> MPI_Start (that is, MPI_Start_and_wait) call for collectives is natural.
> The user can simply replace the blocking collective call with an
> MPI_Start_and_wait call.
> We have also seen that blocking sends/recvs are faster than the
> corresponding non-blocking calls.
>
> Please let me know what kind of information would be useful to make this
> succeed. I can work on this.
>
> Thanks,
> Akhil
>
> From: Dan Holmes <d.holmes at epcc.ed.ac.uk>
> Date: Wednesday, May 17, 2017 at 5:10 AM
> To: Akhil Langer <akhil.langer at intel.com>
> Cc: "mpiwg-persistence at lists.mpi-forum.org"
> <mpiwg-persistence at lists.mpi-forum.org>, Anthony Skjellum
> <skjellum at auburn.edu>
> Subject: Re: persistent blocking collectives
>
> Hi Akhil,
>
> Thank you for your suggestion. This is an interesting area of API design
> for MPI. Let me jot down some notes in response to your points.
>
> The MPI_Start function is used by both our proposed persistent collective
> communications and the existing persistent point-to-point communications.
> For consistency in the MPI Standard, any change to MPI_Start must be
> applied to point-to-point as well.
>
> Our implementation work for persistent collective communication currently
> leverages point-to-point communication in a similar manner to your
> description of the tree broadcast. However, this is not required by the MPI
> Standard and is known to be a sub-optimal implementation choice. The
> interface design should not be determined by the needs of a poor
> implementation method.
>
> All schedules for persistent collective communication operations involve
> multiple “rounds”. Each round concludes with a dependency on one or more
> remote MPI processes, i.e. a “wait”. This is not the case with
> point-to-point, where lower latency can be achieved with a fire-and-forget
> approach in some situations (ready mode or small eager protocol messages).
> Even for small buffer sizes, there is no ready mode or eager protocol for
> collective communications.
>
> There is ongoing debate about the best method for implementing “wait”,
> e.g. active polling (spin wait) or signals (idle wait), etc. For collective
> operations, the inter-round “wait” could be avoided in many cases by using
> triggered operations - an incoming network packet is processed by the
> network hardware and triggers one or more response packets. Your “wait for
> receive, send to children” steps would then be “trigger store-and-forward on
> receive” programmed into the NIC itself. Having the CPU blocked would be a
> waste of resources for this implementation. This strongly argues that
> nonblocking should exist in the API, even if blocking is also added.
> Nonblocking already exists - MPI_Start.
>
> With regard to interface naming, I would suggest MPI_Start_and_wait and
> MPI_Start_and_test. You would also need to consider
> MPI_Startall_and_waitall and MPI_Startall_and_testall. I would avoid adding
> additional variants based on MPI_[Wait|Test][any|some].
>
> There has been a lengthy debate about whether the persistent collective
> initialisation functions could/should be blocking or nonblocking. This
> issue is similar. One could envisage:
>
> // fully non-blocking route - maximum opportunity for overlap - assumes
> // normally slow network
> MPI_Ireduce_init // begin optimisation of a reduction
> MPI_Test         // repeatedly test for completion of the optimisation of
>                  // the reduction
> <loop begin>
>   MPI_Istart     // begin the reduction communication
>   MPI_Test       // repeatedly test for completion of the reduction
>                  // communication
> <loop end>
> MPI_Request_free // recover resources
>
> // fully blocking route - minimum opportunity for overlap - assumes
> // infinitely fast network
> MPI_Reduce_init  // optimise a reduction, blocking
> <loop begin>
>   MPI_Start      // do the reduction communication, blocking
> <loop end>
> MPI_Request_free // recover resources
>
> Some proposed optimisations take a long time and require collective
> communication, so we have chosen nonblocking initialisation. The current
> persistent communication workflow is initialise -> (start -> complete)* ->
> free, so we are not proposing to have the first MPI_Test in the example
> above. The existing MPI_Start is nonblocking so our proposal is basically
> the first of the examples above. It is a minimum change to the MPI Standard
> to achieve our main goal, i.e. permit a planning step for collective
> communications. It does not exclude or prevent additional proposals that
> extend the API in the manner that you suggest. However, such an extension
> would need a strong justification to succeed.
>
> Cheers,
> Dan.
>
> On 16 May 2017, at 22:33, Langer, Akhil <akhil.langer at intel.com> wrote:
>
> Hello,
>
> I want to propose an extension to the persistent API to allow a blocking
> MPI_Start call. Currently, MPI_Start calls are non-blocking. So the proposal
> is something like MPI_Start (for blocking) and MPI_Istart (for
> non-blocking). Of course, to maintain backward compatibility we may have to
> think of an alternative API. I am not proposing the exact API here.
>
> The motivation behind the proposal is that knowing whether the
> corresponding MPI call is blocking or not can give better performance.
> For example, MPI_Isend followed by MPI_Wait is slower than MPI_Send
> because internally MPI_Isend->MPI_Wait has to allocate additional data
> structures (for example, a request pointer) and do more work. Similarly,
> let's look at an example of a bcast collective operation.
>
> A tree-based broadcast can be implemented in two ways:
>
>    1. MPI_Recv (recv data from parent) -> FOREACHCHILD – MPI_Send (send
>    data to children)
>    2. MPI_Irecv (recv data from parent) -> MPI_Wait (wait for recv to
>    complete) -> FOREACHCHILD – MPI_Isend (send data to children) ->
>    MPI_Waitall (wait for sends to complete)
>
> Having only a non-blocking MPI_Start call forces implementation 2, as
> implementation 1 has blocking MPI calls. However, implementation 1 can be
> significantly faster than implementation 2 for small message sizes.
>
> Looking forward to hearing your feedback.
>
> Thanks,
> Akhil



-- 
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/