[mpiwg-persistence] persistent blocking collectives

Wed May 17 17:07:33 CDT 2017

We succeeded in overlapping communication and computation with 15-20 year old cores :-)

We will share the paper when done.

Anthony Skjellum, PhD
Professor of Computer Science and Software Engineering and
    Charles D. McCrary Eminent Scholar Endowed Chair
Director of the Charles D. McCrary Institute
Samuel Ginn College of Engineering

Auburn University
e-mail: skjellum at auburn.edu or skjellum at gmail.com

web sites: http://cyber.auburn.edu     http://mccrary.auburn.edu

cell: +1-205-807-4968 ; office: +1-334-844-6360

CONFIDENTIALITY: This e-mail and any attachments are confidential and
may be privileged. If you are not a named recipient, please notify the
sender immediately and do not disclose the contents to another person,
use it for any purpose or store or copy the information in any medium.

________________________________
From: Langer, Akhil <akhil.langer at intel.com>
Sent: Wednesday, May 17, 2017 5:06 PM
To: Anthony Skjellum; Dan Holmes; mpiwg-coll at lists.mpi-forum.org
Cc: mpiwg-persistence at lists.mpi-forum.org; htor at inf.ethz.ch; Balaji, Pavan
Subject: Re: persistent blocking collectives

Hi Tony,

I agree that non-blocking MPI_Start is required.
If possible, could you please point me to the paper? On many-core architectures with slower cores, the difference between blocking and non-blocking send/recv calls can be more tangible than it is on architectures with faster cores.

Thanks,
Akhil

From: Anthony Skjellum <skjellum at auburn.edu>
Date: Wednesday, May 17, 2017 at 4:40 PM
To: Akhil Langer <akhil.langer at intel.com>, Dan Holmes <d.holmes at epcc.ed.ac.uk>, "mpiwg-coll at lists.mpi-forum.org" <mpiwg-coll at lists.mpi-forum.org>
Cc: "mpiwg-persistence at lists.mpi-forum.org" <mpiwg-persistence at lists.mpi-forum.org>, "htor at inf.ethz.ch" <htor at inf.ethz.ch>, "Balaji, Pavan" <balaji at anl.gov>
Subject: Re: persistent blocking collectives

We have data associated with our first persistent collective paper that show no significant advantage for blocking collectives over nonblocking or persistent ones, even though we haven't optimized the persistent path much yet.

MPI implementations with strong progress can give you more benefit for long transfers, provided there is a good implementation and sufficient memory bandwidth, and you have something to do between Start and Wait...

We had success with point-to-point-based strong progress and overlap over 15 years ago... only for applications with really short messages did we want polling progress or progress only at wait....

Tony

________________________________
From: Langer, Akhil <akhil.langer at intel.com>
Sent: Wednesday, May 17, 2017 4:29 PM
To: Dan Holmes; mpiwg-coll at lists.mpi-forum.org
Cc: mpiwg-persistence at lists.mpi-forum.org; Anthony Skjellum; htor at inf.ethz.ch; Balaji, Pavan
Subject: Re: persistent blocking collectives

Hi Dan,

Thanks a lot for your reply. As you suggested, we could add an MPI_Start_and_wait() call that is a blocking version of the MPI_Start call. It could be used for both pt2pt and collective operations, without any additional changes.

I have noticed a tangible difference in broadcast performance between the two implementations that I provided in my original email. Most real HPC applications still use only blocking collectives, so having a blocking MPI_Start (that is, MPI_Start_and_wait) call for collectives is natural. The user can simply replace the blocking collective call with an MPI_Start_and_wait call.
We have also seen that blocking sends/recvs are faster than the corresponding non-blocking calls.

Please let me know what kind of information would be useful to make this succeed. I can work on this.

Thanks,
Akhil

From: Dan Holmes <d.holmes at epcc.ed.ac.uk>
Date: Wednesday, May 17, 2017 at 5:10 AM
To: Akhil Langer <akhil.langer at intel.com>
Cc: "mpiwg-persistence at lists.mpi-forum.org" <mpiwg-persistence at lists.mpi-forum.org>, Anthony Skjellum <skjellum at auburn.edu>
Subject: Re: persistent blocking collectives

Hi Akhil,

Thank you for your suggestion. This is an interesting area of API design for MPI. Let me jot down some notes in response to your points.

The MPI_Start function is used by both our proposed persistent collective communications and the existing persistent point-to-point communications. For consistency in the MPI Standard, any change to MPI_Start must be applied to point-to-point as well.

Our implementation work for persistent collective communication currently leverages point-to-point communication in a similar manner to your description of the tree broadcast. However, this is not required by the MPI Standard and is known to be a sub-optimal implementation choice. The interface design should not be determined by the needs of a poor implementation method.

All schedules for persistent collective communication operations involve multiple “rounds”. Each round concludes with a dependency on one or more remote MPI processes, i.e. a “wait”. This is not the case with point-to-point, where lower latency can be achieved with a fire-and-forget approach in some situations (ready mode or small eager protocol messages). Even for small buffer sizes, there is no ready mode or eager protocol for collective communications.
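
To make the notion of rounds concrete, here is a small sketch of one well-known schedule, a binomial-tree broadcast rooted at rank 0. It is only an illustration, not the schedule used by the implementation discussed in this thread, and the function names are hypothetical. In round k, each rank below 2^k already holds the data and sends it to the rank 2^k above it, so a rank that receives in round k cannot send in round k+1 until its incoming message has arrived. That is the per-round dependency described above.

```c
#include <assert.h>
#include <stdio.h>

/* Rounds needed by a binomial-tree broadcast over `size` ranks:
 * the smallest r with 2^r >= size, i.e. ceil(log2(size)). */
static int bcast_rounds(int size)
{
    assert(size >= 1);
    int rounds = 0;
    while ((1 << rounds) < size)
        ++rounds;
    return rounds;
}

/* Rank that sends to `rank` in round `round` (root is rank 0),
 * or -1 if `rank` does not receive in that round. */
static int bcast_sender(int rank, int round)
{
    int span = 1 << round; /* ranks 0..span-1 already hold the data */
    return (rank >= span && rank < 2 * span) ? rank - span : -1;
}

/* Prints every sender->receiver pair, one line per round. */
static void print_schedule(int size)
{
    int rounds = bcast_rounds(size);
    for (int round = 0; round < rounds; ++round) {
        printf("round %d:", round);
        for (int rank = 0; rank < size; ++rank) {
            int from = bcast_sender(rank, round);
            if (from >= 0)
                printf(" %d->%d", from, rank);
        }
        printf("\n");
    }
}
```

Calling print_schedule(8) lists three rounds: rank 0 sends to rank 1, then ranks 0 and 1 send to ranks 2 and 3, then ranks 0 to 3 send to ranks 4 to 7.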

There is ongoing debate about the best method for implementing “wait”, e.g. active polling (spin wait) or signals (idle wait), etc. For collective operations, the inter-round “wait” could be avoided in many cases by using triggered operations - an incoming network packet is processed by the network hardware and triggers one or more response packets. Your “wait for receive, send to children” steps would then be “trigger store-and-forward on receive” programmed into the NIC itself. Having the CPU blocked would be a waste of resources for this implementation. This strongly argues that nonblocking should exist in the API, even if blocking is also added. Nonblocking already exists - MPI_Start.

With regard to interface naming, I would suggest MPI_Start_and_wait and MPI_Start_and_test. You would also need to consider MPI_Startall_and_waitall and MPI_Startall_and_testall. I would avoid adding additional variants based on MPI_[Wait|Test][any|some].

There has been a lengthy debate about whether the persistent collective initialisation functions could/should be blocking or nonblocking. This issue is similar. One could envisage:

// fully non-blocking route - maximum opportunity for overlap - assumes normally slow network
MPI_Ireduce_init // begin optimisation of a reduction
MPI_Test // repeatedly test for completion of the optimisation of the reduction
<loop begin>
MPI_Istart // begin the reduction communication
MPI_Test // repeatedly test for completion of the reduction communication
<loop end>
MPI_Request_free // recover resources

// fully blocking route - minimum opportunity for overlap - assumes infinitely fast network
MPI_Reduce_init // optimise a reduction, blocking
<loop begin>
MPI_Start // do the reduction communication, blocking
<loop end>
MPI_Request_free // recover resources

Some proposed optimisations take a long time and require collective communication, so we have chosen nonblocking initialisation. The current persistent communication workflow is initialise -> (start -> complete)* -> free, so we are not proposing to have the first MPI_Test in the example above. The existing MPI_Start is nonblocking so our proposal is basically the first of the examples above. It is a minimum change to the MPI Standard to achieve our main goal, i.e. permit a planning step for collective communications. It does not exclude or prevent additional proposals that extend the API in the manner that you suggest. However, such an extension would need a strong justification to succeed.
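
The workflow notation can be read as a small state machine. The sketch below checks a sequence of calls against initialise -> (start -> complete)* -> free. The type and function names are hypothetical, and "complete" stands for MPI_Wait or a successful MPI_Test. It models only the workflow as written here, not every rule in the MPI Standard.

```c
#include <assert.h>
#include <stddef.h>

/* Steps of the persistent-communication workflow (hypothetical names). */
typedef enum { OP_INIT, OP_START, OP_COMPLETE, OP_FREE } op_t;

/* Returns 1 if `ops` matches initialise -> (start -> complete)* -> free,
 * otherwise 0. */
static int valid_persistent_workflow(const op_t *ops, size_t n)
{
    enum { S_NONE, S_INACTIVE, S_ACTIVE, S_FREED } state = S_NONE;

    for (size_t i = 0; i < n; ++i) {
        switch (state) {
        case S_NONE:      /* nothing exists yet: only initialise is legal */
            if (ops[i] != OP_INIT)
                return 0;
            state = S_INACTIVE;
            break;
        case S_INACTIVE:  /* request exists but is idle: start or free */
            if (ops[i] == OP_START)
                state = S_ACTIVE;
            else if (ops[i] == OP_FREE)
                state = S_FREED;
            else
                return 0;
            break;
        case S_ACTIVE:    /* operation in flight: must complete next */
            if (ops[i] != OP_COMPLETE)
                return 0;
            state = S_INACTIVE;
            break;
        case S_FREED:     /* nothing may follow free */
            return 0;
        }
    }
    return state == S_FREED;
}
```

A sequence such as init, start, complete, start, complete, free is accepted, while two starts in a row or a missing free are rejected.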

Cheers,
Dan.

On 16 May 2017, at 22:33, Langer, Akhil <akhil.langer at intel.com> wrote:

Hello,

I want to propose an extension to the persistent API to allow a blocking MPI_Start call. Currently, MPI_Start calls are non-blocking. So the proposal is something like MPI_Start (for blocking) and MPI_Istart (for non-blocking). Of course, to maintain backward compatibility we may have to think of an alternative API. I am not proposing the exact API here.

The motivation behind the proposal is that knowing whether the corresponding MPI call is blocking or not can give better performance. For example, MPI_Isend followed by MPI_Wait is slower than MPI_Send because internally the MPI_Isend->MPI_Wait path has to allocate additional data structures (for example, a request pointer) and do more work. Similarly, let's look at an example of a bcast collective operation.

Tree based broadcast can be implemented in two ways:

  1.  MPI_Recv (recv data from parent) -> FOREACHCHILD – MPI_Send (send data to children)
  2.  MPI_Irecv (recv data from parent) -> MPI_Wait (wait for recv to complete) -> FOREACHCHILD – MPI_Isend (send data to children) -> MPI_Waitall (wait for sends to complete)

Having only a non-blocking MPI_Start call forces implementation 2, since implementation 1 uses blocking MPI calls. However, implementation 1 can be significantly faster than implementation 2 for small message sizes.
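
Both variants rely on each rank knowing its parent and its children. As a minimal sketch, assuming a binary tree rooted at rank 0 with heap-style numbering, the FOREACHCHILD step could obtain its targets as shown below. The tree shape is an assumption made for illustration, since this message does not say which tree is used, and the names tree_parent, tree_children and print_tree are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parent of `rank` in a binary tree rooted at rank 0, or -1 for the root. */
static int tree_parent(int rank)
{
    assert(rank >= 0);
    return rank == 0 ? -1 : (rank - 1) / 2;
}

/* Writes the child ranks of `rank` into `children` (at most two) and
 * returns how many of them exist in a communicator of `size` ranks. */
static int tree_children(int rank, int size, int children[2])
{
    assert(rank >= 0 && rank < size);
    int count = 0;
    for (int i = 1; i <= 2; ++i) {
        int child = 2 * rank + i;
        if (child < size)
            children[count++] = child;
    }
    return count;
}

/* Prints, for every rank, where it receives from and where it sends to. */
static void print_tree(int size)
{
    for (int rank = 0; rank < size; ++rank) {
        int children[2];
        int n = tree_children(rank, size, children);
        printf("rank %d: parent %d, children:", rank, tree_parent(rank));
        for (int i = 0; i < n; ++i)
            printf(" %d", children[i]);
        printf("\n");
    }
}
```

With six ranks, for example, rank 2 receives from rank 0 and forwards only to rank 5. In implementation 1 each of those transfers would be a blocking MPI_Recv or MPI_Send; in implementation 2 the same ranks would be passed to MPI_Irecv and MPI_Isend.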

Looking forward to hearing your feedback.

Thanks,
Akhil
