[mpiwg-coll] persistent blocking collectives
akhil.langer at intel.com
Wed May 17 16:29:31 CDT 2017
Thanks a lot for your reply. As you suggested, we could add an MPI_Start_and_wait() call that is a blocking version of the MPI_Start call. It could be used for both pt2pt and collective operations, without any additional changes.
I have noticed a tangible difference in broadcast performance between the two implementations that I provided in my original email. Most real HPC applications still use only blocking collectives, so having a blocking MPI_Start (that is, MPI_Start_and_wait) call for collectives is natural. The user can simply replace the blocking collective call with an MPI_Start_and_wait call.
We have also seen that blocking sends/recvs are faster than the corresponding non-blocking calls.
Please let me know what kind of information would be useful to make this succeed. I can work on this.
From: Dan Holmes <d.holmes at epcc.ed.ac.uk>
Date: Wednesday, May 17, 2017 at 5:10 AM
To: Akhil Langer <akhil.langer at intel.com>
Cc: mpiwg-persistence at lists.mpi-forum.org, Anthony Skjellum <skjellum at auburn.edu>
Subject: Re: persistent blocking collectives
Thank you for your suggestion. This is an interesting area of API design for MPI. Let me jot down some notes in response to your points.
The MPI_Start function is used by both our proposed persistent collective communications and the existing persistent point-to-point communications. For consistency in the MPI Standard, any change to MPI_Start must be applied to point-to-point as well.
Our implementation work for persistent collective communication currently leverages point-to-point communication in a similar manner to your description of the tree broadcast. However, this is not required by the MPI Standard and is known to be a sub-optimal implementation choice. The interface design should not be determined by the needs of a poor implementation method.
All schedules for persistent collective communication operations involve multiple “rounds”. Each round concludes with a dependency on one or more remote MPI processes, i.e. a “wait”. This is not the case with point-to-point, where lower latency can be achieved with a fire-and-forget approach in some situations (ready mode or small eager protocol messages). Even for small buffer sizes, there is no ready mode or eager protocol for collective communications.
There is ongoing debate about the best method for implementing “wait”, e.g. active polling (spin wait) or signals (idle wait), etc. For collective operations, the inter-round “wait” could be avoided in many cases by using triggered operations - an incoming network packet is processed by the network hardware and triggers one or more response packets. Your “wait for receive, send to children” steps would then be “trigger store-and-forward on receive” programmed into the NIC itself. Having the CPU blocked would be a waste of resources for this implementation. This strongly argues that nonblocking should exist in the API, even if blocking is also added. Nonblocking already exists - MPI_Start.
With regard to interface naming, I would suggest MPI_Start_and_wait and MPI_Start_and_test. You would also need to consider MPI_Startall_and_waitall and MPI_Startall_and_testall. I would avoid adding additional variants based on MPI_[Wait|Test][any|some].
There has been a lengthy debate about whether the persistent collective initialisation functions could/should be blocking or nonblocking. This issue is similar. One could envisage:
// fully non-blocking route - maximum opportunity for overlap - assumes normally slow network
MPI_Ireduce_init // begin optimisation of a reduction
MPI_Test // repeatedly test for completion of the optimisation of the reduction
MPI_Istart // begin the reduction communication
MPI_Test // repeatedly test for completion of the reduction communication
MPI_Request_free // recover resources
// fully blocking route - minimum opportunity for overlap - assumes infinitely fast network
MPI_Reduce_init // optimise a reduction, blocking
MPI_Start // do the reduction communication, blocking
MPI_Request_free // recover resources
Some proposed optimisations take a long time and require collective communication, so we have chosen nonblocking initialisation. The current persistent communication workflow is initialise -> (start -> complete)* -> free, so we are not proposing to have the first MPI_Test in the example above. The existing MPI_Start is nonblocking so our proposal is basically the first of the examples above. It is a minimum change to the MPI Standard to achieve our main goal, i.e. permit a planning step for collective communications. It does not exclude or prevent additional proposals that extend the API in the manner that you suggest. However, such an extension would need a strong justification to succeed.
On 16 May 2017, at 22:33, Langer, Akhil <akhil.langer at intel.com> wrote:
I want to propose an extension to the persistent API to allow a blocking MPI_Start call. Currently, MPI_Start calls are non-blocking. So the proposal is something like MPI_Start (for blocking) and MPI_Istart (for non-blocking). Of course, to maintain backward compatibility we may have to think of an alternative API. I am not proposing the exact API here.
The motivation behind the proposal is that knowing whether the corresponding MPI call is blocking or not can give better performance. For example, MPI_Isend followed by MPI_Wait is slower than MPI_Send because internally MPI_Isend->MPI_Wait has to allocate additional data structures (for example, a request pointer) and do more work. Similarly, let's look at an example of a bcast collective operation.
Tree based broadcast can be implemented in two ways:
1. MPI_Recv (recv data from parent) -> FOREACHCHILD – MPI_Send (send data to children)
2. MPI_Irecv (recv data from parent) -> MPI_Wait (wait for recv to complete) -> FOREACHCHILD – MPI_Isend (send data to children) -> MPI_Waitall (wait for sends to complete)
Having only a non-blocking MPI_Start call forces implementation 2, because implementation 1 uses blocking MPI calls. However, implementation 1 can be significantly faster than implementation 2 for small message sizes.
Looking forward to hearing your feedback.
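Both implementations walk the same tree and differ only in which MPI calls they make at each node. The tree arithmetic can be sketched as follows, assuming a binary tree rooted at rank 0; the original message does not specify the tree shape, and the names tree_parent and tree_children are hypothetical.

```c
#include <assert.h>

/* Parent of `rank` in a binary tree rooted at rank 0; -1 for the root. */
int tree_parent(int rank)
{
    assert(rank >= 0);
    return rank == 0 ? -1 : (rank - 1) / 2;
}

/* Writes the children of `rank` into child[] and returns how many
 * there are (0, 1 or 2) in a tree of `size` ranks. */
int tree_children(int rank, int size, int child[2])
{
    assert(rank >= 0 && rank < size);

    int n = 0;
    for (int k = 1; k <= 2; ++k) {
        int c = 2 * rank + k;   /* candidate child rank */
        if (c < size)
            child[n++] = c;
    }
    return n;
}
```

In implementation 1, a non-root rank would receive from tree_parent(rank) with MPI_Recv and then call MPI_Send once per entry in child[]. Implementation 2 would use the same ranks with MPI_Irecv and MPI_Wait, followed by MPI_Isend per child and a single MPI_Waitall.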