[mpiwg-p2p] Ordering of P2P messages in multithreaded applications

Balaji, Pavan balaji at anl.gov
Sat Nov 24 14:19:41 CST 2018


Dan,

MPI does not know.  That’s the point.  It has to assume that the order it sees was explicitly caused by the user through some algorithmic synchronization.

Option 1 is silly and would break applications.  Note that this doesn’t need to be OpenMP synchronization: I can create an equivalent model with no OpenMP synchronization by having thread 0 send an MPI message to thread 1 (or even route it through a different process).  It’s impractical for MPI libraries to keep track of the dependencies that user algorithms can create.
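A minimal sketch of that model, assuming MPI_THREAD_MULTIPLE and pthreads; the tags, the self-message "token", and the rank layout are illustrative choices, not from the original mail. Thread 1's send is algorithmically ordered after thread 0's, yet the ordering chain uses only MPI messages, so the MPI library cannot tell it apart from a coincidental schedule:

    #include <mpi.h>
    #include <pthread.h>

    static MPI_Comm comm;  /* rank 0 runs two threads; rank 1 receives */

    static void *thread0(void *arg) {
        int a = 1, token = 0;
        MPI_Send(&a, 1, MPI_INT, 1, 100, comm);               /* message A */
        /* Hand the baton to thread 1 via a self-message: no OpenMP or
         * pthread synchronization anywhere in this ordering chain. */
        MPI_Send(&token, 1, MPI_INT, 0, 999, comm);
        return NULL;
    }

    static void *thread1(void *arg) {
        int b = 2, token;
        MPI_Recv(&token, 1, MPI_INT, 0, 999, comm, MPI_STATUS_IGNORE);
        MPI_Send(&b, 1, MPI_INT, 1, 100, comm);               /* message B */
        return NULL;
    }

    int main(int argc, char **argv) {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
        MPI_Comm_rank(comm, &rank);
        if (rank == 0) {
            pthread_t t0, t1;
            pthread_create(&t0, NULL, thread0, NULL);
            pthread_create(&t1, NULL, thread1, NULL);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);
        } else if (rank == 1) {
            int first, second;
            /* Message A was handed to MPI strictly before message B; whether
             * rank 1 may rely on matching A first is the question at issue. */
            MPI_Recv(&first, 1, MPI_INT, 0, 100, comm, MPI_STATUS_IGNORE);
            MPI_Recv(&second, 1, MPI_INT, 0, 100, comm, MPI_STATUS_IGNORE);
        }
        MPI_Comm_free(&comm);
        MPI_Finalize();
        return 0;
    }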

The text already says option 2, IMO, but I can help make it even clearer.  I’m pretty sure every MPI implementation already does option 2.

Sent from my iPhone

On Nov 24, 2018, at 12:29 PM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk> wrote:

Hi Jeff/Pavan,

How does MPI determine the difference between “in such cases” and “in other cases not covered by the preceding sentence”? When is MPI permitted to ignore the chronological order and when is it required to preserve it?

Thus I think, in addition to this clarification, we should make it crystal clear what MPI is allowed/required to do in response to operations issued on different threads at different times.

Option 1) Change the text to make it explicit and unambiguous that both implementations are permitted and equally valid (my interpretation and Jeff’s interpretation of the current text). However, this means that the user can *never* portably rely on MPI preserving the chronological order, even when that order is enforced using thread synchronisation! This seems untenable/unacceptable to me. If the user cares about order, they must marshal all MPI calls onto a single thread! This tells me we must go for option 2.

Option 2) Require that MPI preserve the ordering that it sees in all circumstances, whatever that ordering is and whatever its cause. If the user cares about ordering, then the user can use thread synchronisation to force the order that MPI sees, and then trust MPI to preserve that deterministic order all the way to the receiver. If the user does not care about order, then they omit the thread synchronisation and MPI will preserve the non-deterministic order it sees at the sender. This also defines what happens if the MPI calls are actually interleaved. I believe this is the current implementation in MPICH (as referenced by Pavan) but it may not apply to *all* MPI libraries. If there is an implementation that does not currently preserve the chronological order (perhaps MVAPICH2), it will need to be changed to do so, possibly introducing additional overhead, such as sequence numbers, and reducing communication performance. The new INFO key mpi_assert_allow_overtaking gives MPI permission to remove this (additional) overhead.
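To make that info-key escape hatch concrete, here is a minimal sketch of how an application might grant the permission on a communicator; the helper name is illustrative, and it assumes an implementation that recognizes the key (unrecognized info keys are simply ignored):

    #include <mpi.h>

    /* Illustrative helper: promise MPI that this application never relies
     * on non-overtaking order on this communicator, so the library may
     * drop ordering overhead such as sequence numbers. */
    static void allow_overtaking(MPI_Comm comm) {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "mpi_assert_allow_overtaking", "true");
        MPI_Comm_set_info(comm, info);
        MPI_Info_free(&info);
    }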

Suggestion:

"The non-overtaking rule extends to send and receive operations issued by different threads in an MPI process. The chronological ordering of send operations, as seen by the MPI library, shall be preserved in the same manner as if the send operations were issued in that sequence by a single thread in an MPI process. Similarly, the ordering of receive operations issued by different threads in an MPI process shall be preserved in the same manner as if they were issued by a single thread."

Or something like that.

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Applications Consultant in HPC Research
d.holmes at epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
—

On 24 Nov 2018, at 17:45, Jeff Hammond <jeff.science at gmail.com> wrote:

Sure, that’s a good fix.

Jeff

On Sat, Nov 24, 2018 at 9:40 AM Balaji, Pavan <balaji at anl.gov> wrote:

That sentence is taken out of context; it only makes sense when placed in the right context.  Something like this:

“the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads. [In such cases] The operations are logically concurrent, even if one physically precedes the other.”

Would this alternate text work?

“the semantics of the threading runtime might or might not define a relative order between two send operations executed by two distinct threads. In such cases, unless the user performs additional synchronization to explicitly order the operations, they are considered to be logically concurrent.”

Sent from my iPhone

On Nov 24, 2018, at 11:33 AM, Jeff Hammond <jeff.science at gmail.com> wrote:



On Nov 24, 2018, at 8:56 AM, Balaji, Pavan <balaji at anl.gov> wrote:

Jeff,

I’m OK with adding additional text to clarify it.

FWIW, I still think the text is not ambiguous.  In particular, it is simply warning the user that the thread execution may not define a relative order (as in, you might accidentally get some order because of the OS behavior).  That does not mean that one cannot achieve a relative order using additional synchronization outside of MPI.


How is the relative order outside of MPI on which you intend to rely not the physical order referenced in the following?

The operations are logically concurrent, even if one physically precedes the other.

In any case, let’s just add some text as you suggested below instead of arguing about it.

It’s hard to know what the right fix is when we cannot agree on whether there is a problem in the first place.

Jeff

  — Pavan

Sent from my iPhone

On Nov 24, 2018, at 10:38 AM, Jeff Hammond <jeff.science at gmail.com> wrote:



On Fri, Nov 23, 2018 at 2:59 PM Balaji, Pavan <balaji at anl.gov> wrote:
Hi Dan,

> On Nov 23, 2018, at 4:11 AM, HOLMES Daniel <d.holmes at epcc.ed.ac.uk> wrote:
> However, it is *also* a correct implementation choice to ignore that “physical order” even in this case because the MPI library does not know, and cannot determine, *why* that “physical order” happened.

I don't think this is a correct implementation, and I'm not sure what part of the chapter is leading you to that interpretation.  If there's algorithmic logic in the application to guarantee an order, then those operations are not logically concurrent.  Although I'm happy to help clarify something that's unclear in the standard, I'm at a loss as to what is unclear here.


As I included before, this is the relevant text:

If a process has a single thread of execution, then any two communications executed by this process are ordered. On the other hand, if the process is multithreaded, then the semantics of thread execution may not define a relative order between two send operations executed by two distinct threads. The operations are logically concurrent, even if one physically precedes the other. In such a case, the two messages sent can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the two receives in either order.

The problem with the text is that it does not state any means for the user to logically order operations on different threads.  The explicit statement that physical order does not imply logical order means that users cannot rely on the order of thread execution alone.

The solution to this problem is twofold: add text indicating that the user can impart a logical order via thread-synchronization primitives that order the execution of sends, and weaken the problematic sentence so that it applies only when the physical ordering is coincidental and not the result of any synchronization between threads.
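As a sketch of what such text would guarantee, assuming MPI_THREAD_MULTIPLE and pthreads (the names are illustrative): the condition variable orders send #2 strictly after send #1 returns, so under the proposed fix the two sends are not logically concurrent and MPI must not reorder them.

    #include <mpi.h>
    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int first_send_done = 0;

    static void thread_a(MPI_Comm comm, const int *buf) {
        MPI_Send(buf, 1, MPI_INT, 1, 0, comm);      /* send #1 */
        pthread_mutex_lock(&lock);
        first_send_done = 1;                        /* publish the order */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    static void thread_b(MPI_Comm comm, const int *buf) {
        pthread_mutex_lock(&lock);
        while (!first_send_done)                    /* wait for send #1 */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
        MPI_Send(buf, 1, MPI_INT, 1, 0, comm);      /* send #2: ordered */
    }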

FWIW, every implementation of MPI that I know of interprets the standard the way I stated it, i.e., those operations are not concurrent and the MPI library has to process them in the order in which it sees them.  Whether that order is an explicit schedule imposed by the user or an accidental schedule created by the OS cannot be determined by the MPI library, so it must respect the order it sees.

It would be good to look at MPI implementations that support multi-rail interconnects.  How does MVAPICH2 mrail implement ordering in this case?  Do they just use one rail per process or one rail per communicator?

Jeff

  -- Pavan



--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/
