[mpiwg-p2p] Message matching for tools

Daniel Holmes dholmes at epcc.ed.ac.uk
Thu Dec 17 09:50:31 CST 2015

Hi Marc-Andre,

The mpi_assert_allow_overtaking INFO key essentially makes *all* send 
operations that have identical matching criteria logically concurrent 
and *all* receive operations that have overlapping matching criteria 
logically concurrent, irrespective of which thread issued them. So, 
using the INFO key can be seen as a special case of handling the full 

In the existing MPI Standard, any program that relies on the tool being 
able to recover the particular matching order that actually happened 
from a set of possible matching orders that all satisfy the MPI 
definitions (for example by relying on the tool using some externally 
imposed mechanism, such as piggy-backed sequence numbers) is relying on 
a particular sequentialisation of a race-condition between logically 
concurrent MPI messages, which is specifically called out as being an 
erroneous program.

From the MPI Forum's point of view, therefore, any behaviour in this 
situation is allowed, including setting the data-centre on fire, and our 
work here is done.

From a tool's perspective, however, the existence and consequences of 
this situation should be discovered and presented to the user so that 
they can fix their erroneous program. This could be done by recognising 
when multiple send operations are logically concurrent and when multiple 
receive operations are logically concurrent. If two send operations are 
issued with the same matching criteria but by different threads then 
these will be logically concurrent (edit: unless an MPI synchronisation 
point exists such that the two cannot conflict). The same is true of 
receive operations.
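
As a rough illustration of the check described above, here is a minimal Python sketch that flags sends from one OS process which share matching criteria, were issued by different threads, and are not separated by a synchronisation point. The record layout (`SendEvent`, and in particular an `epoch` counter that increases at each MPI synchronisation point) and the function name are hypothetical, not taken from any existing tool, and treating synchronisation as a single per-process counter is a simplifying assumption.

```python
from collections import defaultdict, namedtuple

# Hypothetical trace record; all field names are assumptions for illustration.
# `epoch` counts the MPI synchronisation points seen before the send was issued.
SendEvent = namedtuple("SendEvent", "thread dest tag comm epoch")


def concurrent_send_groups(events):
    """Group sends by matching criteria and epoch; keep the groups that
    contain sends from more than one thread."""
    groups = defaultdict(list)
    for ev in events:
        groups[(ev.dest, ev.tag, ev.comm, ev.epoch)].append(ev)
    return [g for g in groups.values() if len({ev.thread for ev in g}) > 1]


# Sends recorded for one OS process.
trace = [
    SendEvent(thread=0, dest=1, tag=7, comm="world", epoch=0),
    SendEvent(thread=1, dest=1, tag=7, comm="world", epoch=0),  # races with the first
    SendEvent(thread=1, dest=1, tag=7, comm="world", epoch=1),  # after a sync point
    SendEvent(thread=0, dest=2, tag=7, comm="world", epoch=0),  # different destination
]
races = concurrent_send_groups(trace)
for group in races:
    print("logically concurrent sends:", group)
```

Receive operations would need an overlap test rather than an equality test on the matching criteria, because wildcards such as MPI_ANY_SOURCE and MPI_ANY_TAG can make differently specified receives match the same message.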

The existence of this situation could be recorded during measurement, at 
the cost of searching the trace recorded so far for the OS process (edit: 
or only the part since the last MPI synchronisation point) each time a 
new send or receive is issued. Alternatively, the existence of this situation could 
be discovered by examination of the trace during post-processing or 

The consequences of this situation could be presented to the user by 
showing all possible matching orders, as indicated by the information in 
the trace concerning which messages were logically concurrent during the 
actual run of the application. The tool would not be able to tell the 
user which matching order actually occurred during the measurement run 
but it would be able to identify that there was a race-condition, 
display all the possible outcomes, and simulate the effects of each 
possible route through the program.
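
To make "showing all possible matching orders" concrete, here is a small Python sketch that enumerates them for one group of logically concurrent sends. It assumes, purely for illustration, that every receive in the group can match every send in the group and that there are equally many of each; the function name and the string labels are hypothetical.

```python
from itertools import permutations


def possible_matchings(sends, recvs):
    """Yield each possible pairing of sends to receives.

    Assumes all sends are mutually logically concurrent and that every
    receive can match every send, so each arrival order is permitted.
    """
    if len(sends) != len(recvs):
        raise ValueError("sketch assumes equally many sends and receives")
    for order in permutations(sends):
        yield list(zip(order, recvs))


# Two sends with identical matching criteria, issued by different threads.
matchings = list(possible_matchings(["send@thread0", "send@thread1"],
                                    ["recv0", "recv1"]))
for m in matchings:
    print(m)
```

The number of outcomes grows factorially with the size of the group, so a tool would probably have to limit or summarise what it displays for large groups.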

Does this make sense and is this sufficient?


On 17/12/2015 14:32, Marc-Andre Hermanns wrote:
> Hi Jeff,
> at the moment we don't handle MPI_THREAD_MULTIPLE at all. But we want
> to get there ;-)
> Here is a short recap of what we do/need. Apologies to the folks
> who have already read this in another context:
> We use what we call "parallel replay" to analyze large event traces
> in parallel. Each thread has its own stream of events, such as enter
> and exit for tracking the calling context as well as send and receive
> for communication among ranks.
> We analyze on the same scale as the measurement, thus we have one
> thread per thread-local trace. Each thread processes its own
> thread-local trace. When encountering a communication event, it
> re-enacts this communication using the recorded communication
> parameters (rank, tag, comm). A send event leads to an issued send, a
> receive event leads to an issued receive.
> It is critical that during the analysis, the message matching is
> identical to that of the original application. However, we do not
> re-enact any computation time, so the temporal distance between sends
> and receives will certainly differ from the original application. As a
> consequence, while two sends may have some significant temporal
> distance in the original measurement, they could be issued right after
> each other during the analysis.
> Markus Geimer and I believe that creating some sort of a sequence
> number during measurement could help matching the right messages
> during the analysis, as a process could detect that it got mismatched
> messages and communicate with other threads to get the correct one.
> It is unclear, however, how to achieve this:
> a) Sending an additional message within the MPI wrapper at measurement
> time may lead to invalid matchings, as the additional message may be
> received by a different thread.
> b) Creating a derived datatype on the fly to add tool-level data to
> the original payload may induce a large overhead in practically
> _every_ send & receive operation and perturb the measurement.
> The best idea in the room so far is some callback mechanism into the
> MPI implementation that generates matching information both on sender
> and receiver side to generate some form of a sequence number that can
> then be saved during measurement. If available on both sender and
> receiver this information could then be used to fix incorrect matching
> during the analysis.
> Cheers,
> Marc-Andre
> On 16.12.15 16:40, Jeff Hammond wrote:
>> How do you handle MPI_THREAD_MULTIPLE?  Understanding what your tool
>> does there is a good starting point for this discussion.
>> Jeff
>> On Wed, Dec 16, 2015 at 1:37 AM, Marc-Andre Hermanns
>> <hermanns at jara.rwth-aachen.de> wrote:
>>      Hi all,
>>      CC: Tools-WG, Markus Geimer (not on either list)
>>      sorry for starting a new thread and being so verbose, but I subscribed
>>      just now. I quoted Dan, Jeff, and Jim from the archive as appropriate.
>>      First, let me state that we do not want to prevent this assertion in
>>      any way. For us as tool providers, it is just quite a brain teaser
>>      how to support this in our tool and in general.
>>      Dan wrote:
>>      >>> [...] The basic problem is that message matching would be
>>      >>> non-deterministic and it would be impossible for a tool to show
>>      >>> the user which receive operation satisfied which send operation
>>      >>> without internally using some sort of sequence number for each
>>      >>> send/receive operation. [...]
>>      >>>
>>      >>> My responses were:
>>      >>> 1) the user asked for this behaviour so the tool could simply
>>      >>> gracefully give up the linking function and just state the
>>      >>> information it knows
>>      >
>>      Giving up can only be a temporary solution for tools. The user wants
>>      to use this advanced feature, thus just saying: "Hey, what you're
>>      doing is too sophisticated for us. You are on your own now." is not a
>>      viable long-term strategy.
>>      >>> 2) the tool could hook all send and receive operations and
>>      >>> piggy-back a sequence number into the message header
>>      We discussed piggy-backing within the tools group some time in the
>>      past, but never arrived at a satisfactory way to implement it. If,
>>      in the process of reviving the discussion on a piggy-backing
>>      interface, we come to a viable solution, it would certainly help with
>>      our issues with message matching in general.
>>      Scalasca's problem here is that we need to detect (and partly
>>      recreate) the exact order of message matching to have the correct
>>      message reach the right receivers.
>>      >>> 3) the tool could hook all send and receive operations and
>>      >>> serialise them to prevent overtaking
>>      This is not an option for me. A "performance tool" should strive to
>>      measure as close to the original behavior as possible. Changing
>>      communication semantics just to make a tool "work" would have too
>>      great of an impact on application behavior. After all, if it had
>>      only little impact, why would the user choose this option in the
>>      first place?
>>      Jeff wrote:
>>      >> Remember that one of the use cases of allow_overtaking is
>>      >> applications that have exact matching, in which case
>>      >> allow_overtaking is a way of turning off a feature that isn't
>>      >> used, in order to get a high-performing message queue
>>      >> implementation. In the exact matching case, tools will have no
>>      >> problem matching up sends and recvs.
>>      This is true. If the tools can identify this scenario, it could be
>>      supported by current tools without significant change. However, as
>>      inexact matching is not generally forbidden (right?), it is unclear
>>      how the tools would detect this.
>>      What about an additional info key a user can set in this respect:
>>      exact_matching => true/false
>>      in which the user can state whether it is indeed a scenario of exact
>>      matching or not. The tool could check this, and issue a warning.
>>      >> If tools cannot handle MPI_THREAD_MULTIPLE already, then I
>>      >> don't really care if they can't support this assertion either.
>>      Not handling MPI_THREAD_MULTIPLE generally is not carved in stone. ;-)
>>      As I said, we (Markus and I) see this as a trigger to come to a viable
>>      solution for tools like ours to support either situation.
>>      >> And in any case, such tools can just intercept the info
>>      >> operations and strip this key if they can't support it.
>>      As I wrote above in reply to Dan, stripping options that influence
>>      behavior is not a good option. I, personally, would rather bail out
>>      than (silently) change messaging semantics. I can't say what Markus'
>>      take on this is.
>>      Jim wrote:
>>      > I don't really see any necessary fix to the proposal. We could
>>      > add an advice to users to remind them that they should ensure
>>      > tools are compatible with the info keys. And the reverse advice
>>      > to tools writers that they should check info keys for
>>      > compatibility.
>>      I would second this idea, while emphasizing the burden to be on the
>>      tool to check for this info key (and potentially others) and warn the
>>      user of "undersupport".
>>      Cheers,
>>      Marc-Andre
>>      --
>>      Marc-Andre Hermanns
>>      Jülich Aachen Research Alliance,
>>      High Performance Computing (JARA-HPC)
>>      Jülich Supercomputing Centre (JSC)
>>      Schinkelstrasse 2
>>      52062 Aachen
>>      Germany
>>      Phone: +49 2461 61 2509 | +49 241 80 24381
>>      Fax: +49 2461 80 6 99753
>>      www.jara.org/jara-hpc <http://www.jara.org/jara-hpc>
>>      email: hermanns at jara.rwth-aachen.de
>>      <mailto:hermanns at jara.rwth-aachen.de>
>>      _______________________________________________
>>      mpiwg-p2p mailing list
>>      mpiwg-p2p at lists.mpi-forum.org <mailto:mpiwg-p2p at lists.mpi-forum.org>
>>      http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p
>> -- 
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/

Dan Holmes
Applications Consultant in HPC Research
EPCC, The University of Edinburgh
James Clerk Maxwell Building
The Kings Buildings
Peter Guthrie Tait Road
T: +44(0)131 651 3465
E: dholmes at epcc.ed.ac.uk
