[mpiwg-tools] [mpiwg-p2p] Message matching for tools

Sato, Kento sato5 at llnl.gov
Thu Dec 17 15:48:14 CST 2015


Hi Marc-Andre,

Let me leave my comments on “parallel replay”.
I’ve just subscribed to this list, so if my comments don’t make sense, feel free to just drop this email.


RE:
We analyze on the same scale as the measurement, thus we have one
thread per thread-local trace. Each thread processes its own
thread-local trace. When encountering a communication event, it
re-enacts this communication using the recorded communication
parameters (rank, tag, comm). A send event leads to an issued send, a
receive event leads to an issued receive.

(1) Replaying receive events
Papers about “parallel replay” (or record-and-replay) use (rank, tag, comm) for correct replay of message receive orders.
Unfortunately, (rank, tag, comm) cannot replay message receive orders ** in general **, even under MPI_THREAD_SINGLE (of course, it may work in particular cases).
You need to record (rank, message_id_number), and (tag, comm) actually does not work for this purpose.
The details are described in Section 3.1 of this paper ( http://dl.acm.org/citation.cfm?id=2807642 ).
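
To make the record side concrete, here is a minimal sketch using the standard PMPI profiling interface. The per-destination counter, the trace output, and the tool_attach_id()/tool_extract_id() helpers are purely illustrative (not the paper's actual implementation); how the id travels with the message is exactly the piggybacking question discussed further below.

#include <mpi.h>
#include <stdio.h>

#define TOOL_MAX_RANKS 4096                   /* assumption for the sketch */

static unsigned long send_id[TOOL_MAX_RANKS]; /* per-destination send counter */

/* Hypothetical transport of the id alongside the message payload. */
static void tool_attach_id(unsigned long id) { (void)id; /* stub */ }
static unsigned long tool_extract_id(void)   { return 0;  /* stub */ }

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    tool_attach_id(++send_id[dest]);          /* id of this particular send */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    MPI_Status local;                         /* cope with MPI_STATUS_IGNORE */
    int err = PMPI_Recv(buf, count, type, src, tag, comm, &local);
    /* Record which message was matched: (source rank, sender-side id).
       The replay later re-posts this receive and checks the pair. */
    fprintf(stderr, "recv from %d, id %lu\n",
            local.MPI_SOURCE, tool_extract_id());
    if (status != MPI_STATUS_IGNORE)
        *status = local;
    return err;
}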


(2) Replaying send events
Sorry if you’re already aware of this.
In some applications, the usage of send calls, the destination, and the message payload can change across different runs.
So you also need to record ALL non-deterministic operations that affect MPI send behavior.
One example is the seed value passed to void srand(unsigned int seed) for random numbers.
Time-related functions, such as gettimeofday(), are another example, and so is MPI_Wtime().
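
For example, a record-mode run could capture the seed with a small LD_PRELOAD interposer; this is only a sketch, and the trace file name is made up:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Interpose srand(): log the seed so a replay run can re-inject it,
   then forward to the real libc implementation. */
void srand(unsigned int seed)
{
    static void (*real_srand)(unsigned int) = NULL;
    if (!real_srand)
        real_srand = (void (*)(unsigned int))dlsym(RTLD_NEXT, "srand");

    FILE *log = fopen("srand.trace", "a");    /* hypothetical trace file */
    if (log) {
        fprintf(log, "%u\n", seed);
        fclose(log);
    }
    real_srand(seed);
}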


RE:
a) Sending an additional message within the MPI wrapper at measurement
time may lead to invalid matchings, as the additional message may be
received by a different thread.

Yes, that’s true.


RE:
b) Creating a derived datatype on the fly to add tool-level data to
the original payload may induce a large overhead in practically
_every_ send & receive operation and perturb the measurement.

Yes, if it’s for performance analysis, it will perturb the measurement to some degree.
This paper will help you see whether the piggybacking overhead is acceptable or not:
http://greg.bronevetsky.com/papers/2008EuroPVM.pdf

It evaluates three piggybacking methods: (1) Explicit Pack Operations, (2) Datatypes, and (3) Separate Messages.
But as you pointed out, (3) would not work under MPI_THREAD_MULTIPLE.
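
If it helps, here is a rough sketch of where the per-message cost of option (2) comes from: a struct datatype has to be created, committed, and freed around every send. The function and variable names are only illustrative, not the paper's implementation.

#include <mpi.h>

static unsigned long tool_seqno;              /* tool-level sequence number */

int tool_piggyback_send(const void *buf, int count, MPI_Datatype type,
                        int dest, int tag, MPI_Comm comm)
{
    MPI_Datatype packed;
    int          blocklens[2] = { count, 1 };
    MPI_Datatype types[2]     = { type, MPI_UNSIGNED_LONG };
    MPI_Aint     displs[2];

    tool_seqno++;
    MPI_Get_address(buf, &displs[0]);
    MPI_Get_address(&tool_seqno, &displs[1]);

    /* Absolute addresses, so the send buffer argument is MPI_BOTTOM. */
    MPI_Type_create_struct(2, blocklens, displs, types, &packed);
    MPI_Type_commit(&packed);

    /* PMPI_Send avoids recursion if the tool also wraps MPI_Send. */
    int err = PMPI_Send(MPI_BOTTOM, 1, packed, dest, tag, comm);
    MPI_Type_free(&packed);                   /* per-message type lifecycle */
    return err;   /* the matching receive must construct a mirror type */
}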

Kento

________________________________
Kento Sato | Center for Applied Scientific Computing (CASC) | Lawrence Livermore National Laboratory (LLNL) | http://people.llnl.gov/sato5 |

On Dec 17, 2015, at 6:32 AM, Marc-Andre Hermanns <hermanns at jara.rwth-aachen.de> wrote:

Hi Jeff,

at the moment we don't handle MPI_THREAD_MULTIPLE at all. But we want
to get there ;-)

Here is a short recollection of what we do/need. Apologies to the folks
who already know/have read this in another context:

We use what we call "parallel replay" to analyze large event traces
in parallel. Each thread has its own stream of events, such as enter
and exit for tracking the calling context as well as send and receive
for communication among ranks.

We analyze on the same scale as the measurement, thus we have one
thread per thread-local trace. Each thread processes its own
thread-local trace. When encountering a communication event, it
re-enacts this communication using the recorded communication
parameters (rank, tag, comm). A send event leads to an issued send, a
receive event leads to an issued receive.
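
(Schematically, the replay loop per analysis thread looks roughly like the following; the event and trace types are just illustrative placeholders, not our actual data model.)

#include <mpi.h>

typedef enum { EV_ENTER, EV_EXIT, EV_SEND, EV_RECV } ev_kind_t;

typedef struct {
    ev_kind_t kind;
    int       peer;    /* recorded partner rank  */
    int       tag;     /* recorded tag           */
    MPI_Comm  comm;    /* recorded communicator  */
} event_t;

void replay_thread(const event_t *trace, int nevents)
{
    char dummy = 0;    /* payload content is irrelevant for this sketch */
    for (int i = 0; i < nevents; i++) {
        switch (trace[i].kind) {
        case EV_SEND:  /* a send event leads to an issued send */
            MPI_Send(&dummy, 1, MPI_CHAR, trace[i].peer,
                     trace[i].tag, trace[i].comm);
            break;
        case EV_RECV:  /* a receive event leads to an issued receive */
            MPI_Recv(&dummy, 1, MPI_CHAR, trace[i].peer,
                     trace[i].tag, trace[i].comm, MPI_STATUS_IGNORE);
            break;
        default:       /* enter/exit: only track the calling context */
            break;
        }
    }
}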

It is critical that during the analysis, the message matching is
identical to the original application. However, we do not re-enact any
computational time; that is, the temporal distance between sends and
receives is certainly different from the original application. As a
consequence, while two sends may have some significant temporal
distance in the original measurement, they could be issued right after
each other during the analysis.

Markus Geimer and I believe that creating some sort of a sequence
number during measurement could help match the right messages
during the analysis, as a process could detect that it got a mismatched
message and communicate with the other threads to get the correct one.


It is unclear, however, how to achieve this:

a) Sending an additional message within the MPI wrapper at measurement
time may lead to invalid matchings, as the additional message may be
received by a different thread.

b) Creating a derived datatype on the fly to add tool-level data to
the original payload may induce a large overhead in practically
_every_ send & receive operation and perturb the measurement.

The best idea in the room so far is some callback mechanism into the
MPI implementation that generates matching information on both the
sender and the receiver side, in the form of a sequence number that can
then be saved during measurement. If available on both sender and
receiver, this information could then be used to fix incorrect matchings
during the analysis.
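
To illustrate the kind of interface we have in mind (purely hypothetical; nothing like this exists in MPI today), it could look roughly as follows:

#include <mpi.h>

/* Hypothetical: invoked by the MPI library whenever a send is matched to
   a receive, on both sides, with a library-generated sequence number. */
typedef void (MPIX_match_cb)(MPI_Comm comm, int src_rank, int dst_rank,
                             int tag, unsigned long match_seqno,
                             void *tool_data);

/* Hypothetical registration call; the tool would store match_seqno in
   the corresponding send/receive event record during measurement. */
int MPIX_register_match_callback(MPIX_match_cb *cb, void *tool_data);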

Cheers,
Marc-Andre

On 16.12.15 16:40, Jeff Hammond wrote:
How do you handle MPI_THREAD_MULTIPLE?  Understanding what your tool
does there is a good starting point for this discussion.

Jeff

On Wed, Dec 16, 2015 at 1:37 AM, Marc-Andre Hermanns
<hermanns at jara.rwth-aachen.de>
wrote:

   Hi all,

   CC: Tools-WG, Markus Geimer (not on either list)

   sorry for starting a new thread and being so verbose, but I subscribed
   just now. I quoted Dan, Jeff, and Jim from the archive as appropriate.

   First, let me state that we do not want to prevent this assertion in
   any way. For us as tool providers, it is quite a brain tickler how to
   support this in our tool and in general.

   Dan wrote:
[...] The basic problem is that message matching would be
non-deterministic and it would be impossible for a tool to show
the user which receive operation satisfied which send operation
without internally using some sort of sequence number for each
send/receive operation. [...]

My responses were:
1) the user asked for this behaviour so the tool could simply
gracefully give up the linking function and just state the
information it knows

   Giving up can only be a temporary solution for tools. The user wants
   to use this advanced feature, thus just saying: "Hey, what you're
   doing is too sophisticated for us. You are on your own now." is not a
   viable long-term strategy.

2) the tool could hook all send and receive operations and
piggy-back a sequence number into the message header

   We discussed piggy-backing within the tools group some time in the
   past, but never came to a satisfying way to implement it. If, in the
   process of reviving the discussion on a piggy-backing interface, we
   come to a viable solution, it would certainly help with our issues
   with message matching in general.

   Scalasca's problem here is that we need to detect (and partly
   recreate) the exact order of message matching to have the correct
   message reach the right receivers.

3) the tool could hook all send and receive operations and
serialise them to prevent overtaking

   This is not an option for me. A "performance tool" should strive to
   measure as closely to the original behavior as possible. Changing
   communication semantics just to make a tool "work" would have too
   great an impact on application behavior. After all, if it had only
   little impact, why would the user choose this option in the first
   place?

   Jeff wrote:
Remember that one of the use cases of allow_overtaking is applications
that have exact matching, in which case allow_overtaking is a way of
turning off a feature that isn't used, in order to get a high-performing
message queue implementation. In the exact matching case, tools will
have no problem matching up sends and recvs.

   This is true. If the tools can identify this scenario, it could be
   supported by current tools without significant change. However, as it
   is not generally forbidden to have inexact matching (right?), it is
   unclear how the tools would detect this.

   What about an additional info key a user can set in this respect:

   exact_matching => true/false

   with which the user can state whether it is indeed a scenario of exact
   matching or not. The tool could check this and issue a warning.
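
   For concreteness, the user side could look like this (the key name is
   only the proposal above; the info routines themselves are plain MPI-3):

#include <mpi.h>

void declare_exact_matching(MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "exact_matching", "true");  /* proposed key */
    MPI_Comm_set_info(comm, info);    /* a tool wrapper can intercept this */
    MPI_Info_free(&info);
}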

If tools cannot handle MPI_THREAD_MULTIPLE already, then I don't really
care if they can't support this assertion either.

   Not handling MPI_THREAD_MULTIPLE generally is not carved in stone. ;-)

   As I said, we (Markus and I) see this as a trigger to come to a viable
   solution for tools like ours to support either situation.

And in any case, such tools can just intercept the info operations and
strip this key if they can't support it.

   As I wrote above in reply to Dan, stripping options that influence
   behavior is not a good option. I, personally, would rather bail out
   than (silently) change messaging semantics. I can't say what Markus'
   take on this is.

   Jim wrote:
I don't really see any necessary fix to the proposal. We could add an
advice to users to remind them that they should ensure tools are
compatible with the info keys. And the reverse advice to tools writers
that they should check info keys for compatibility.

   I would second this idea, while emphasizing that the burden is on the
   tool to check for this info key (and potentially others) and warn the
   user of "undersupport".

   Cheers,
   Marc-Andre
   --
   Marc-Andre Hermanns
   Jülich Aachen Research Alliance,
   High Performance Computing (JARA-HPC)
   Jülich Supercomputing Centre (JSC)

   Schinkelstrasse 2
   52062 Aachen
   Germany

   Phone: +49 2461 61 2509 | +49 241 80 24381
   Fax: +49 2461 80 6 99753
   www.jara.org/jara-hpc
   email: hermanns at jara.rwth-aachen.de






--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/


_______________________________________________
mpiwg-p2p mailing list
mpiwg-p2p at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p


--
Marc-Andre Hermanns
Jülich Aachen Research Alliance,
High Performance Computing (JARA-HPC)
Jülich Supercomputing Centre (JSC)

Schinkelstrasse 2
52062 Aachen
Germany

Phone: +49 2461 61 2509 | +49 241 80 24381
Fax: +49 2461 80 6 99753
www.jara.org/jara-hpc
email: hermanns at jara.rwth-aachen.de

_______________________________________________
mpiwg-tools mailing list
mpiwg-tools at lists.mpi-forum.org
http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-tools
