<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">

Hi Marc-Andre,

<div><br>

</div>

<div>Let me add a few comments on “parallel replay”. </div>

<div>I've just subscribed to this list, so if my comments do not make sense, please just ignore this email.</div>

<div><br>

</div>

<div><br>

</div>

<div>RE:</div>

<div>

<blockquote type="cite">We analyze on the same scale as the measurement, thus we have one<br>

thread per thread-local trace. Each thread processes its own<br>

thread-local trace. When encountering a communication event, it<br>

re-enacts this communication using the recorded communication<br>

parameters (rank, tag, comm). A send event leads to an issued send, a<br>

receive event leads to an issued receive.<br>

</blockquote>

<div><br>

</div>

<div>(1) Replaying receive events</div>

<div>Papers about “parallel replay” (or “record-and-replay”) use (rank, tag, comm) for correct replay of message receive orders.</div>

<div>Unfortunately, (rank, tag, comm) cannot, in general, replay message receive orders, even with MPI_THREAD_SINGLE (of course, it may work in particular cases).</div>

<div>You need to record (rank, message_id_number) instead; (tag, comm) does not work for this purpose.</div>

<div>The details are described in Section 3.1 of this paper ( <a href="http://dl.acm.org/citation.cfm?id=2807642">http://dl.acm.org/citation.cfm?id=2807642</a> ). </div>
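<div>As a rough illustration of the idea (my own minimal sketch, not the method from the paper): suppose each receive is logged as (rank, message_id_number), where I assume the message id is simply a per-sender counter. The function names and the (source, id, payload) tuples below are hypothetical. During replay, messages that arrive earlier than recorded are buffered until their turn comes.</div>

<pre>
```python
from collections import deque


def record_receive_order(arrivals):
    """Record phase: log (source_rank, message_id) for each matched receive."""
    return [(src, msg_id) for (src, msg_id, _payload) in arrivals]


def replay_receive_order(recorded, arrivals):
    """Replay phase: deliver messages in the recorded order, buffering early ones."""
    pending = {}  # maps (source_rank, message_id) to a payload that arrived too early
    incoming = deque(arrivals)
    delivered = []
    for key in recorded:
        # Drain arrivals until the message the record asks for is available.
        # (A missing message raises IndexError here; a real tool would block.)
        while key not in pending:
            src, msg_id, payload = incoming.popleft()
            pending[(src, msg_id)] = payload
        delivered.append((key, pending.pop(key)))
    return delivered


# Original run: rank 1's first message was matched before rank 2's message.
original = [(1, 0, "a"), (2, 0, "b"), (1, 1, "c")]
# Replay run: timing differs, so rank 2's message arrives first.
replay_arrivals = [(2, 0, "b"), (1, 0, "a"), (1, 1, "c")]

recorded = record_receive_order(original)
delivered = replay_receive_order(recorded, replay_arrivals)
```
</pre>

<div>In this toy run the replay delivers the payloads in the recorded order even though they arrive in a different order.</div>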

<div><br>

</div>

<div><br>

</div>

<div>(2) Replaying send events</div>

<div>Sorry if you’re already aware of this.</div>

<div>In some applications, the use of send calls, the destination, and the message payload can change across different runs.</div>

<div>So you also need to record ALL non-deterministic operations that affect MPI send behavior.</div>

<div>One example is the seed value passed to <b style="widows: 1;">void srand(unsigned int</b><span style="widows: 1;"> </span><i style="widows: 1;">seed</i><b style="widows: 1;">)</b> for random numbers.</div>

<div>Time-related functions such as gettimeofday(), and therefore MPI_Wtime(), are another example.</div>
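<div>To make this concrete, here is a minimal sketch (my own illustration; the class NondetLog and the function choose_destination are hypothetical names). It logs every non-deterministic value in the recorded run and returns the logged values in the replay run, so that a send destination derived from a time-based seed comes out the same both times.</div>

<pre>
```python
import random
import time


class NondetLog:
    """Records non-deterministic values in one run and returns them again in replay."""

    def __init__(self, recorded=None):
        # recorded is None in record mode, or a list of logged values in replay mode.
        self.replaying = recorded is not None
        self.values = list(recorded) if self.replaying else []
        self.cursor = 0

    def capture(self, produce):
        """Return produce() in record mode, or the next logged value in replay mode."""
        if self.replaying:
            value = self.values[self.cursor]
            self.cursor += 1
            return value
        value = produce()
        self.values.append(value)
        return value


def choose_destination(log, num_ranks):
    """Hypothetical application step: pick a send destination from a time-based seed."""
    seed = log.capture(lambda: int(time.time() * 1e6))
    rng = random.Random(seed)
    return rng.randrange(num_ranks)


# Record run: the seed comes from the clock and is written to the log.
record_log = NondetLog()
recorded_destination = choose_destination(record_log, 8)

# Replay run: the seed comes from the log, so the destination is reproduced.
replay_log = NondetLog(record_log.values)
replayed_destination = choose_destination(replay_log, 8)
```
</pre>

<div>The same wrapper would have to sit in front of every such source (seeds, clocks, and so on), which is why I say ALL of them need to be recorded.</div>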

<div><br>

</div>

<div><br>

</div>

RE:<br>

<blockquote type="cite">a) Sending an additional message within the MPI wrapper at measurement<br>

time may lead to invalid matchings, as the additional message may be<br>

received by a different thread.<br>

</blockquote>

<div><br>

</div>

<div>Yes, that’s true.</div>

<div><br>

</div>

<br>

RE:<br>

<blockquote type="cite">b) Creating a derived datatype on the fly to add tool-level data to<br>

the original payload may induce a large overhead in practically<br>

_every_ send & receive operation and perturb the measurement.<br>

</blockquote>

<div><br>

</div>

<div>Yes, if it’s for performance analysis, it will perturb the measurement to some extent.</div>

<div>This paper may help you judge whether the piggybacking overhead is acceptable:</div>

<div> <a href="http://greg.bronevetsky.com/papers/2008EuroPVM.pdf">http://greg.bronevetsky.com/papers/2008EuroPVM.pdf</a></div>

<div><br>

</div>

<div>It evaluates three piggybacking methods: (1) Explicit Pack Operations, (2) Datatypes and (3) Separate Messages.</div>

<div>But, as you pointed out, (3) would not work for MPI_THREAD_MULTIPLE.</div>
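<div>For reference, the idea behind method (1) can be sketched in a few lines. This is only an illustration on plain byte strings with hypothetical helper names, not the implementation evaluated in the paper; in an MPI wrapper the copy would presumably go through MPI_Pack and MPI_Unpack instead. The tool-level sequence number is copied in front of the payload on the sender side and split off again on the receiver side.</div>

<pre>
```python
import struct

# One unsigned 64-bit sequence number in network byte order (an assumed header layout).
HEADER = struct.Struct("!Q")


def pack_with_sequence(seq, payload):
    """Sender side: copy the sequence number and the payload into one buffer."""
    return HEADER.pack(seq) + payload


def unpack_with_sequence(buffer):
    """Receiver side: split the buffer back into (sequence number, payload)."""
    (seq,) = HEADER.unpack_from(buffer)
    return seq, buffer[HEADER.size:]


wire = pack_with_sequence(7, b"hello")
seq, payload = unpack_with_sequence(wire)
```
</pre>

<div>The extra copy of the payload on every send and receive is the cost of this method, which is the overhead the paper measures against the other two.</div>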

<div><br>

</div>

<div>Kento</div>

</div>

<div>

<div align="left" style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">

<span lang="EN-US" style="font-size: 10pt; color: rgb(31, 73, 125);"><br class="Apple-interchange-newline">

<hr width="100%" align="left" size="2">

</span></div>

<b style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; color: rgb(51, 102, 255);">Kento

 Sato</b><span style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; display: inline !important; float: none;"> </span><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; color: rgb(102, 102, 102);">|

 Center for Applied Scientific Computing (CASC) | Lawrence Livermore National Laboratory (LLNL) | </span><a href="http://people.llnl.gov/sato5" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">http://people.llnl.gov/sato5</a><span style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; display: inline !important; float: none;"> </span><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; color: rgb(102, 102, 102);">|</span>

</div>

<br>

<div>

<div>On Dec 17, 2015, at 6:32 AM, Marc-Andre Hermanns <<a href="mailto:hermanns@jara.rwth-aachen.de">hermanns@jara.rwth-aachen.de</a>> wrote:</div>

<br class="Apple-interchange-newline">

<blockquote type="cite">Hi Jeff,<br>

<br>

at the moment we don't handle MPI_THREAD_MULTIPLE at all. But we want<br>

to get there ;-)<br>

<br>

Here is a short recollection of what we do/need. Sorry for the folks<br>

who know/read this already in other context:<br>

<br>

We use, what we call "parallel replay" to analyze large event traces<br>

in parallel. Each thread has its own stream of events, such as enter<br>

and exit for tracking the calling context as well as send and receive<br>

for communication among ranks.<br>

<br>

We analyze on the same scale as the measurement, thus we have one<br>

thread per thread-local trace. Each thread processes its own<br>

thread-local trace. When encountering a communication event, it<br>

re-enacts this communication using the recorded communication<br>

parameters (rank, tag, comm). A send event leads to an issued send, a<br>

receive event leads to an issued receive.<br>

<br>

It is critical that during the analysis, the message matching is<br>

identical to the original application. However, we do not re-enact any<br>

computational time, that is the temporal distance between sends and<br>

receives is certainly different from the original application. As a<br>

consequence, while two sends may have some significant temporal<br>

distance in the original measurement, they could be issued right after<br>

each other during the analysis.<br>

<br>

Markus Geimer and I believe that creating some sort of a sequence<br>

number during measurement could help matching the right messages<br>

during the analysis, as a process could detect that it got mismatched<br>

messages and communicate with other threads to get the correct one.<br>

<br>

<br>

It is unclear, however, how to achieve this:<br>

<br>

a) Sending an additional message within the MPI wrapper at measurement<br>

time may lead to invalid matchings, as the additional message may be<br>

received by a different thread.<br>

<br>

b) Creating a derived datatype on the fly to add tool-level data to<br>

the original payload may induce a large overhead in practically<br>

_every_ send & receive operation and perturb the measurement.<br>

<br>

The best idea in the room so far is some callback mechanism into the<br>

MPI implementation that generates matching information both on sender<br>

and receiver side to generate some form of a sequence number that can<br>

then be saved during measurement. If available on both sender and<br>

receiver this information could then be used to fix incorrect matching<br>

during the analysis.<br>

<br>

Cheers,<br>

Marc-Andre<br>

<br>

On 16.12.15 16:40, Jeff Hammond wrote:<br>

<blockquote type="cite">How do you handle MPI_THREAD_MULTIPLE?  Understanding what your tool<br>

does there is a good starting point for this discussion.<br>

<br>

Jeff<br>

<br>

On Wed, Dec 16, 2015 at 1:37 AM, Marc-Andre Hermanns<br>

<<a href="mailto:hermanns@jara.rwth-aachen.de">hermanns@jara.rwth-aachen.de</a> <<a href="mailto:hermanns@jara.rwth-aachen.de">mailto:hermanns@jara.rwth-aachen.de</a>>><br>

wrote:<br>

<br>

   Hi all,<br>

<br>

   CC: Tools-WG, Markus Geimer (not on either list)<br>

<br>

   sorry for starting a new thread and being so verbose, but I subscribed<br>

   just now. I quoted Dan, Jeff, and Jim from the archive as appropriate.<br>

<br>

   First, let me state that we do not want to prevent this assertion in<br>

   any way. For us as tools provider it is just quite a brain tickler on<br>

   how to support this in our tool and in general.<br>

<br>

   Dan wrote:<br>

<blockquote type="cite">

<blockquote type="cite">

<blockquote type="cite">[...] The basic problem is that message matching would be<br>

non-deterministic and it would be impossible for a tool to show<br>

the user which receive operation satisfied which send operation<br>

without internally using some sort of sequence number for each<br>

send/receive operation. [...]<br>

<br>

My responses were:<br>

1) the user asked for this behaviour so the tool could simply<br>

gracefully give up the linking function and just state the<br>

information it knows<br>

</blockquote>

</blockquote>

<br>

</blockquote>

   Giving up can only be a temporary solution for tools. The user wants<br>

   to use this advanced feature, thus just saying: "Hey, what you're<br>

   doing is too sophisticated for us. You are on your own now." is not a<br>

   viable long-term strategy.<br>

<br>

<blockquote type="cite">

<blockquote type="cite">

<blockquote type="cite">2) the tool could hook all send and receive operations and<br>

piggy-back a sequence number into the message header<br>

</blockquote>

</blockquote>

</blockquote>

<br>

   We discussed piggy-backing within the tools group some time in the<br>

   past, but never came to a satisfying way of how to implement this. If,<br>

   in the process of reviving the discussion on a piggy-backing<br>

   interface, we come to a viable solution, it would certainly help with<br>

   our issues with message matching in general.<br>

<br>

   Scalasca's problem here is that we need to detect (and partly<br>

   recreate) the exact order of message matching to have the correct<br>

   message reach the right receivers.<br>

<br>

<blockquote type="cite">

<blockquote type="cite">

<blockquote type="cite">3) the tool could hook all send and receive operations and<br>

serialise them to prevent overtaking<br>

</blockquote>

</blockquote>

</blockquote>

<br>

   This is not an option for me. A "performance tool" should strive to<br>

   measure as close to the original behavior as possible. Changing<br>

   communication semantics just to make a tool "work" would have too<br>

   great of an impact on application behavior. After all, if it would<br>

   have only little impact, why should the user choose this option in the<br>

   first place.<br>

<br>

   Jeff wrote:<br>

<blockquote type="cite">

<blockquote type="cite">Remember that one of the use cases of allow_overtaking is<br>

</blockquote>

</blockquote>

   applications that<br>

<blockquote type="cite">

<blockquote type="cite">have exact matching, in which case allow_overtaking is a way of<br>

</blockquote>

</blockquote>

   turning off<br>

<blockquote type="cite">

<blockquote type="cite">a feature that isn't used, in order to get a high-performing<br>

</blockquote>

</blockquote>

   message queue<br>

<blockquote type="cite">

<blockquote type="cite">implementation. In the exact matching case, tools will have no<br>

</blockquote>

</blockquote>

   problem<br>

<blockquote type="cite">

<blockquote type="cite">matching up sends and recvs.<br>

</blockquote>

</blockquote>

<br>

   This is true. If the tools can identify this scenario, it could be<br>

   supported by current tools without significant change. However, as it<br>

   is not generally forbidden to have inexact matching (right?), it is<br>

   unclear on how the tools would detect this.<br>

<br>

   What about an additional info key a user can set in this respect:<br>

<br>

   exact_matching => true/false<br>

<br>

   in which the user can state whether it is indeed a scenario of exact<br>

   matching or not. The tool could check this, and issue a warning.<br>

<br>

<blockquote type="cite">

<blockquote type="cite">If tools cannot handle MPI_THREAD_MULTIPLE already, then I<br>

</blockquote>

</blockquote>

   don't really<br>

<blockquote type="cite">

<blockquote type="cite">care if they can't support this assertion either.<br>

</blockquote>

</blockquote>

<br>

   Not handling MPI_THREAD_MULTIPLE generally is not carved in stone. ;-)<br>

<br>

   As I said, we (Markus and I) see this as a trigger to come to a viable<br>

   solution for tools like ours to support either situation.<br>

<br>

<blockquote type="cite">

<blockquote type="cite">And in any case, such tools can just intercept the info<br>

</blockquote>

</blockquote>

   operations and<br>

<blockquote type="cite">

<blockquote type="cite">strip this key if they can't support it.<br>

</blockquote>

</blockquote>

<br>

   As I wrote above in reply to Dan, stripping options that influence<br>

   behavior is not a good option. I, personally, would rather bail out<br>

   than (silently) change messaging semantics. I can't say what Markus'<br>

   take on this is.<br>

<br>

   Jim wrote:<br>

<blockquote type="cite">I don't really see any necessary fix to the proposal. We could<br>

</blockquote>

   add an<br>

<blockquote type="cite">advice to users to remind them that they should ensure tools are<br>

</blockquote>

   compatible<br>

<blockquote type="cite">with the info keys. And the reverse advice to tools writers that<br>

</blockquote>

   they<br>

<blockquote type="cite">should check info keys for compatibility.<br>

</blockquote>

<br>

   I would second this idea, while emphasizing the burden to be on the<br>

   tool to check for this info key (and potentially others) and warn the<br>

   user of "undersupport".<br>

<br>

   Cheers,<br>

   Marc-Andre<br>

   --<br>

   Marc-Andre Hermanns<br>

   Jülich Aachen Research Alliance,<br>

   High Performance Computing (JARA-HPC)<br>

   Jülich Supercomputing Centre (JSC)<br>

<br>

   Schinkelstrasse 2<br>

   52062 Aachen<br>

   Germany<br>

<br>

   Phone: +49 2461 61 2509 | +49 241 80 24381<br>

   Fax: +49 2461 80 6 99753<br>

   <a href="http://www.jara.org/jara-hpc">www.jara.org/jara-hpc</a> <<a href="http://www.jara.org/jara-hpc">http://www.jara.org/jara-hpc</a>><br>

   email: <a href="mailto:hermanns@jara.rwth-aachen.de">hermanns@jara.rwth-aachen.de</a><br>

   <<a href="mailto:hermanns@jara.rwth-aachen.de">mailto:hermanns@jara.rwth-aachen.de</a>><br>

<br>

<br>

   _______________________________________________<br>

   mpiwg-p2p mailing list<br>

   <a href="mailto:mpiwg-p2p@lists.mpi-forum.org">mpiwg-p2p@lists.mpi-forum.org</a> <<a href="mailto:mpiwg-p2p@lists.mpi-forum.org">mailto:mpiwg-p2p@lists.mpi-forum.org</a>><br>

   <a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p</a><br>

<br>

<br>

<br>

<br>

-- <br>

Jeff Hammond<br>

<a href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a> <<a href="mailto:jeff.science@gmail.com">mailto:jeff.science@gmail.com</a>><br>

<a href="http://jeffhammond.github.io/">http://jeffhammond.github.io/</a><br>

<br>

<br>

_______________________________________________<br>

mpiwg-p2p mailing list<br>

<a href="mailto:mpiwg-p2p@lists.mpi-forum.org">mpiwg-p2p@lists.mpi-forum.org</a><br>

http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p<br>

<br>

</blockquote>

<br>

-- <br>

Marc-Andre Hermanns<br>

Jülich Aachen Research Alliance,<br>

High Performance Computing (JARA-HPC)<br>

Jülich Supercomputing Centre (JSC)<br>

<br>

Schinkelstrasse 2<br>

52062 Aachen<br>

Germany<br>

<br>

Phone: +49 2461 61 2509 | +49 241 80 24381<br>

Fax: +49 2461 80 6 99753<br>

<a href="http://www.jara.org/jara-hpc">www.jara.org/jara-hpc</a><br>

email: <a href="mailto:hermanns@jara.rwth-aachen.de">hermanns@jara.rwth-aachen.de</a><br>

<br>

_______________________________________________<br>

mpiwg-tools mailing list<br>

<a href="mailto:mpiwg-tools@lists.mpi-forum.org">mpiwg-tools@lists.mpi-forum.org</a><br>

http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-tools</blockquote>

</div>

<br>

</body>

</html>