<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hi Marc-Andre,<br>
<br>
The mpi_assert_allow_overtaking info key essentially makes *all*
send operations that have identical matching criteria logically
concurrent, and *all* receive operations that have overlapping
matching criteria logically concurrent, irrespective of which thread
issued them. Supporting the info key can therefore be seen as a
special case of handling the full MPI_THREAD_MULTIPLE case.<br>
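To make that concrete, here is a toy model (Python, not MPI code; the function name and encoding are purely illustrative) of how the info key enlarges the set of legal matching orders for sends that all carry identical matching criteria:

```python
from itertools import permutations

def matching_orders(n_sends, allow_overtaking):
    """Orders in which n sends with identical matching criteria
    (communicator, destination, tag) may legally be matched.

    Without mpi_assert_allow_overtaking, MPI's non-overtaking rule
    forces program order; with it, every permutation is legal.
    """
    if allow_overtaking:
        return sorted(permutations(range(n_sends)))
    return [tuple(range(n_sends))]

# Three identical sends: one legal order normally, six with the assertion.
print(len(matching_orders(3, allow_overtaking=False)))  # 1
print(len(matching_orders(3, allow_overtaking=True)))   # 6
```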
<br>
Under the existing MPI Standard, any program that relies on the tool
being able to recover the particular matching order that actually
happened, from the set of possible matching orders that all satisfy
the MPI definitions (for example, by relying on the tool using some
externally imposed mechanism, such as piggy-backed sequence numbers),
is relying on a particular sequentialisation of a race condition
between logically concurrent MPI messages, and such a program is
specifically called out as erroneous.<br>
<br>
From the MPI Forum's point of view, therefore, any behaviour in this
situation is permitted, including setting the data centre on fire, and
our work here is done.<br>
<br>
From a tool's perspective, however, the existence and consequences
of this situation should be discovered and presented to the user so
that they can fix their erroneous program. This could be done by
recognising when multiple send operations are logically concurrent
and when multiple receive operations are logically concurrent. If
two send operations are issued with the same matching criteria but
by different threads then these will be logically concurrent (edit:
unless an MPI synchronisation point exists such that the two cannot
conflict). The same is true of receive operations.<br>
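A sketch of that recognition step (Python pseudocode for the tool's analysis, with hypothetical field names; it deliberately ignores MPI_ANY_SOURCE/MPI_ANY_TAG wildcard overlap and the synchronisation-point caveat above):

```python
def logically_concurrent_pairs(ops):
    """Flag pairs of operations (all sends, or all receives) issued by
    different threads of one process with the same matching criteria
    (communicator, peer rank, tag)."""
    pairs = []
    for i, a in enumerate(ops):
        for j in range(i + 1, len(ops)):
            b = ops[j]
            if (a["thread"] != b["thread"]
                    and a["comm"] == b["comm"]
                    and a["peer"] == b["peer"]
                    and a["tag"] == b["tag"]):
                pairs.append((i, j))
    return pairs

sends = [
    {"thread": 0, "comm": "comm_world", "peer": 1, "tag": 42},
    {"thread": 1, "comm": "comm_world", "peer": 1, "tag": 42},  # concurrent with the first
    {"thread": 0, "comm": "comm_world", "peer": 1, "tag": 7},   # different tag: not flagged
]
print(logically_concurrent_pairs(sends))  # [(0, 1)]
```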
<br>
The existence of this situation could be recorded during measurement,
at the cost of searching the OS process's entire trace so far (edit:
back to the last MPI synchronisation point) each time a new send or
receive is issued. Alternatively, it could be discovered by examining
the trace during post-processing or analysis.<br>
<br>
The consequences of this situation could be presented to the user by
showing all possible matching orders, as indicated by the
information in the trace concerning which messages were logically
concurrent during the actual run of the application. The tool would
not be able to tell the user which matching order actually occurred
during the measurement run but it would be able to identify that
there was a race condition, display all the possible outcomes, and
simulate the effects of each possible route through the program.<br>
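Enumerating those possible outcomes is straightforward once the trace records which operations were logically concurrent. A minimal sketch (Python; it assumes, for simplicity, that all the sends and receives involved share identical matching criteria):

```python
from itertools import permutations

def enumerate_matchings(send_labels, recv_labels):
    """Every legal pairing of logically concurrent sends with logically
    concurrent receives that share the same matching criteria; each
    pairing is one possible route through the program."""
    assert len(send_labels) == len(recv_labels)
    return [list(zip(recv_labels, order)) for order in permutations(send_labels)]

outcomes = enumerate_matchings(["s0", "s1"], ["r0", "r1"])
print(len(outcomes))  # 2: r0 gets s0 or s1, and r1 gets the other
```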
<br>
Does this make sense and is this sufficient?<br>
<br>
Cheers,<br>
Dan.<br>
<br>
<div class="moz-cite-prefix">On 17/12/2015 14:32, Marc-Andre
Hermanns wrote:<br>
</div>
<blockquote cite="mid:5672C792.60606@jara.rwth-aachen.de"
type="cite">
<pre wrap="">Hi Jeff,
at the moment we don't handle MPI_THREAD_MULTIPLE at all. But we want
to get there ;-)
Here is a short recollection of what we do/need. Sorry for the folks
who know/read this already in another context:
We use, what we call "parallel replay" to analyze large event traces
in parallel. Each thread has its own stream of events, such as enter
and exit for tracking the calling context as well as send and receive
for communication among ranks.
We analyze on the same scale as the measurement, thus we have one
thread per thread-local trace. Each thread processes its own
thread-local trace. When encountering a communication event, it
re-enacts this communication using the recorded communication
parameters (rank, tag, comm). A send event leads to an issued send, a
receive event leads to an issued receive.
It is critical that during the analysis, the message matching is
identical to the original application. However, we do not re-enact any
computational time, that is the temporal distance between sends and
receives is certainly different from the original application. As a
consequence, while two sends may have some significant temporal
distance in the original measurement, they could be issued right after
each other during the analysis.
Markus Geimer and I believe that creating some sort of a sequence
number during measurement could help matching the right messages
during the analysis, as a process could detect that it got mismatched
messages and communicate with other threads to get the correct one.
It is unclear, however, how to achieve this:
a) Sending an additional message within the MPI wrapper at measurement
time may lead to invalid matchings, as the additional message may be
received by a different thread.
b) Creating a derived datatype on the fly to add tool-level data to
the original payload may induce a large overhead in practically
_every_ send & receive operation and perturb the measurement.
The best idea in the room so far is some callback mechanism into the
MPI implementation that generates matching information both on sender
and receiver side to generate some form of a sequence number that can
then be saved during measurement. If available on both sender and
receiver this information could then be used to fix incorrect matching
during the analysis.
Cheers,
Marc-Andre
On 16.12.15 16:40, Jeff Hammond wrote:
</pre>
<blockquote type="cite">
<pre wrap="">How do you handle MPI_THREAD_MULTIPLE? Understanding what your tool
does there is a good starting point for this discussion.
Jeff
On Wed, Dec 16, 2015 at 1:37 AM, Marc-Andre Hermanns
<<a class="moz-txt-link-abbreviated" href="mailto:hermanns@jara.rwth-aachen.de">hermanns@jara.rwth-aachen.de</a>>
wrote:
Hi all,
CC: Tools-WG, Markus Geimer (not on either list)
sorry for starting a new thread and being so verbose, but I subscribed
just now. I quoted Dan, Jeff, and Jim from the archive as appropriate.
First, let me state that we do not want to prevent this assertion in
any way. For us as a tools provider it is just quite a brain tickler
how to support this in our tool and in general.
Dan wrote:
>>> [...] The basic problem is that message matching would be
>>> non-deterministic and it would be impossible for a tool to show
>>> the user which receive operation satisfied which send operation
>>> without internally using some sort of sequence number for each
>>> send/receive operation. [...]
>>>
>>> My responses were:
>>> 1) the user asked for this behaviour so the tool could simply
>>> gracefully give up the linking function and just state the
>>> information it knows
>
Giving up can only be a temporary solution for tools. The user wants
to use this advanced feature, thus just saying: "Hey, what you're
doing is too sophisticated for us. You are on your own now." is not a
viable long-term strategy.
>>> 2) the tool could hook all send and receive operations and
>>> piggy-back a sequence number into the message header
We discussed piggy-backing within the tools group some time in the
past, but never came to a satisfying way of how to implement this. If,
in the process of reviving the discussion on a piggy-backing
interface, we come to a viable solution, it would certainly help with
our issues with message matching in general.
Scalasca's problem here is that we need to detect (and partly
recreate) the exact order of message matching to have the correct
message reach the right receivers.
>>> 3) the tool could hook all send and receive operations and
>>> serialise them to prevent overtaking
This is not an option for me. A "performance tool" should strive to
measure as close to the original behavior as possible. Changing
communication semantics just to make a tool "work" would have too
great of an impact on application behavior. After all, if it had
only little impact, why would the user choose this option in the
first place?
Jeff wrote:
>> Remember that one of the use cases of allow_overtaking is
applications that
>> have exact matching, in which case allow_overtaking is a way of
turning off
>> a feature that isn't used, in order to get a high-performing
message queue
>> implementation. In the exact matching case, tools will have no
problem
>> matching up sends and recvs.
This is true. If the tools can identify this scenario, it could be
supported by current tools without significant change. However, as it
is not generally forbidden to have inexact matching (right?), it is
unclear how the tools would detect this.
What about an additional info key a user can set in this respect:
exact_matching => true/false
in which the user can state whether it is indeed a scenario of exact
matching or not. The tool could check this, and issue a warning.
>> If tools cannot handle MPI_THREAD_MULTIPLE already, then I
don't really
>> care if they can't support this assertion either.
Not handling MPI_THREAD_MULTIPLE generally is not carved in stone. ;-)
As I said, we (Markus and I) see this as a trigger to come to a viable
solution for tools like ours to support either situation.
>> And in any case, such tools can just intercept the info
operations and
>> strip this key if they can't support it.
As I wrote above in reply to Dan, stripping options that influence
behavior is not a good option. I, personally, would rather bail out
than (silently) change messaging semantics. I can't say what Markus'
take on this is.
Jim wrote:
> I don't really see any necessary fix to the proposal. We could
add an
> advice to users to remind them that they should ensure tools are
compatible
> with the info keys. And the reverse advice to tools writers that
they
> should check info keys for compatibility.
I would second this idea, while emphasizing the burden to be on the
tool to check for this info key (and potentially others) and warn the
user of "undersupport".
Cheers,
Marc-Andre
--
Marc-Andre Hermanns
Jülich Aachen Research Alliance,
High Performance Computing (JARA-HPC)
Jülich Supercomputing Centre (JSC)
Schinkelstrasse 2
52062 Aachen
Germany
Phone: +49 2461 61 2509 | +49 241 80 24381
Fax: +49 2461 80 6 99753
<a class="moz-txt-link-abbreviated" href="http://www.jara.org/jara-hpc">www.jara.org/jara-hpc</a>
email: <a class="moz-txt-link-abbreviated" href="mailto:hermanns@jara.rwth-aachen.de">hermanns@jara.rwth-aachen.de</a>
_______________________________________________
mpiwg-p2p mailing list
<a class="moz-txt-link-abbreviated" href="mailto:mpiwg-p2p@lists.mpi-forum.org">mpiwg-p2p@lists.mpi-forum.org</a>
<a class="moz-txt-link-freetext" href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p</a>
--
Jeff Hammond
<a class="moz-txt-link-abbreviated" href="mailto:jeff.science@gmail.com">jeff.science@gmail.com</a>
<a class="moz-txt-link-freetext" href="http://jeffhammond.github.io/">http://jeffhammond.github.io/</a>
</pre>
</blockquote>
<pre wrap="">
</pre>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
mpiwg-p2p mailing list
<a class="moz-txt-link-abbreviated" href="mailto:mpiwg-p2p@lists.mpi-forum.org">mpiwg-p2p@lists.mpi-forum.org</a>
<a class="moz-txt-link-freetext" href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-p2p</a></pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Dan Holmes
Applications Consultant in HPC Research
EPCC, The University of Edinburgh
James Clerk Maxwell Building
The Kings Buildings
Peter Guthrie Tait Road
Edinburgh
EH9 3FD
T: +44(0)131 651 3465
E: <a class="moz-txt-link-abbreviated" href="mailto:dholmes@epcc.ed.ac.uk">dholmes@epcc.ed.ac.uk</a>
*Please consider the environment before printing this email.*</pre>
</body>
</html>