<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
Hi,<br>
<br>
I'll be on a plane on 10/8, so I won't be able to make the call. Here
are my comments (I haven't joined the MPI tool github thingy yet):<br>
<blockquote type="cite">
<ul>
<li>Martin pointed out that "the" may be ambiguous and proposed to
use either "all" or "some" to avoid this ambiguity
<ul>
<li>current definition allows only some of the processes to
have the symbol</li>
<li>in general the new wording should not restrict this</li>
<li>might this be problematic for debuggers?</li>
</ul>
</li>
</ul>
</blockquote>
<br>
I think the intent is:<br>
<ul>
<li>"the" MPI <u>starter</u> process (e.g., mpiexec, orterun, srun,
aprun, etc.), if there is one. MPICH1 doesn't have a separate starter
process and TotalView still supports it, but I doubt there are any
MPICH1 users anymore.</li>
<li>"all" MPI processes at the time the MPIR_DEBUG_SPAWNED event is
raised.</li>
</ul>
Here's why... If not all MPI processes define the symbol within the
same image file (executable or shared library), it could be problematic
for the debugger. TotalView does not currently set MPIR_being_debugged
in the MPI processes (so this will have to change), but it does set
MPIR_debug_gate in all MPI processes that define it. For the debug gate
variable, the TotalView client will look up the symbol in a
representative process from each unique executable in the program (the
"share group"), and broadcast a write request with a "segment plus
offset" relocatable address for each share group. The TotalView servers
attempt to relocate that address in each MPI process, but if the
process does not load the segment containing the variable, the server
skips it.<br>
<br>
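To make the relocation step concrete, here is a rough C sketch. The
struct and function names are made up for illustration and are not
TotalView's actual internals; it only shows the "relocate or skip"
decision a server makes for each process:<br>
<pre wrap="">
```c
/* Hypothetical types for illustration only. */

/* A "segment plus offset" relocatable address, one per share group. */
struct reloc_addr {
    int segment_id;        /* image segment that defines the variable */
    unsigned long offset;  /* offset of the variable inside that segment */
};

/* One segment as loaded in one process. */
struct loaded_segment {
    int segment_id;
    unsigned long base;    /* where the segment landed in this process */
};

/* The segments a given MPI process has loaded right now. */
struct process_image {
    const struct loaded_segment *segments;
    int nsegments;
};

/*
 * Turn a relocatable address into an absolute one for a process.
 * Returns 1 and stores the result in *absolute on success.
 * Returns 0 if the process has not loaded the segment, in which
 * case the caller skips the process and writes nothing.
 */
int relocate(const struct process_image *proc,
             const struct reloc_addr *addr,
             unsigned long *absolute)
{
    int i;
    for (i = 0; i < proc->nsegments; i++) {
        if (proc->segments[i].segment_id == addr->segment_id) {
            *absolute = proc->segments[i].base + addr->offset;
            return 1;
        }
    }
    return 0;
}
```
</pre>
A process that returns 0 here is silently skipped, so nothing is
written to it.<br>
<br>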
So, the problem I can imagine here has to do with the variable being
defined in a shared library and not all processes having the library
loaded at the time the MPIR_DEBUG_SPAWNED event is raised. Also, if the
library defining the symbol is loaded later, the debugger might not
catch that event and set the variable.<br>
<br>
<blockquote type="cite"><br>
<ul>
<li>We need to specify that we set value to 1/0 in the process to
which we attach/from which we detach</li>
<li>Do we allow only 0 and 1 or zero and non-zero values?</li>
</ul>
</blockquote>
<br>
TotalView uses 0 and 1. It seems to me that other non-zero values might
be a problem for the MPI implementation.<br>
<br>
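As a minimal sketch of what I mean (the two helper functions are
hypothetical and not taken from any MPI implementation), compare a test
for "non-zero" with a test for exactly 1:<br>
<pre wrap="">
```c
/* Defined by the MPI implementation, written by the debugger. */
volatile int MPIR_being_debugged = 0;

/* Tolerant test: any non-zero value means a debugger is attached. */
int debugger_attached_nonzero(void)
{
    return MPIR_being_debugged != 0;
}

/*
 * Strict test: only the value 1 counts. If a debugger wrote some
 * other non-zero value, this would report "not being debugged".
 */
int debugger_attached_strict(void)
{
    return MPIR_being_debugged == 1;
}
```
</pre>
If the wording only promises "non-zero", an implementation written like
the strict version would get it wrong, which is an argument for
specifying 0 and 1.<br>
<br>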
<blockquote type="cite">
<ul>
<li>Do we need to be more specific on when the debugger sets the
value back to 0?</li>
</ul>
</blockquote>
<br>
I think we said that it is set to 0 before detaching from the MPI
process. Is that not specific enough? If we are going to kill the job,
there should be no need to set the variable to 0.<br>
<br>
<blockquote type="cite">
<ul>
<li>We are unspecific about what happens to the value between
attach and detach. Do we need to be clearer?</li>
</ul>
</blockquote>
<br>
I think that the MPI implementation is allowed to test the value. I
guess I don't see why it can't also modify its value if it suits its
purposes.<br>
<br>
Cheers, John D.<br>
<br>
<br>
Marc-Andre Hermanns wrote:
<blockquote cite="mid:560CDBB1.6060503@jara.rwth-aachen.de" type="cite">
<pre wrap="">Dear all,
there were several comments and modification requests during our
reading of ticket #484 that will require another reading.
I put up the notes from the reading at:
<a class="moz-txt-link-freetext" href="https://github.com/mpiwg-tools/tools-issues/wiki/Notes-2015-09-24">https://github.com/mpiwg-tools/tools-issues/wiki/Notes-2015-09-24</a>
Kathryn and I would like to discuss this at the next call on Oct 8, 2015.
The most pressing question in advance is that we think about where the
symbol _needs_ to be defined, if at all. The current definition is a
little ambiguous. As the variable is optional, the common
understanding during the discussion was that it does not have to be
available in _every_ process. Does it lead to complications for the
Debuggers (bookkeeping, etc.) if some MPI processes have the symbol
and others do not? Should we rather have an "all or none" semantic?
It would be great, if we could discuss this prior to next week, so we
can finalize the wording on this ticket during the call.
Cheers,
Marc-Andre
</pre>
<pre wrap="">
<hr size="4" width="90%">
_______________________________________________
mpiwg-tools mailing list
<a class="moz-txt-link-abbreviated" href="mailto:mpiwg-tools@lists.mpi-forum.org">mpiwg-tools@lists.mpi-forum.org</a>
<a class="moz-txt-link-freetext" href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-tools">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpiwg-tools</a></pre>
</blockquote>
</body>
</html>