[mpiwg-tools] Reading of ticket #484

Marc-Andre Hermanns hermanns at jara.rwth-aachen.de
Fri Oct 2 10:45:17 CDT 2015


Hi John,

> I'll be on a plane on 10/8, so I won't be able to make the call.

Good to know. So let's try to clear everything beforehand.

> Here are my comments (I haven't joined the MPI tool github thingy
> yet):

Once you have a GitHub account, let me know and I will add you to the
respective groups.

>>   * Martin pointed out that "the" may be ambiguous and proposed to
>>     use either "all" or "some" to avoid this ambiguity
>>       o current definition allows only some of the processes to have
>>         the symbol
>>       o in general the new wording should not restrict this
>>       o might this be problematic for debuggers?
>>
> 
> I think the intent is:
> 
>   * "the" MPI _starter_ process (e.g., mpiexec, orterun, srun, aprun,
>     etc.), if there is one. MPICH1 doesn't have a separate starter
>     process and TotalView still supports it, but I doubt there are any
>     MPICH1 users anymore.
>   * "all" MPI processes at the time the MPIR_DEBUG_SPAWNED event is
>     raised.
> 
> Here's why... If not all MPI processes define the symbol within the
> same image file (executable or shared library), it could be
> problematic for the debugger. TotalView does not currently set
> MPIR_being_debugged in the MPI processes (so this will have to
> change), but it does set MPIR_debug_gate in all MPI processes that
> define it. For the debug gate variable the TotalView client will
> look up the symbol in a representative process from each unique
> executable in the program (the "share group"), and broadcast a write
> request with a "segment plus offset" relocatable address for each
> share group. The TotalView servers attempt to relocate that address in
> each MPI process, but if the process does not load the segment
> containing the variable the server skips it.
> 
> So, the problem I can imagine here has to do with the variable being
> defined in a shared library and not all processes having the library
> loaded at the time the MPIR_DEBUG_SPAWNED event is raised. Also, if
> the library defining the symbol is loaded later, the debugger might
> not catch that event and set the variable.

So do I read this correctly that we actually have to be more
restrictive/clearer in the text:

If any MPI process defines it, it should be defined in the same shared
library and should be available with MPIR_debug_gate.

Would this cover your concerns?
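
To make the failure mode concrete, here is a minimal sketch in C of the
broadcast write John describes. It is a toy model under stated
assumptions, not TotalView's implementation: the types and functions
(segment, process, reloc_addr, write_relocated, broadcast_write) are
hypothetical names, and target memory is modeled as a plain int array.
It only illustrates the skip behavior: a process that has not loaded the
segment containing the variable is silently left unset.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SEGMENTS 4

/* Hypothetical: one loaded segment of an image file in one process. */
struct segment {
    int  id;    /* identifies the segment across processes */
    int *base;  /* where this process mapped the segment   */
};

/* Hypothetical: one MPI process as seen by a debugger server. */
struct process {
    struct segment loaded[MAX_SEGMENTS];
    size_t         nloaded;
};

/* Relocatable "segment plus offset" address of a variable. */
struct reloc_addr {
    int    segment_id;
    size_t offset;      /* counted in ints to keep the sketch small */
};

/* Relocate the address in one process and write the value.
 * Returns 1 if written, 0 if the segment is not loaded (skipped). */
static int write_relocated(struct process *p, struct reloc_addr a, int value)
{
    for (size_t i = 0; i < p->nloaded; i++) {
        if (p->loaded[i].id == a.segment_id) {
            p->loaded[i].base[a.offset] = value;
            return 1;
        }
    }
    return 0;
}

/* Broadcast the write to all processes; returns how many were written. */
static size_t broadcast_write(struct process *procs, size_t n,
                              struct reloc_addr a, int value)
{
    assert(procs != NULL || n == 0);
    size_t written = 0;
    for (size_t i = 0; i < n; i++)
        written += (size_t)write_relocated(&procs[i], a, value);
    return written;
}
```

In this model, a process that loads the defining library only after the
broadcast never receives the value unless the debugger notices the later
load and repeats the write.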

>>   * We need to specify that we set value to 1/0 in the process to
>>     which we attach/from which we detach
>>   * Do we allow only 0 and 1 or zero and non-zero values?
> 
> TotalView uses 0 and 1. It seems to me that other non-zero values
> might be a problem for the MPI implementation.

Ok.

>>   * Do we need to be more specific on when the debugger sets the
>>     value back to 0?
> 
> I think we said that it is set to 0 before detaching from the MPI
> process. Is that not specific enough? If we are going to kill the job,
> there should be no need to set the variable to 0.

The question arose because "before detach" may also cover longer
time spans than "right before detach". Maybe the concern was that the
debugger sets the value back to 0, continues to access debug
information before detaching, and that this causes problems.
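
The ordering concern can be sketched in C as follows. This is a toy
model, not a real debugger: struct target and the dbg_* functions are
hypothetical, and the field being_debugged merely stands in for the
MPIR_being_debugged variable in the target process. The counter records
accesses to debug information made while the flag is already 0, which is
the case "right before detach" is meant to exclude.

```c
#include <assert.h>

/* Hypothetical stand-in for one MPI process under debugger control. */
struct target {
    int being_debugged;     /* models MPIR_being_debugged            */
    int attached;           /* 1 while the debugger is attached      */
    int reads_after_reset;  /* debug-info reads made with flag == 0  */
};

/* Attach and announce the debugger to the MPI process. */
static void dbg_attach(struct target *t)
{
    t->attached = 1;
    t->being_debugged = 1;
}

/* Access debug information; note it if the flag was already cleared. */
static void dbg_read_info(struct target *t)
{
    assert(t->attached);
    if (t->being_debugged == 0)
        t->reads_after_reset++;
}

/* Clear the flag without detaching (the loose reading of "before"). */
static void dbg_clear_flag(struct target *t)
{
    assert(t->attached);
    t->being_debugged = 0;
}

/* Clear the flag as the last action, then detach ("right before"). */
static void dbg_detach(struct target *t)
{
    assert(t->attached);
    t->being_debugged = 0;
    t->attached = 0;
}
```

Calling dbg_attach, dbg_read_info, dbg_detach in that order leaves the
counter at zero. Calling dbg_clear_flag and then dbg_read_info before
detaching increments it, which is the situation the stricter wording
would rule out.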

>>   * We are unspecific about what happens to the value between attach
>>     and detach. Do we need to be clearer?
> 
> I think that the MPI implementation is allowed to test the value. I
> guess I don't see why it can't also modify its value if it suits its
> purposes.

In the text we currently only allow reading for the MPI
implementation. At least that is how I interpret the text. The
question was probably intended to clarify: Is the debugger allowed to
change the value at points other than attachment and detachment?

An MPI implementation is also allowed to ignore the variable, right?
So setting this variable does not guarantee anything to the debugger?
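
For illustration, the MPI side under the read-only interpretation might
look like the following C sketch. The assumptions are mine: both
variables are plain int globals written from outside by the debugger,
they live in the same image, and the two helper functions are
hypothetical names, not part of any MPI library.

```c
#include <assert.h>

/* Written by the debugger from outside the process; the library itself
 * never assigns to them in this sketch. Defining both in one place
 * means a debugger that finds one symbol also finds the other. */
volatile int MPIR_being_debugged = 0;
volatile int MPIR_debug_gate = 0;

/* Hypothetical helper: test the flag. Any non-zero value counts as
 * "being debugged", although the debugger is expected to write only
 * 0 or 1. */
static int mpi_debugger_present(void)
{
    return MPIR_being_debugged != 0;
}

/* Hypothetical helper: decide whether to keep extra debug state.
 * An implementation that ignores the flag passes honors_flag == 0,
 * so setting the variable guarantees nothing to the debugger. */
static int mpi_should_keep_debug_state(int honors_flag)
{
    assert(honors_flag == 0 || honors_flag == 1);
    return honors_flag && mpi_debugger_present();
}
```

Whether an implementation may also write the variable is exactly the
open point above; this sketch takes the read-only reading.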

Cheers,
Marc-Andre

-- 
Marc-Andre Hermanns
Jülich Aachen Research Alliance,
High Performance Computing (JARA-HPC)
Jülich Supercomputing Centre (JSC)

Schinkelstrasse 2
52062 Aachen
Germany

Phone: +49 2461 61 2509 | +49 241 80 24897
Fax: +49 2461 80 6 99753
www.jara.org/jara-hpc
email: hermanns at jara.rwth-aachen.de


