[mpiwg-tools] Reading of ticket #484
Jeff Squyres (jsquyres)
jsquyres at cisco.com
Fri Oct 9 06:21:02 CDT 2015
I'm sorry I had to miss the call yesterday -- it sounds like you talked through these issues.
Note that I filed a pull request last night with the changes that we discussed in Bordeaux:
and its corresponding public issue:
And I see in the minutes from yesterday that Anh is making the changes you discussed yesterday.
Anh: it would be better if you filed a pull request against my repo; then I can pull your changes into the already-existing PR. See https://github.com/mpiwg-tools/mpir/pull/1#issuecomment-146837616.
> On Oct 1, 2015, at 8:58 AM, John DelSignore <John.DelSignore at roguewave.com> wrote:
> I'll be on a plane on 10/8, so I won't be able to make the call. Here are my comments (I haven't joined the MPI tool github thingy yet):
>> • Martin pointed out that "the" may be ambiguous and proposed to use either "all" or "some" to avoid this ambiguity
>> • current definition allows only some of the processes to have the symbol
>> • in general the new wording should not restrict this
>> • might this be problematic for debuggers?
> I think the intent is:
> • "the" MPI starter process (e.g., mpiexec, orterun, srun, aprun, etc.), if there is one. MPICH1 doesn't have a separate starter process and TotalView still supports it, but I doubt there are any MPICH1 users anymore.
> • "all" MPI processes at the time the MPIR_DEBUG_SPAWNED event is raised.
> Here's why... If not all MPI processes define the symbol within the same image file (executable or shared library), it could be problematic for the debugger. TotalView does not currently set MPIR_being_debugged in the MPI processes (so this will have to change), but it does set MPIR_debug_gate in all MPI processes that define it. For the debug gate variable, the TotalView client will look up the symbol in a representative process from each unique executable in the program (the "share group"), and broadcast a write request with a "segment plus offset" relocatable address for each share group. The TotalView servers attempt to relocate that address in each MPI process, but if the process has not loaded the segment containing the variable, the server skips it.
> So, the problem I can imagine here has to do with the variable being defined in a shared library and not all processes having the library loaded at the time the MPIR_DEBUG_SPAWNED event is raised. Also, if the library defining the symbol is loaded later, the debugger might not catch that event and set the variable.
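To make the "segment plus offset" scheme described above concrete, here is a minimal sketch in C. It is not TotalView's implementation: the types loaded_segment, target_process, and reloc_addr and the functions relocate and broadcast_write are hypothetical names invented for illustration, and the "write" is only counted, not performed. It shows why a process that has not loaded the defining segment is silently skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one image segment mapped into a target process. */
typedef struct {
    int       segment_id;  /* illustrative identifier for the segment */
    uintptr_t load_base;   /* address where this process mapped it */
} loaded_segment;

/* Hypothetical model of an MPI process as seen by a debug server. */
typedef struct {
    const loaded_segment *segments;
    size_t                n_segments;
} target_process;

/* A "segment plus offset" address that is not tied to any one process. */
typedef struct {
    int       segment_id;
    uintptr_t offset;
} reloc_addr;

/* Resolve addr inside proc. Returns false when the process has not
 * loaded the segment, in which case *out is left untouched. */
static bool relocate(const target_process *proc, reloc_addr addr,
                     uintptr_t *out)
{
    assert(proc != NULL && out != NULL);
    for (size_t i = 0; i < proc->n_segments; i++) {
        if (proc->segments[i].segment_id == addr.segment_id) {
            *out = proc->segments[i].load_base + addr.offset;
            return true;
        }
    }
    return false;
}

/* Apply one write request to every process in a share group and return
 * how many processes could be reached. Processes without the segment
 * are skipped, which is the case the message above worries about. */
static size_t broadcast_write(const target_process *procs, size_t n,
                              reloc_addr addr)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        uintptr_t abs_addr;
        if (relocate(&procs[i], addr, &abs_addr)) {
            /* A real server would write the variable at abs_addr here. */
            written++;
        }
    }
    return written;
}
```

In this model, a process that loads the defining library only after the broadcast never receives the write, unless the debugger notices the later load and repeats the request.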
>> • We need to specify that we set value to 1/0 in the process to which we attach/from which we detach
>> • Do we allow only 0 and 1 or zero and non-zero values?
> TotalView uses 0 and 1. It seems to me that other non-zero values might be a problem for the MPI implementation.
>> • Do we need to be more specific on when the debugger sets the value back to 0?
> I think we said that it is set to 0 before detaching from the MPI process. Is that not specific enough? If we are going to kill the job, there should be no need to set the variable to 0.
>> • We are unspecific about what happens to the value between attach and detach. Do we need to be clearer?
> I think that the MPI implementation is allowed to test the value. I guess I don't see why it can't also modify its value if it suits its purposes.
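The attach/detach behavior discussed in the answers above can be sketched in C as follows. MPIR_being_debugged and MPIR_debug_gate are the variable names used in this thread; everything else is an assumption made for illustration. In particular, the sim_debugger_* functions are hypothetical stand-ins for writes that a real debugger performs from outside the process, and the MPI-side test for "non-zero" is just one possible reading, since the thread leaves open whether only 0 and 1 are allowed.

```c
#include <assert.h>
#include <stdbool.h>

/* Variables named as in this thread. In a real job the MPI library
 * defines them and the debugger writes them from outside the process. */
volatile int MPIR_being_debugged = 0;
volatile int MPIR_debug_gate = 0;

/* Hypothetical stand-ins for the debugger's remote writes. */
static void sim_debugger_attach(void)
{
    MPIR_being_debugged = 1;   /* set to 1 in the process we attach to */
}

static void sim_debugger_release_gate(void)
{
    MPIR_debug_gate = 1;       /* let the MPI process continue */
}

static void sim_debugger_detach(void)
{
    assert(MPIR_being_debugged == 1);  /* detach follows an attach */
    MPIR_being_debugged = 0;   /* set back to 0 before detaching */
}

/* MPI side: the implementation is allowed to test the value.
 * Assumption: any non-zero value is treated as "attached". */
static bool mpi_is_being_debugged(void)
{
    return MPIR_being_debugged != 0;
}

/* MPI side: poll the gate a bounded number of times so that this
 * sketch cannot hang. Returns true once the gate is open. */
static bool mpi_wait_for_gate(unsigned long max_polls)
{
    for (unsigned long i = 0; i < max_polls; i++) {
        if (MPIR_debug_gate != 0) {
            return true;
        }
    }
    return MPIR_debug_gate != 0;
}
```

If the job is going to be killed rather than detached from, the reset to 0 can be omitted, as noted above.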
> Cheers, John D.
> Marc-Andre Hermanns wrote:
>> Dear all,
>> there were several comments and modification requests during our
>> reading of ticket #484 that will require another reading.
>> I put up the notes from the reading at:
>> Kathryn and I would like to discuss this at the next call on Oct 8, 2015.
>> The most pressing question in advance is that we think about where the
>> symbol _needs_ to be defined, if at all. The current definition is a
>> little ambiguous. As the variable is optional, the common
>> understanding during the discussion was that it does not have to be
>> available in _every_ process. Does it lead to complications for the
>> Debuggers (bookkeeping, etc.) if some MPI processes have the symbol
>> and others do not? Should we rather have an "all or none" semantic?
>> It would be great if we could discuss this prior to next week, so we
>> can finalize the wording on this ticket during the call.
>> mpiwg-tools mailing list
>> mpiwg-tools at lists.mpi-forum.org