[Mpi3-tools] Current Version of the MPIR document

Dong H. Ahn ahn1 at llnl.gov
Mon Jun 21 17:17:39 CDT 2010

Hi Martin, John and all,

I really like the idea of publishing this as an official document 
through the MPI forum. I've helped put together several machine 
RFPs, and I've always hoped to have this kind of document to point 
to when describing this required interface.

Here are some questions and comments:

- Section 6.2
      I recently worked with a SLURM developer to implement 
MPIR_partial_attach_ok support within SLURM and pointed him to a draft 
of this document. My comment is in regard to "When an implementation 
uses a 
synchronization technique that does not use the MPIR_debug_gate, and 
does not require the tool to attach to and continue the MPI process, it 
should define the symbol MPIR_partial_attach_ok (§9.13) in the starter 
process, and avoid defining MPIR_debug_gate in the MPI processes."

      We found that this "avoid defining MPIR_debug_gate in the MPI 
processes" is too strict. Resource management software layers like SLURM 
would want to support MPIR_partial_attach_ok without having to modify 
MPI binaries which may already define MPIR_debug_gate. But "undefining" 
an already defined symbol like this within an MPI binary isn't trivial. 
Could you consider changing this requirement to "avoid *using* 
MPIR_debug_gate in the MPI processes"? That is, having the 
definition of MPIR_debug_gate within MPI is OK as long as it is not 
*used* for the synchronization.
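
      To illustrate the "defined but not used" case, here is a minimal 
sketch of an MPI-side startup hook. Only MPIR_debug_gate comes from the 
document; the sync_mode enum, startup_sync(), and the bounded polling 
loop are hypothetical names and simplifications of mine, not anything 
SLURM or an MPI library actually implements.

```c
#include <assert.h>

/* Defined by the MPI library, as existing MPI binaries already do.
 * A tool may write a nonzero value here after attaching. */
volatile int MPIR_debug_gate = 0;

/* Hypothetical: which startup synchronization this job uses. */
enum sync_mode {
    SYNC_DEBUG_GATE,  /* classic MPIR: wait until the tool sets the gate */
    SYNC_RM_MANAGED   /* resource manager holds and releases the process */
};

/* Hypothetical startup hook. Returns how many times the gate was polled.
 * The loop is bounded by max_polls only so that this sketch terminates;
 * a real implementation would typically sleep between polls. */
int startup_sync(enum sync_mode mode, int max_polls)
{
    int polls = 0;

    assert(max_polls >= 0);

    if (mode == SYNC_RM_MANAGED) {
        /* MPIR_debug_gate is still defined in the binary, but it is
         * never read here, so a tool writing to it has no effect. */
        return 0;
    }

    while (MPIR_debug_gate == 0 && polls < max_polls) {
        polls++;
    }
    return polls;
}
```

      With SYNC_RM_MANAGED the symbol remains visible to the tool, yet 
the process never consults it, which is the behavior the relaxed 
wording would permit.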

      With respect to this consideration, though, I think it would also 
be good to be clearer about the relationship between 
MPIR_partial_attach_ok and MPIR_debug_gate described in Section 9.13 as 
well. In particular, the requirement,

      "The tool may choose to ignore the presence of the 
MPIR_partial_attach_ok symbol and acquire all MPI rank processes. The 
presence of this symbol does not prevent the tool from using the MPIR 
synchronization technique to acquire all of the processes, if it so 
chooses, because setting the MPIR_debug_gate variable (if present) is 

      is confusing. Shouldn't "setting the MPIR_debug_gate variable" be 
a no-op under this condition? If so, how could the tool acquire the 
processes using this technique?

- Section 7.1
      In this case, a potential race condition can occur because the 
tool cannot precisely control when those daemons are launched, 
especially under a synchronization scheme like BlueGene's: the job may 
start running before the daemons are launched. To minimize this kind of problem, 
can we be explicit about the requirement as to when the daemons should 
be launched in relation to when the job is loaded and run? On BlueGene, 
if the job is loaded and run before daemons are launched onto the I/O 
nodes and issue (pre)ATTACH commands to the compute node processes, the 
daemons won't be able to acquire the processes before the processes run 
away from their initial instructions.

- Section 9.2
      I like that this document encourages the MPI implementation to 
"share the host and executable name character strings across
multiple process descriptor entries whenever possible." The section 
states that this would improve tools' performance at scale. But our 
experience has been that it would also improve the RMs' scalability, 
as RMs themselves often consist of distributed components and bigger 
strings mean larger communication overheads within the RM's internal 
communication. The impact gets magnified at extreme scale. So, 
could you consider adding something to the effect that "not sharing 
these character strings has been a major scalability problem within 
the RMs themselves as well"?
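
      As a rough illustration of the string sharing discussed here, the 
sketch below points every entry for one node at the same host and 
executable strings. The MPIR_PROCDESC layout follows the process 
descriptor in the document; fill_node_entries() and 
unique_string_bytes() are hypothetical helpers of mine, and the 
quadratic scan is for brevity only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Process descriptor entry, one per MPI process. */
typedef struct {
    char *host_name;
    char *executable_name;
    int   pid;
} MPIR_PROCDESC;

/* Hypothetical helper: fill n entries for ranks on one node so that
 * they all share the same host and executable strings. */
void fill_node_entries(MPIR_PROCDESC *table, int n,
                       char *host, char *exe, int first_pid)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        table[i].host_name = host;        /* shared, not duplicated */
        table[i].executable_name = exe;   /* shared, not duplicated */
        table[i].pid = first_pid + i;
    }
}

/* Hypothetical helper: bytes of string data that must be read or
 * shipped if each distinct pointer is transferred only once.
 * O(n^2) scan, acceptable for a sketch but not at extreme scale. */
size_t unique_string_bytes(const MPIR_PROCDESC *table, int n)
{
    size_t total = 0;
    for (int i = 0; i < n; i++) {
        int seen_host = 0, seen_exe = 0;
        for (int j = 0; j < i; j++) {
            if (table[j].host_name == table[i].host_name)
                seen_host = 1;
            if (table[j].executable_name == table[i].executable_name)
                seen_exe = 1;
        }
        if (!seen_host)
            total += strlen(table[i].host_name) + 1;
        if (!seen_exe)
            total += strlen(table[i].executable_name) + 1;
    }
    return total;
}
```

      With shared pointers the string volume stays constant per node 
regardless of how many ranks run there, whereas per-entry copies grow 
linearly with the rank count.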

- Section 9.13
      In addition to the requirement that I mentioned in relation to 
Section 6.2 above, can you be more explicit about "the tool need only 
release the starter process to release the whole MPI job, which can 
therefore be run without requiring the tool to acquire all of the MPI 
processes included in the MPI job"? Perhaps the tool need only 
release the starter process from MPIR_Breakpoint?
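
      One possible reading of the tool-side logic, written out as a 
small decision function. This is only my interpretation of the text 
under discussion, not the document's wording; struct starter_info, 
enum release_plan, and plan_release() are hypothetical names.

```c
#include <assert.h>

/* Hypothetical summary of what the tool found in the starter process. */
struct starter_info {
    int partial_attach_ok_defined; /* MPIR_partial_attach_ok symbol found */
    int stopped_at_breakpoint;     /* starter is stopped in MPIR_Breakpoint */
};

/* Hypothetical set of actions the tool could take next. */
enum release_plan {
    PLAN_WAIT_FOR_BREAKPOINT,      /* starter has not reached MPIR_Breakpoint */
    PLAN_CONTINUE_STARTER_ONLY,    /* continue starter; job runs without attach */
    PLAN_ATTACH_ALL_THEN_CONTINUE  /* attach to every rank, then release */
};

/* Decide how to release the job. want_all_ranks is nonzero when the
 * tool chooses to acquire every MPI process even though partial
 * attach would be allowed. */
enum release_plan plan_release(const struct starter_info *s,
                               int want_all_ranks)
{
    assert(s != 0);

    if (!s->stopped_at_breakpoint)
        return PLAN_WAIT_FOR_BREAKPOINT;

    if (s->partial_attach_ok_defined && !want_all_ranks)
        return PLAN_CONTINUE_STARTER_ONLY;

    return PLAN_ATTACH_ALL_THEN_CONTINUE;
}
```

      Under this reading, "release the starter process" means 
continuing it from MPIR_Breakpoint, which is the point I would like the 
document to state explicitly.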


On 6/14/2010 9:27 AM, Martin Schulz wrote:
> Hi all,
> Attached is the latest and updated version of the MPIR document, which John
> DelSignore put together. The intent is still to publish this through the MPI forum
> as an official document. The details for this are still tbd. and Jeff will lead a
> discussion on this topic during the forum this week.
> We don't have a tools WG meeting scheduled for meeting, but if you have
> any comments or feedback (on the document or how we should publish it),
> please post it to the list. If necessary or useful, we can also dedicate one
> of the upcoming tools TelCons for this.
> Thanks!
> Martin
> PS: Feel free to distribute the document further, in particular to tool and
> MPI developers.
