[Mpi3-tools] Current Version of the MPIR document
Dong H. Ahn
ahn1 at llnl.gov
Mon Jun 21 17:17:39 CDT 2010
Hi Martin, John and all,
I really like the idea of publishing this as an official document
through the MPI forum. I've helped put together several machine
RFPs, and I've always hoped to have this kind of document to point
to when describing this required interface.
Here are some questions and comments:
- Section 6.2
I recently worked with a SLURM developer to implement
MPIR_partial_attach_ok support within SLURM and pointed a draft of this
document to him. In regard to "When an implementation uses a
synchronization technique that does not use the MPIR_debug_gate, and
does not require the tool to attach to and continue the MPI process, it
should define the symbol MPIR_partial_attach_ok (§9.13) in the starter
process, and avoid defining MPIR_debug_gate in the MPI processes."
We found that this "avoid defining MPIR_debug_gate in the MPI
processes" is too strict. Resource management software layers like SLURM
would want to support MPIR_partial_attach_ok without having to modify
MPI binaries which may already define MPIR_debug_gate. But "undefining"
an already defined symbol like this within an MPI binary isn't trivial.
Could you consider changing this requirement to "avoid *using*
MPIR_debug_gate in the MPI processes"? That is to say, having the
definition of MPIR_debug_gate within the MPI binary is OK as long as
it's not *used* for the synchronization.
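To illustrate the suggested semantics, here is a small Python sketch
(hypothetical model only; the real MPIR_debug_gate and
MPIR_being_debugged are C globals in the MPI process that the tool
reads and writes through the debug interface) of MPI_Init-time
synchronization in which the gate symbol is defined but simply never
blocked on when gate-based synchronization is not in use:

```python
# Hypothetical Python model of the suggested wording; in reality these
# are C globals in the MPI binary, not Python variables.
MPIR_debug_gate = 0      # defined in the binary, which is harmless
MPIR_being_debugged = 0  # the tool sets this when it attaches

def mpi_init_sync(gate_used: bool) -> str:
    """Model of MPI_Init-time synchronization under the proposal."""
    if not MPIR_being_debugged:
        return "no tool attached; start immediately"
    if not gate_used:
        # MPIR_partial_attach_ok case: the symbol exists, but the
        # implementation's synchronization never blocks on it
        return "released without touching MPIR_debug_gate"
    while MPIR_debug_gate == 0:   # classic gate-based hold
        pass
    return "released via MPIR_debug_gate"
```

Under this reading, an RM such as SLURM could advertise
MPIR_partial_attach_ok without having to strip an already present
MPIR_debug_gate definition out of MPI binaries.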
With respect to this consideration, though, I think it would also
be good to be clearer about the relationship between
MPIR_partial_attach_ok and MPIR_debug_gate described in Section 9.13 as
well. In particular, the requirement,
"The tool may choose to ignore the presence of the
MPIR_partial_attach_ok symbol and acquire all MPI rank processes. The
presence of this symbol does not prevent the tool from using the MPIR
synchronization technique to acquire all of the processes, if it so
chooses, because setting the MPIR_debug_gate variable (if present) ..."
is confusing. Shouldn't setting the MPIR_debug_gate variable be a
NOOP under this condition? If so, how could the tool acquire the
processes using this technique?
- Section 7.1
In this case, a potential race condition can occur because the tool
cannot precisely control when those daemons are launched, especially
under a synchronization scheme like BlueGene's: the job may start
running before the daemons are launched. To minimize this kind of
problem, can we be explicit about the requirement as to when the
daemons should be launched in relation to when the job is loaded and
run? On BlueGene, if the job is loaded and run before the daemons are
launched onto the I/O nodes and have issued (pre)ATTACH commands to
the compute node processes, the daemons won't be able to acquire the
processes before the processes run away from their initial
instructions.
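The ordering concern can be modeled with a short Python sketch
(hypothetical event names; this is not BlueGene's actual control
protocol): acquisition succeeds only if the daemons' (pre)ATTACH
commands precede the step that runs the job:

```python
# Hypothetical model of the launch-ordering race: each step is an
# event, and the compute-node processes are acquirable only if the
# daemons' (pre)ATTACH commands come before the "run job" step.
def processes_acquired(events):
    try:
        attach = events.index("daemons issue (pre)ATTACH")
        run = events.index("run job")
    except ValueError:
        return False          # one of the steps never happened
    return attach < run       # the daemons must win the race

# Safe ordering: daemons are launched and attached before the run.
ok = processes_acquired(
    ["load job", "launch daemons", "daemons issue (pre)ATTACH", "run job"])
# Racy ordering: the processes run away from their initial instructions.
bad = processes_acquired(
    ["load job", "run job", "launch daemons", "daemons issue (pre)ATTACH"])
assert ok and not bad
```

Spelling out this ordering as a requirement in the document would let
implementations rule out the racy interleaving by construction.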
- Section 9.2
I like that this document encourages the MPI implementation to
"share the host and executable name character strings across
multiple process descriptor entries whenever possible." The section
states that this would improve tools' performance at scale. But our
experience has been that it also improves the RMs' scalability: RMs
themselves often consist of distributed components, and bigger
strings mean larger communication overheads within the RMs' internal
communication. That impact gets magnified at extreme scale. So, can
you consider saying something to the effect that "not sharing these
character strings has been a major scalability problem within the
RMs themselves"?
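For what it's worth, the savings can be sketched with a
back-of-envelope Python model (hypothetical structures; the real
table is an array of C MPIR_PROCDESC descriptors whose host_name and
executable_name fields are char* pointers):

```python
# Hypothetical model: character bytes held by the MPIR_proctable,
# with and without sharing the host/executable name strings.
def proctable_string_bytes(hosts, exe, shared):
    if shared:
        # one copy per distinct string; entries point at shared copies
        return sum(len(s) for s in set(hosts) | {exe})
    # a private copy of both strings in every descriptor entry
    return sum(len(h) + len(exe) for h in hosts)

hosts = ["node%03d" % (rank // 16) for rank in range(1024)]  # 16 ranks/node
shared = proctable_string_bytes(hosts, "a.out", shared=True)
unshared = proctable_string_bytes(hosts, "a.out", shared=False)
assert shared < unshared  # sharing shrinks what an RM must store and ship
```

The unshared total grows linearly with the rank count, which is the
quantity an RM's distributed components end up shipping around.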
- Section 9.13
In addition to the requirement that I mentioned in relation to
Section 6.2 above, can you be more explicit about "the tool need only
release the starter process to release the whole MPI job, which can
therefore be run without requiring the tool to acquire all of the MPI
processes included in the MPI job." Perhaps the tool need only
release the starter process from MPIR_Breakpoint?
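As a concrete reading of that sentence, here is a hypothetical
Python sketch of the tool-side decision (invented data structures; a
real tool would drive this through ptrace or a comparable debugger
back end): with MPIR_partial_attach_ok present, releasing the job
means continuing only the starter process out of MPIR_Breakpoint:

```python
# Hypothetical tool-side sketch; a real tool manipulates these
# processes through its debugger back end, not Python dicts.
def release_job(starter, ranks, partial_attach_ok):
    """Return the list of pids the tool had to touch to release the job."""
    touched = [starter["pid"]]
    starter["in_MPIR_Breakpoint"] = False  # continue the starter process
    if not partial_attach_ok:
        # classic full attach: open every rank process's gate
        for r in ranks:
            r["MPIR_debug_gate"] = 1
            touched.append(r["pid"])
    return touched

starter = {"pid": 100, "in_MPIR_Breakpoint": True}
ranks = [{"pid": p, "MPIR_debug_gate": 0} for p in (201, 202)]
assert release_job(starter, ranks, partial_attach_ok=True) == [100]
```

If this is the intended meaning, saying so explicitly in Section
9.13 would resolve the ambiguity.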
On 6/14/2010 9:27 AM, Martin Schulz wrote:
> Hi all,
> Attached is the latest and updated version of the MPIR document, which John
> DelSignore put together. The intent is still to publish this through the MPI forum
> as an official document. The details for this are still tbd. and Jeff will lead a
> discussion on this topic during the forum this week.
> We don't have a tools WG meeting scheduled for this meeting, but if you have
> any comments or feedback (on the document or how we should publish it),
> please post it to the list. If necessary or useful, we can also dedicate one
> of the upcoming tools TelCons for this.
> PS: Feel free to distribute the document further, in particular to tool and
> MPI developers.