[Mpi3-tools] Current Version of the MPIR document

Wed Jun 30 07:56:39 CDT 2010

Hi Dong,

My comments in-line below...

Dong H. Ahn wrote:
> Hi Martin, John and all,
> 
> I really like the idea of publishing this as an official document 
> through the MPI forum. I've helped put together several machine 
> RFPs, and I've always hoped to have this kind of document to point 
> to when describing this required interface.
> 
> Here are some questions and comments:
> 
> - Section 6.2
>       I recently worked with a SLURM developer to implement 
> MPIR_partial_attach_ok support within SLURM and pointed him to a draft 
> of this document. In regard to "When an implementation uses a 
> synchronization technique that does not use the MPIR_debug_gate, and 
> does not require the tool to attach to and continue the MPI process, it 
> should define the symbol MPIR_partial_attach_ok (§9.13) in the starter 
> process, and avoid defining MPIR_debug_gate in the MPI processes."
> 
>       We found that this "avoid defining MPIR_debug_gate in the MPI 
> processes" is too strict. Resource management software layers like SLURM 
> would want to support MPIR_partial_attach_ok without having to modify 
> MPI binaries which may already define MPIR_debug_gate. But "undefining" 
> an already defined symbol like this within an MPI binary isn't trivial. 
> Could you consider changing this requirement to "avoid *using* 
> MPIR_debug_gate in the MPI processes?" This is to say having the 
> definition of MPIR_debug_gate within MPI is OK as long as it's not *used* 
> for the synchronization?

My motivation for using the words "should ... avoid defining MPIR_debug_gate" is so that it is unambiguous whether or not the tool needs to set MPIR_debug_gate to 1. It's mostly a matter of efficiency: if MPIR_debug_gate is not defined, then clearly the tool cannot set it. At least for TotalView, it is a very cheap check because it boils down to a client-side symbol table lookup. But...

If MPIR_debug_gate is defined, should the tool set it or not? The tool has no way of knowing if MPIR_debug_gate is being *used*. If there are MPI implementations that define MPIR_partial_attach_ok and depend on MPIR_debug_gate being set to 1 by the tool, then the safest thing the tool can do is to set it if it is defined.

What TotalView does (and always has done) is to set MPIR_debug_gate to 1 if it is defined, regardless of whether or not MPIR_partial_attach_ok is defined. It would be easy enough to change this behavior for the sake of efficiency, but I have no idea what that would break. Assuming that we want the interface to handle those types of MPI implementations, then I think we need to err on the side of safety and say *defined*, not *used*.
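To make the two candidate tool policies concrete, here is a minimal C sketch. The symbol table is a hypothetical stand-in (a plain list of names), and the struct and function names are made up for illustration; this is not TotalView's code. Only the symbol names MPIR_debug_gate and MPIR_partial_attach_ok come from the interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a debugger's client-side symbol table:
 * just the list of symbol names defined by one executable image. */
struct symtab {
    const char *const *names;
    size_t count;
};

/* Returns true if the image defines a symbol with the given name. */
static bool symtab_defines(const struct symtab *t, const char *name)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return true;
    return false;
}

/* "Defined" policy: set the gate whenever the MPI process defines it,
 * regardless of whether the starter defines MPIR_partial_attach_ok. */
static bool set_gate_if_defined(const struct symtab *starter,
                                const struct symtab *mpi_proc)
{
    (void)starter; /* deliberately ignored under this policy */
    return symtab_defines(mpi_proc, "MPIR_debug_gate");
}

/* "Used" policy: trust MPIR_partial_attach_ok and skip the gate.
 * This is cheaper, but it would break an MPI implementation that
 * defines MPIR_partial_attach_ok and still waits on the gate. */
static bool set_gate_unless_partial_attach(const struct symtab *starter,
                                           const struct symtab *mpi_proc)
{
    if (symtab_defines(starter, "MPIR_partial_attach_ok"))
        return false;
    return symtab_defines(mpi_proc, "MPIR_debug_gate");
}
```

The two functions differ only when both symbols are present, which is exactly the case where the tool cannot tell whether the gate is actually being used.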

One thing that is not clear to me is how SLURM controls whether or not an MPI implementation uses MPIR_debug_gate. IIRC, SLURM is using the tracing interface (e.g., ptrace()) to create the process in a stopped state. Is it also doing a symbol table lookup of MPIR_debug_gate and setting it to 1 itself?

In any case, does the following wording work for you?

"When an implementation uses a synchronization technique that does not require the tool to set MPIR_debug_gate, and does not require the tool to attach to and continue the MPI process, it should define the symbol MPIR_partial_attach_ok (§9.13) in the starter process. If possible, an MPI implementation that does not require the tool to set MPIR_debug_gate should avoid defining MPIR_debug_gate in the MPI processes."

>       With respect to this consideration, though, I think it would also 
> be good to be clearer about the relationship between 
> MPIR_partial_attach_ok and MPIR_debug_gate described in Section 9.13 as 
> well. In particular, the requirement,
> 
>       "The tool may choose to ignore the presence of the 
> MPIR_partial_attach_ok symbol and acquire all MPI rank processes. The 
> presence of this symbol does not prevent the tool from using the MPIR 
> synchronization technique to acquire all of the processes, if it so 
> chooses, because setting the MPIR_debug_gate variable (if present) is 
> harmless."
> 
>       is confusing.

Yes, it is confusing. I was trying to convey the following from the mpich-attach.txt document. Look here, line 157: http://www.mcs.anl.gov/research/projects/mpi/mpi-debug/mpich-attach.txt

"
If the symbol MPIR_partial_attach_ok is present in the executable,
then this informs TotalView that the initial startup barrier is
implemented by the MPI system, rather than by having each of the child
processes hang in a loop waiting for the MPIR_debug_gate variable to
be set. Therefore TotalView need only release the initial process to
release the whole MPI job, which can therefore be run _without_ having
to acquire all of the MPI processes which it includes. This is useful
in versions of TotalView which include the possibility of attaching to
processes later in the run (for instance, by selecting only processes
in a specific communicator, or a specific rank process in COMM_WORLD).
TotalView may choose to ignore this and acquire all processes, and its
presence does not prevent TotalView from using the old protocol to
acquire all of the processes. (Since setting the MPIR_debug_gate is
harmless). 
"

MPIR_partial_attach_ok was added well after the MPIR interface had been in use for years. I *think* the intention was that introducing MPIR_partial_attach_ok would not cause old versions of tools (probably just TotalView at the time) that were not aware of MPIR_partial_attach_ok to stop working. As more and more MPIs implemented MPIR_partial_attach_ok, we wanted old versions of TotalView to keep working.

I think that all it is saying is that the tool is not required to honor MPIR_partial_attach_ok and that, if MPIR_debug_gate is defined, the tool might set it anyway, even if the MPI implementation does not require it.
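A rough C sketch of the MPI-process side may help show why setting the gate is harmless. It assumes MPIR_debug_gate is declared as a volatile int, as in the usual MPIR declarations; the two wait functions and the max_polls bound are hypothetical, added only so the sketch terminates. A real implementation would wait indefinitely and would normally sleep between polls.

```c
#include <stdbool.h>

/* Gate variable that the tool sets to 1 through the tracing interface. */
volatile int MPIR_debug_gate = 0;

/* Old protocol: each MPI process hangs in a loop until the tool sets
 * the gate. max_polls exists only to bound this sketch. */
static bool wait_on_debug_gate(long max_polls)
{
    for (long polls = 0; MPIR_debug_gate == 0; polls++) {
        if (polls >= max_polls)
            return false; /* gave up; a real process would keep waiting */
    }
    return true;
}

/* MPIR_partial_attach_ok case: the MPI system implements the startup
 * barrier itself. The gate is never read here, so a tool that sets it
 * anyway changes nothing. */
static bool wait_on_runtime_barrier(bool runtime_released)
{
    return runtime_released;
}
```

In the second case the process is released by the MPI runtime whatever the value of MPIR_debug_gate, which is the sense in which setting it is a no-op.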

> Shouldn't setting the MPIR_debug_gate variable be a 
> NOOP under this condition?

Yes.

> If so, how could the tool acquire the 
> processes using this technique?

I'm not sure I understand the question. Do you mean, "how could the tool *synchronize* startup with the processes?" The answer is that the tool wouldn't need to do anything special to synchronize startup because the MPI implementation is handling it. The tool might *think* it needs to by setting MPIR_debug_gate (if defined) but that would have no effect.

> - Section 7.1
>       In this case, a potential race condition can occur as the tool 
> cannot precisely control when those daemons are launched especially 
> under a synchronization scheme like BlueGene's: The job may start 
> running before daemons are launched.

No, that's not how it works. When the MPI job is launched under the control of the debugger on Blue Gene, the servers are launched and the MPI runtime system arranges for the processes to be created in a stopped state. So, the debugger has an indefinite amount of time to connect to the servers and issue attach requests. As best I can tell, the processes are just empty execution contexts containing no threads. After the debugger issues the CIOD ATTACH request, the process and its initial thread are fully created, and the process stops with a single-step event. When the debugger sees the event, it knows that the attach is complete.

> To minimize this kind of problem, 
> can we be explicit about the requirement as to when the daemons should 
> be launched in relation to when the job is loaded and run?

Sure, I like explicit. What would you like it to say?

> On BlueGene, 
> if the job is loaded and run before daemons are launched onto the I/O 
> nodes and issue (pre)ATTACH commands to the compute node processes, the 
> daemons won't be able to acquire the processes before the processes run 
> away from their initial instructions.

Like I said above, there is no danger of the processes running away on Blue Gene. In fact, after an MPI job is launched under the debugger on Blue Gene, only one user-mode instruction (at "_start") has executed, and the initial thread's PC is at "_start+4".

> - Section 9.2
>       I like that this document encourages the MPI implementation to 
> "share the host and executable name character strings across
> multiple process descriptor entries whenever possible."

Yes, we saw a real-life situation where an MPI implementation didn't share the strings, and that hurt scalability a lot because the debugger was spending a lot of time reading redundant strings through the tracing (ptrace()) interface, which is expensive. If the strings are shared, the debugger's data cache avoids the tracing interface.

> The section 
> states that this would improve tools' performance at scale.

In the particular case we were looking at, it improved TotalView's performance at scale because the time to fetch the MPIR proctable contents was dramatically improved.
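Here is a small C sketch of what sharing means in practice. The MPIR_PROCDESC layout follows the usual MPIR declaration (host name, executable name, pid); the helper names, the round-robin placement of ranks on hosts, and the pid numbering are assumptions made up for the example. The counting function approximates how many separate strings a tool would have to fetch through the tracing interface if it caches by address.

```c
#include <stddef.h>

/* Process descriptor as usually declared for the MPIR interface. */
typedef struct {
    char *host_name;
    char *executable_name;
    int   pid;
} MPIR_PROCDESC;

/* Hypothetical helper: fill the table so that all ranks on a host point
 * at the same host string and all ranks share one executable string.
 * Ranks are placed round-robin over the hosts purely for illustration. */
static void fill_shared(MPIR_PROCDESC *table, size_t nprocs,
                        char **hosts, size_t nhosts,
                        char *exe, int first_pid)
{
    for (size_t i = 0; i < nprocs; i++) {
        table[i].host_name = hosts[i % nhosts];
        table[i].executable_name = exe;
        table[i].pid = first_pid + (int)i;
    }
}

/* Returns 1 if address p already appears in entries [0, upto). */
static int seen_before(const MPIR_PROCDESC *table, size_t upto, const char *p)
{
    for (size_t j = 0; j < upto; j++)
        if (table[j].host_name == p || table[j].executable_name == p)
            return 1;
    return 0;
}

/* Counts distinct string addresses referenced by the table, i.e. the
 * number of strings a tool with an address-keyed cache must read. */
static size_t count_distinct_strings(const MPIR_PROCDESC *table, size_t nprocs)
{
    size_t distinct = 0;
    for (size_t i = 0; i < nprocs; i++) {
        if (!seen_before(table, i, table[i].host_name))
            distinct++;
        if (table[i].executable_name != table[i].host_name &&
            !seen_before(table, i, table[i].executable_name))
            distinct++;
    }
    return distinct;
}
```

With shared strings the count grows with the number of hosts and executables; with a private copy per entry it grows with twice the number of processes.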

> But our 
> experience has been that it would improve RM's scalability as RMs 
> themselves often consist of distributed components and the bigger 
> strings means larger communication overheads within RM's 
> inter-communication. And the impact gets magnified at extreme scale. So, 
> can you consider saying something to the effect that "not sharing these 
> character strings has been a major scalability problem within the RMs 
> themselves as well"?

I see your point, but I'm not sure that we're talking about a problem that is relevant to the MPIR interface. In your case, the RM has to make sure that it sends the MPIR information back to the MPI starter in an efficient manner, possibly by using reduction trees. Once the MPI starter has the information in whatever form it chooses, it has to populate the MPIR procdesc table such that the tool can read the table and the strings pointed to by the table entries through the tracing interface as efficiently as possible.

I think the RM's problem of gathering the MPIR information is one level removed from the MPIR interface, so I'm not sure that your statement belongs. But, I don't feel strongly about it one way or another, so whatever the group decides is fine with me.

> - Section 9.13
>       In addition to the requirement that I mentioned in relation to 
> Section 6.2 above,  can you be more explicit about "the tool need only 
> release the starter process to release the whole MPI job, which can 
> therefore be run without requiring the tool to acquire all of the MPI 
> processes included in the MPI job." Perhaps the tool need only 
> release the starter process from MPIR_Breakpoint?

Good point. It should say that the tool is releasing the MPI starter process from the MPIR_DEBUG_SPAWNED event.
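For reference, a hedged C sketch of the starter side of that event. The declarations of MPIR_debug_state, MPIR_being_debugged, MPIR_Breakpoint, and the state values follow the MPIR declarations as I recall them from the MPICH debugger interface description, so treat them as assumptions to check against the document; starter_report_spawned is a hypothetical helper name.

```c
/* Values of MPIR_debug_state (assumed to match the usual declarations). */
#define MPIR_NULL           0
#define MPIR_DEBUG_SPAWNED  1
#define MPIR_DEBUG_ABORTING 2

/* Set by the starter before calling MPIR_Breakpoint(). */
volatile int MPIR_debug_state = MPIR_NULL;

/* Set by the tool to tell the starter it is running under a debugger. */
volatile int MPIR_being_debugged = 0;

/* The tool plants a breakpoint on this function; the body is
 * intentionally empty. */
void MPIR_Breakpoint(void)
{
}

/* Hypothetical starter-side helper: once the process table is filled
 * in, report the MPIR_DEBUG_SPAWNED event to the tool. */
static void starter_report_spawned(void)
{
    if (!MPIR_being_debugged)
        return;
    MPIR_debug_state = MPIR_DEBUG_SPAWNED;
    MPIR_Breakpoint(); /* tool stops here, reads the state, then continues */
}
```

When MPIR_partial_attach_ok is defined, continuing the starter from this breakpoint is all the tool has to do to let the whole job run.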

Cheers, John D.

> Best,
> Dong
> 
> On 6/14/2010 9:27 AM, Martin Schulz wrote:
>> Hi all,
>>
>> Attached is the latest and updated version of the MPIR document, which John
>> DelSignore put together. The intent is still to publish this through the MPI forum
>> as an official document. The details for this are still tbd. and Jeff will lead a
>> discussion on this topic during the forum this week.
>>
>> We don't have a tools WG meeting scheduled for this meeting, but if you have
>> any comments or feedback (on the document or how we should publish it),
>> please post it to the list. If necessary or useful, we can also dedicate one
>> of the upcoming tools TelCons for this.
>>
>> Thanks!
>>
>> Martin
>>
>> PS: Feel free to distribute the document further, in particular to tool and
>> MPI developers.