[Mpi3-tools] Current Version of the MPIR document

Dong H. Ahn ahn1 at llnl.gov
Wed Jun 30 16:12:18 CDT 2010


Hi John,

Comments are in-lined below.

On 6/30/2010 5:56 AM, John DelSignore wrote:
> Hi Dong,
>
> My comments in-line below...
>
> Dong H. Ahn wrote:
>    
>> Hi Martin, John and all,
>>
>> I really like the idea of publishing this as an official document
>> through the MPI forum. I've helped put together several machine
>> RFPs, and I've always hoped to have this kind of document to point
>> to when describing this required interface.
>>
>> Here are some questions and comments:
>>
>> - Section 6.2
>>        I recently worked with a SLURM developer to implement
>> MPIR_partial_attach_ok support within SLURM and pointed him to a draft
>> of this document. In regard to "When an implementation uses a
>> synchronization technique that does not use the MPIR_debug_gate, and
>> does not require the tool to attach to and continue the MPI process, it
>> should define the symbol MPIR_partial_attach_ok (§9.13) in the starter
>> process, and avoid defining MPIR_debug_gate in the MPI processes."
>>
>>        We found that this "avoid defining MPIR_debug_gate in the MPI
>> processes" is too strict. Resource management software layers like SLURM
>> would want to support MPIR_partial_attach_ok without having to modify
>> MPI binaries, which may already define MPIR_debug_gate. But "undefining"
>> an already defined symbol like this within an MPI binary isn't trivial.
>> Could you consider changing this requirement to "avoid *using*
>> MPIR_debug_gate in the MPI processes?" That is to say, having the
>> definition of MPIR_debug_gate within the MPI binary is OK as long as it
>> is not *used* for the synchronization?
>>      
> My motivation for using the words "should ... avoid defining MPIR_debug_gate" is so that it is unambiguous whether or not the tool needs to set MPIR_debug_gate to 1. It's mostly a matter of efficiency: if MPIR_debug_gate is not defined, then clearly the tool cannot set it. At least for TotalView, it is a very cheap check because it boils down to a client-side symbol table lookup. But...
>
> If MPIR_debug_gate is defined, should the tool set it or not? The tool has no way of knowing if MPIR_debug_gate is being *used*. If there are MPI implementations that define MPIR_partial_attach_ok and depend on MPIR_debug_gate being set to 1 by the tool, then the safest thing the tool can do is to set it if it is defined.
>
> What TotalView does (and always has done) is to set MPIR_debug_gate to 1 if it is defined, regardless of whether or not MPIR_partial_attach_ok is defined. It would be easy enough to change this behavior for the sake of efficiency, but I have no idea what that would break. Assuming that we want the interface to handle those types of MPI implementations, then I think we need to err on the side of safety and say *defined*, not *used*.
>
> One thing that is not clear to me is how SLURM controls whether or not an MPI implementation uses MPIR_debug_gate. IIRC, SLURM is using the tracing interface (e.g., ptrace()) to create the process in a stopped state. Is it also doing a symbol table lookup of MPIR_debug_gate and setting it to 1 itself?
>    

AFAIK, MPIR_debug_gate has always been a NOOP under SLURM. The concern
was that they don't know how to undefine the already defined
MPIR_debug_gate symbol if its absence is required to provide
MPIR_partial_attach_ok support. The wording change that you suggest
below should address that concern.
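
Just to make the defined-vs-used distinction concrete, the gate-based
synchronization the current wording is worried about looks roughly like
the following inside the MPI processes. This is only a minimal sketch,
not any particular implementation; the wait_for_tool() helper and the
sleep interval are illustrative assumptions, while MPIR_debug_gate and
MPIR_being_debugged are the standard MPIR symbols:

    #include <unistd.h>   /* sleep() */

    /* Symbols the tool looks up and sets through the tracing interface. */
    volatile int MPIR_debug_gate = 0;
    volatile int MPIR_being_debugged = 0;

    /* Conventional gate-based startup synchronization: each MPI process
     * spins here (typically from MPI_Init, and only when
     * MPIR_being_debugged has been set) until the tool writes 1 into
     * MPIR_debug_gate. An implementation that supports
     * MPIR_partial_attach_ok through some other mechanism simply never
     * calls this, even though the MPIR_debug_gate symbol may still be
     * present in the binary. */
    static void wait_for_tool(void)
    {
        while (MPIR_debug_gate == 0)
            sleep(1);
    }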

> In any case, does the following wording work for you?
>
> "When an implementation uses a synchronization technique that does not require the tool to set MPIR_debug_gate, and does not require the tool to attach to and continue the MPI process, it should define the symbol MPIR_partial_attach_ok (§9.13) in the starter process. If possible an MPI implementation that does not require the tool to set MPIR_debug_gate should avoid defining MPIR_debug_gate in the MPI processes."
>
>    
>>        With respect to this consideration, though, I think it would also
>> be good to be clearer about the relationship between
>> MPIR_partial_attach_ok and MPIR_debug_gate described in Section 9.13 as
>> well. In particular, the requirement,
>>
>>        "The tool may choose to ignore the presence of the
>> MPIR_partial_attach_ok symbol and acquire all MPI rank processes. The
>> presence of this symbol does not prevent the tool from using the MPIR
>> synchronization technique to acquire all of the processes, if it so
>> chooses, because setting the MPIR_debug_gate variable (if present) is
>> harmless."
>>
>>        is confusing.
>>      
> Yes, it is confusing. I was trying to convey the following from the mpich-attach.txt document. Look here, line 157: http://www.mcs.anl.gov/research/projects/mpi/mpi-debug/mpich-attach.txt
>
> "
> If the symbol MPIR_partial_attach_ok is present in the executable,
> then this informs TotalView that the initial startup barrier is
> implemented by the MPI system, rather than by having each of the child
> processes hang in a loop waiting for the MPIR_debug_gate variable to
> be set. Therefore TotalView need only release the initial process to
> release the whole MPI job, which can therefore be run _without_ having
> to acquire all of the MPI processes which it includes. This is useful
> in versions of TotalView which include the possibility of attaching to
> processes later in the run (for instance, by selecting only processes
> in a specific communicator, or a specific rank process in COMM_WORLD).
> TotalView may choose to ignore this and acquire all processes, and its
> presence does not prevent TotalView from using the old protocol to
> acquire all of the processes. (Since setting the MPIR_debug_gate is
> harmless).
> "
>
> MPIR_partial_attach_ok was added well after the MPIR interface had been in use for years. I *think* that the intention was that introducing MPIR_partial_attach_ok would not cause old versions of tools (probably just TotalView at the time) that were not aware of MPIR_partial_attach_ok to stop working. As more and more MPIs implemented MPIR_partial_attach_ok, we wanted old versions of TotalView to keep working.
>
> I think that all it is saying is that the tool is not required to honor MPIR_partial_attach_ok and if MPIR_debug_gate is defined the tool might set it anyway even if the MPI implementation does not require it.
>
>    
>> Shouldn't "setting the MPIR_debug_gate variable" be a
>> NOOP under this condition?
>>      
> Yes.
>
>    
>> If so, how could the tool acquire the
>> processes using this technique?
>>      
> I'm not sure I understand the question. Do you mean, "how could the tool *synchronize* startup with the processes?" The answer is that the tool wouldn't need to do anything special to synchronize startup because the MPI implementation is handling it. The tool might *think* it needs to, and set MPIR_debug_gate (if defined), but that would have no effect.
>
>    

Thank you for the clarification. I propose changing the wording to:

       "The tool may choose to ignore the presence of the
MPIR_partial_attach_ok symbol and acquire all MPI rank processes. The
presence of this symbol does not prevent the tool from setting
the MPIR_debug_gate variable (if defined), which should have no effect."
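
For what it's worth, the tool-side behavior this wording implies could
be sketched in C-like pseudocode as follows. The find_symbol(),
poke_int(), attach_to(), continue_process(), rank(), and
user_selected() helpers are hypothetical stand-ins for the tool's
symbol-lookup, tracing, and process-control layers; only the MPIR_*
symbol names come from the interface:

    /* Was MPIR_partial_attach_ok defined in the starter process? */
    int partial_ok = (find_symbol(starter, "MPIR_partial_attach_ok") != 0);

    /* nranks and rank(i) come from the MPIR_proctable read earlier. */
    for (int i = 0; i < nranks; i++) {
        if (partial_ok && !user_selected(i))
            continue;                        /* partial attach: skip this rank */
        attach_to(rank(i));
        unsigned long gate = find_symbol(rank(i), "MPIR_debug_gate");
        if (gate != 0)
            poke_int(rank(i), gate, 1);      /* harmless even if unused */
        continue_process(rank(i));
    }
    /* With MPIR_partial_attach_ok defined, releasing the starter from the
     * MPIR_DEBUG_SPAWNED event is enough to release the whole job. */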


>> - Section 7.1
>>        In this case, a potential race condition can occur as the tool
>> cannot precisely control when those daemons are launched, especially
>> under a synchronization scheme like BlueGene's: the job may start
>> running before the daemons are launched.
>>      
> No, that's not how it works. When the MPI job is launched under the control of the debugger on Blue Gene, the servers are launched and the MPI runtime system arranges for the processes to be created in a stopped state. So, the debugger has an indefinite amount of time to connect to the servers and issue attach requests. As best I can tell, the processes are just empty execution contexts containing no threads. After the debugger issues the CIOD ATTACH request, the process and initial thread are fully created, and the thread stops with a single-step event. When the debugger sees the event, it knows that the attach is complete.
>
>    
>> To minimize this kind of problem,
>> can we be explicit about the requirement as to when the daemons should
>> be launched in relation to when the job is loaded and run?
>>      
> Sure, I like explicit. What would you like it to say?
>    

How about adding the following as item #5:

5. For launching, the tool holds the starter process at the
MPIR_DEBUG_SPAWNED event until the tool daemons are launched and
communications with them are established, so that the MPI rank
processes are not released prematurely.
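
In tool terms, the sequencing this item asks for would look roughly
like the following at the MPIR_Breakpoint event in the starter. This is
C-like pseudocode; read_int_symbol(), read_proctable(),
launch_tool_daemons(), wait_for_daemon_connections(), and
continue_process() are hypothetical helpers, and only the MPIR_* names
come from the interface:

    /* At the breakpoint planted on MPIR_Breakpoint in the starter: */
    if (read_int_symbol(starter, "MPIR_debug_state") == MPIR_DEBUG_SPAWNED) {
        read_proctable(starter);         /* MPIR_proctable / MPIR_proctable_size */
        launch_tool_daemons();           /* hypothetical: start daemons on the nodes */
        wait_for_daemon_connections();   /* hypothetical: comms established */
        continue_process(starter);       /* only now release the starter, and
                                          * with it the MPI rank processes */
    }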


>    
>> On BlueGene,
>> if the job is loaded and run before the daemons are launched onto the I/O
>> nodes and have issued (pre)ATTACH commands to the compute node processes, the
>> daemons won't be able to acquire the processes before the processes run
>> away from their initial instructions.
>>      
> Like I said above, there is no danger of the processes running away on Blue Gene. In fact, on Blue Gene, after launching an MPI job under the debugger, only one user-mode instruction at "_start" is executed, and the initial thread's PC is at "_start+4".
>
>    
>> - Section 9.2
>>        I like that this document encourages the MPI implementation to
>> "share the host and executable name character strings across
>> multiple process descriptor entries whenever possible."
>>      
> Yes, we saw a real-life situation where an MPI implementation didn't share the strings, and that hurt scalability a lot because the debugger was spending a lot of time reading redundant strings through the tracing (ptrace()) interface, which is expensive. If the strings are shared, the debugger's data cache avoids those redundant trips through the tracing interface.
>
>    
>> The section
>> states that this would improve tools' performance at scale.
>>      
> In the particular case we were looking at, it improved TotalView's performance at scale because the time to fetch the MPIR proctable contents was dramatically reduced.
>
>    
>> But our
>> experience has been that it would also improve the RM's scalability, as
>> RMs themselves often consist of distributed components, and bigger
>> strings mean larger communication overhead within the RM's internal
>> communication. The impact gets magnified at extreme scale. So, could
>> you consider saying something to the effect that "not sharing these
>> character strings has been a major scalability problem within the RMs
>> themselves as well"?
>>      
> I see your point, but I'm not sure that we're talking about a problem that is relevant to the MPIR interface. In your case, the RM has to make sure that it sends the MPIR information back to the MPI starter in an efficient manner, possibly by using reduction trees. Once the MPI starter has the information in whatever form it chooses, it has to populate the MPIR procdesc table such that the tool can read the table and the strings pointed to by the table entries through the tracing interface as efficiently as possible.
>
> I think the RM's problem of gathering the MPIR information is one level removed from the MPIR interface, so I'm not sure that your statement belongs. But, I don't feel strongly about it one way or another, so whatever the group decides is fine with me.
>
>    

Actually, the RM scalability problem I was referring to has nothing to
do with proctable information gathering (e.g., via a reduction network).
It was due to poor string handling within the back-end and front-end
mpiruns: "strcat"ing lots of individual hostname and executable_name
strings was a performance killer. Essentially, this is the "producer"
side of the real-world TotalView performance problem you are
describing.

Perhaps we could change the last sentence of Section 9.2 to:

"Sharing the strings enhances the scalability of the starter process and 
the tool by allowing them to avoid generating and reading redundant 
character strings."
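
On the starter side, the sharing could look roughly like the sketch
below. MPIR_PROCDESC and the table symbols are from the interface;
intern_string() and the host_of()/exe_of()/pid_of() accessors are
hypothetical helpers that return one canonical copy per distinct string
and the per-rank data, respectively:

    /* Process descriptor as defined by the MPIR interface. */
    typedef struct {
        char *host_name;
        char *executable_name;
        int   pid;
    } MPIR_PROCDESC;

    MPIR_PROCDESC *MPIR_proctable;
    int MPIR_proctable_size;

    /* Point every entry at a single shared copy of each distinct string
     * instead of strcat()/strdup()'ing a new one per rank. */
    static void fill_proctable(void)
    {
        for (int i = 0; i < MPIR_proctable_size; i++) {
            MPIR_proctable[i].host_name       = intern_string(host_of(i));
            MPIR_proctable[i].executable_name = intern_string(exe_of(i));
            MPIR_proctable[i].pid             = pid_of(i);
        }
    }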


>> - Section 9.13
>>        In addition to the requirement that I mentioned in relation to
>> Section 6.2 above, can you be more explicit about "the tool need only
>> release the starter process to release the whole MPI job, which can
>> therefore be run without requiring the tool to acquire all of the MPI
>> processes included in the MPI job." Perhaps the tool need only release
>> the starter process from MPIR_Breakpoint?
>>      
> Good point. It should say that the tool is releasing the MPI starter process from the MPIR_DEBUG_SPAWNED event.
>
> Cheers, John D.
>
>
>    
>> Best,
>> Dong
>>
>> On 6/14/2010 9:27 AM, Martin Schulz wrote:
>>      
>>> Hi all,
>>>
>>> Attached is the latest and updated version of the MPIR document, which John
>>> DelSignore put together. The intent is still to publish this through the MPI forum
>>> as an official document. The details for this are still tbd. and Jeff will lead a
>>> discussion on this topic during the forum this week.
>>>
>>> We don't have a tools WG meeting scheduled for this meeting, but if you have
>>> any comments or feedback (on the document or how we should publish it),
>>> please post them to the list. If necessary or useful, we can also dedicate
>>> one of the upcoming tools TelCons to this.
>>>
>>> Thanks!
>>>
>>> Martin
>>>
>>> PS: Feel free to distribute the document further, in particular to tool and
>>> MPI developers.
>>>
>>>
>>>
>>>
>>>        
>>      
>    



