[Mpi3-tools] MPI State Profiling Proposal

Martin Schulz schulzm at llnl.gov
Mon Dec 15 03:11:36 CST 2008


Hi Marty, all,

I added this to the wiki - it is now at:

https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/MPI3Tools/state

Martin


At 02:16 PM 12/12/2008, you wrote:
>I'm attaching a draft of the MPI State Profiling description.  Can someone
>please add it in the right place to the website in time for the 
>meeting Monday?
>
>    Thanks,
>       Marty
>
>
>
>
>MPI State Profiling
>
>Marty Itzkowitz
>Sun Microsystems
>
>Introduction
>
>Most current profiling of MPI applications is based on tracing API 
>calls into the MPI runtime library, with, for example, VampirTrace. 
>The data collected shows the API calls and the messages that are 
>sent and received. In general, trace data from all ranks is needed 
>to match sends with receives, and that causes high data volume and 
>other scalability issues.
>
>We propose here an additional means of profiling MPI applications, 
>based on statistical sampling of the MPI runtime state. It scales 
>better than MPI API tracing and supports selective data collection.
>
>The proposal described here has been implemented in Sun's 
>ClusterTools 8.1 product, based on Open MPI. Sun will be happy to 
>contribute the code back to the Open MPI tree.
>
>MPI State Profiling
>
>The basic idea is to have the MPI runtime maintain a per-thread 
>state variable, and provide an asynchronous-signal-safe API for a 
>thread to read its own state. In the prototype implementation, there 
>are three states:
>  - Not in MPI
>  - MPI-Work (in the MPI runtime, and working)
>  - MPI-Stall (in the MPI runtime, but stalled for some reason)
>The interface, as prototyped, only has one state for MPI-Stall, no 
>matter why the runtime is stalled. It would be a simple extension to 
>define multiple MPI-Stall states, one for each reason it is stalled.
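>
>As a minimal sketch, the per-thread state might be represented by an 
>enumeration such as the following; the value names here are 
>illustrative, and only the type name OMPI_state_t is taken from the 
>API described below:
>
>    /* Hypothetical encoding of the per-thread MPI runtime state. */
>    typedef enum {
>        OMPI_NOT_IN_MPI = 0,  /* executing application code          */
>        OMPI_MPI_WORK   = 1,  /* in the MPI runtime, making progress */
>        OMPI_MPI_STALL  = 2   /* in the MPI runtime, but stalled     */
>    } OMPI_state_t;
>
>Defining additional stall values, one per stall reason, would be the 
>extension mentioned above.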
>
>MPI-Stall time is accumulated when the thread is stalled within MPI, 
>whether the thread is busy-waiting (when the process is also 
>accumulating User CPU Time) or sleep-waiting (when the process is not 
>accumulating User CPU Time). In other words, MPI-Stall time is a 
>statistical measure of the time the process spends blocked inside 
>the MPI library.
>
>Note that MPI-Stall time is not a measure of time spent in the 
>MPI_Wait call. Some of the time in that call is MPI-Work time, and 
>some of the time spent in other calls may be MPI-Stall time.
>
>MPI-Work means that the runtime is doing something other than 
>waiting. It is not necessarily directly useful work from the point 
>of view of the application; it may be what the user would think of 
>as overhead.
>
>A profiler can then read the state on each profile tick, and 
>accumulate a statistical measure of time spent in MPI-Work and time 
>spent in MPI-Stall. If callstacks are recorded at the same time as 
>the state is captured, that MPI state can be attributed to the 
>frames in the callstack, allowing the user to see which MPI calls 
>spend the most time stalled, and how those calls were reached.
>
>Runtime Implementation
>
>Implementation requires three things: the maintenance of the state, 
>an API to read it, and a rendezvous mechanism for the profiler to 
>find the API.
>
>Maintenance of the state is straightforward. Every thread starts out 
>in the Not-in-MPI state. Whenever an API call is entered, the state 
>for the calling thread is changed to MPI-Work; whenever an API call 
>returns, the state is changed back to Not-in-MPI. During processing, 
>whenever the OMPI progress engine determines that no progress is 
>being made on any outstanding request, the state is changed to 
>MPI-Stall; whenever the progress engine detects progress being made, 
>the state is changed back to MPI-Work. The work-to-stall transition 
>points are very implementation-specific, and must be placed 
>carefully to avoid high overhead.
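>
>A minimal sketch of these transitions, using the illustrative 
>OMPI_state_t values from the earlier sketch and invented helper and 
>stub names rather than the actual Open MPI internals:
>
>    /* One state word per thread; a plain store keeps the reader
>       async-signal-safe. */
>    static __thread volatile OMPI_state_t ompi_thread_state = OMPI_NOT_IN_MPI;
>
>    static void ompi_set_state(OMPI_state_t s) { ompi_thread_state = s; }
>
>    /* Stubs standing in for the real progress engine. */
>    static int hypothetical_progress_poll(void)    { return 1; }
>    static int hypothetical_request_complete(void) { return 1; }
>
>    /* Entry/exit pattern for a blocking MPI call. */
>    int hypothetical_mpi_call(void)
>    {
>        ompi_set_state(OMPI_MPI_WORK);            /* API entry           */
>        while (!hypothetical_request_complete()) {
>            if (hypothetical_progress_poll())
>                ompi_set_state(OMPI_MPI_WORK);    /* progress detected   */
>            else
>                ompi_set_state(OMPI_MPI_STALL);   /* no request advanced */
>        }
>        ompi_set_state(OMPI_NOT_IN_MPI);          /* API return          */
>        return 0;
>    }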
>
>The API is quite simple. There is only one function:
>OMPI_state_t (*mpi_collector_api)(void*)
>
>The return value is an enum representing the current state for the 
>calling thread. That function must be asynchronous-signal-safe, so 
>that it can be called from a signal handler processing a profile tick.
>
>Rendezvous is also simple. The profiling code asks the runtime 
>linker whether the API described above is present in the process 
>address space. If the symbol is found, the functionality is 
>available; if not, it is not.
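>
>A sketch of how this rendezvous might look, assuming the function is 
>exported under the symbol name mpi_collector_api; dlsym() with 
>RTLD_DEFAULT is the usual runtime-linker query on Solaris and Linux:
>
>    #define _GNU_SOURCE            /* for RTLD_DEFAULT on glibc */
>    #include <dlfcn.h>
>    #include <stdio.h>
>
>    typedef int OMPI_state_t;      /* stand-in for the runtime's enum */
>    typedef OMPI_state_t (*mpi_collector_api_fn)(void *);
>
>    int main(void)                 /* compile with -ldl on Linux */
>    {
>        /* Look the symbol up in whatever is already loaded. */
>        mpi_collector_api_fn api =
>            (mpi_collector_api_fn)dlsym(RTLD_DEFAULT, "mpi_collector_api");
>        printf("MPI state profiling %savailable\n", api ? "" : "not ");
>        return 0;
>    }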
>
>Using the API
>
>In the prototype, the Sun Studio Profiler, if the API is available, 
>invokes it on every clock-profiling tick. MPI-Work time is recorded 
>as such, unless the thread is not on CPU, in which case it is 
>converted to MPI-Stall time. On Linux, ticks do not occur when the 
>thread is not on CPU, so the technique is valid only for busy-waits.
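>
>A minimal sketch of this use, with invented names and a flat tick 
>counter standing in for the profiler's real per-thread, 
>callstack-tagged records; the rendezvous that would set 
>mpi_collector_api is omitted:
>
>    #include <signal.h>
>    #include <stdio.h>
>    #include <sys/time.h>
>
>    typedef enum {
>        OMPI_NOT_IN_MPI, OMPI_MPI_WORK, OMPI_MPI_STALL
>    } OMPI_state_t;
>    typedef OMPI_state_t (*mpi_collector_api_fn)(void *);
>
>    static mpi_collector_api_fn mpi_collector_api;  /* set at rendezvous    */
>    static volatile unsigned long state_ticks[3];   /* ticks seen per state */
>
>    /* SIGPROF handler: the API is async-signal-safe, so it may be called
>       here.  A real profiler would also record the callstack and, if the
>       thread is off CPU, convert MPI-Work to MPI-Stall. */
>    static void profile_tick(int sig)
>    {
>        OMPI_state_t s = OMPI_NOT_IN_MPI;
>        (void)sig;
>        if (mpi_collector_api != NULL)
>            s = mpi_collector_api(NULL);
>        state_ticks[s]++;
>    }
>
>    int main(void)
>    {
>        struct itimerval it = { { 0, 10000 }, { 0, 10000 } };  /* 100 Hz */
>        volatile unsigned long spin;
>        signal(SIGPROF, profile_tick);
>        setitimer(ITIMER_PROF, &it, NULL);       /* start the profile clock */
>        for (spin = 0; spin < 200000000UL; spin++)
>            ;                                    /* busy work to take ticks */
>        printf("not-in-MPI %lu, work %lu, stall %lu\n",
>               state_ticks[0], state_ticks[1], state_ticks[2]);
>        return 0;
>    }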
>
>Overhead in maintaining the state is an important consideration. 
>There is relatively little overhead in Sun's implementation for most 
>MPI operations, but there is some performance cost. The worst case 
>we've seen is in the latency for short messages, where the 
>instrumented runtime shows an increase of roughly 100-200 ns, or 
>about 10%. For that reason, Sun chose to ship both instrumented and 
>uninstrumented libraries. We also provide a flag to mpirun that 
>automatically switches to the instrumented libraries.
>
>Scalability Differences between API Tracing and Clock-Profiling
>
>MPI can be used for very large-scale jobs, using hundreds, if not 
>thousands, of processors and MPI processes. In such cases, the 
>scalability of the performance measurements becomes significant.
>
>MPI API tracing records data proportional to the number of calls and 
>messages, collected from all ranks, and suffers dilation in the same 
>proportion. If the data volume becomes too great, the time to match 
>the sends and the receives becomes unacceptable. Rank-selective 
>collection of MPI API trace data cannot be done without giving up on 
>matching sends and receives: if data from some ranks is missing, 
>that matching cannot be done. Likewise, time-selective API tracing 
>presents problems at the boundaries, where part of a transaction may 
>be present and part absent.
>
>On the other hand, the data volume and overhead from clock-profiling 
>depend on the total time the processes run and on the profiling 
>frequency. Since the data is a statistical sample, rank-selective 
>data collection is reasonable as long as the selected ranks are 
>representative. Likewise, time-selective data collection for any 
>representative interval is reasonable. Furthermore, data volume and 
>distortion can be managed by changing the profiling frequency.
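>
>To make the difference concrete with purely illustrative numbers: a 
>1,000-rank job in which each rank issues ten million traced calls 
>yields on the order of 10^10 trace records however long it runs, 
>whereas clock-profiling the same job at 100 samples per second for 
>an hour yields about 3.6 x 10^8 samples, and halving either the 
>sampling frequency or the number of profiled ranks halves that 
>volume.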
>
>_______________________________________________
>Mpi3-tools mailing list
>Mpi3-tools at lists.mpi-forum.org
>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-tools

_______________________________________________________________________
Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulz6
CASC @ Lawrence Livermore National Laboratory, Livermore, USA  