[Mpi3-tools] MPI State Profiling Proposal
Martin Schulz
schulzm at llnl.gov
Mon Dec 15 03:11:36 CST 2008
Hi Marty, all,
I added this to the wiki - it is now at:
https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/MPI3Tools/state
Martin
At 02:16 PM 12/12/2008, you wrote:
>I'm attaching a draft of the MPI State Profiling description. Can someone
>please add it in the right place to the website in time for the
>meeting Monday?
>
> Thanks,
> Marty
>
>MPI State Profiling
>
>Marty Itzkowitz
>Sun Microsystems
>
>Introduction
>
>Most current profiling for MPI applications is based on tracing API
>calls into the MPI runtime library with, for example, VampirTrace.
>The data collected shows the API calls and the messages that are sent
>and received. In general, trace data from all ranks is needed to
>match sends with receives, which causes high data volume and other
>scalability issues.
>
>Here we propose an additional means of profiling MPI applications,
>based on statistical sampling of the MPI runtime state. It scales
>better than MPI API tracing and supports selective data collection.
>
>The proposal described here has been implemented in Sun's
>ClusterTools 8.1 product, based on Open MPI. Sun will be happy to
>contribute the code back to the Open MPI tree.
>
>MPI State Profiling
>
>The basic idea is to have the MPI runtime maintain a per-thread
>state variable, and provide an asynchronous-signal-safe API for a
>thread to read its own state. In the prototype implementation, there
>are three states:
>Not in MPI
>MPI-Work (In the MPI runtime, and working)
>MPI-Stall (In the MPI runtime, but stalled for some reason)
>The interface, as prototyped, only has one state for MPI-Stall, no
>matter why the runtime is stalled. It would be a simple extension to
>define multiple MPI-Stall states, one for each reason it is stalled.
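>
>A minimal sketch of how such a state enumeration might be declared is
>shown below. The enumerator names are illustrative only; the actual
>values of OMPI_state_t are defined by the instrumented runtime.
>
>    /* Illustrative enumerator names; the real OMPI_state_t is
>     * defined by the instrumented MPI runtime. */
>    typedef enum {
>        OMPI_STATE_NOT_IN_MPI = 0,  /* executing application code      */
>        OMPI_STATE_MPI_WORK   = 1,  /* inside the MPI runtime, working */
>        OMPI_STATE_MPI_STALL  = 2   /* inside the MPI runtime, stalled */
>    } OMPI_state_t;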
>
>MPI-Stall time is accumulated when the thread is stalled within MPI,
>whether the thread is busy-waiting (in which case the process is also
>accumulating User CPU Time) or sleep-waiting (in which case it is
>not). In other words, MPI-Stall time is a statistical measure of the
>time the process is blocked inside the MPI library.
>
>Note that MPI-Stall time is not a measure of time spent in the
>MPI_Wait call. Some of the time in that call is MPI-Work time, and
>some of the time spent in other calls may be MPI-Stall time.
>
>MPI-Work means that the runtime is doing something other than
>waiting. It is not necessarily directly-useful work from the point
>of view of the application -- it may be what the user would think of
>as overhead.
>
>A profiler can then read the state on each profile tick, and
>accumulate a statistical measure of time spent in MPI-Work and time
>spent in MPI-Stall. If callstacks are recorded at the same time the
>state is captured, that MPI state can be attributed to the frames in
>the callstack, allowing the user to see which MPI calls spend the
>most time stalled, and how the application reached each such call.
>
>Runtime Implementation
>
>Implementation requires three things: the maintenance of the state,
>an API to read it, and a rendezvous mechanism for the profiler to
>find the API.
>
>Maintenance of the state is straightforward. Each thread starts out
>in the Not-in-MPI state. Whenever an API call is entered, the state
>for the calling thread is changed to MPI-Work; whenever an API call
>returns, the state is changed back to Not-in-MPI. During the
>processing of a call, whenever the OMPI progress engine determines
>that no progress is being made on any outstanding requests, the state
>is changed to MPI-Stall; whenever the progress engine detects that
>progress is being made, the state is changed back to MPI-Work. The
>work-to-stall transition points are very implementation-specific, and
>must be placed carefully to avoid high overhead.
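>
>As a rough sketch, assuming illustrative names rather than the actual
>Open MPI internals, the maintenance could be a thread-local variable
>written at each transition point:
>
>    /* Sketch only -- not the actual Open MPI code. */
>    static __thread volatile OMPI_state_t ompi_thread_state =
>        OMPI_STATE_NOT_IN_MPI;
>
>    /* Wrapped around entry to and exit from every MPI API routine. */
>    static inline void state_enter_mpi(void)
>    { ompi_thread_state = OMPI_STATE_MPI_WORK; }
>    static inline void state_leave_mpi(void)
>    { ompi_thread_state = OMPI_STATE_NOT_IN_MPI; }
>
>    /* Called by the progress engine when it finds that no progress
>     * is being made, and again when progress resumes. */
>    static inline void state_mark_stall(void)
>    { ompi_thread_state = OMPI_STATE_MPI_STALL; }
>    static inline void state_mark_progress(void)
>    { ompi_thread_state = OMPI_STATE_MPI_WORK; }
>
>Each transition is then a single thread-local store; the costly part
>is deciding where in the progress engine the transitions belong.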
>
>The API is quite simple. There is only one function:
>OMPI_state_t (*mpi_collector_api)(void*)
>
>The return value is an enum representing the current state for the
>calling thread. That function must be asynchronous-signal-safe, so
>that it can be called from a signal handler processing a profile tick.
>
>Rendezvous is also simple. The profiler asks the runtime linker
>whether the API described above is present in the process address
>space. If the symbol is found, the functionality is available;
>otherwise it is not.
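>
>A sketch of that rendezvous is shown below. The declaration above
>reads as a function-pointer variable, and the sketch treats it that
>way; the exact symbol type is a detail of the prototype.
>
>    #define _GNU_SOURCE   /* for RTLD_DEFAULT on glibc */
>    #include <dlfcn.h>
>
>    typedef OMPI_state_t (*mpi_collector_api_t)(void *);
>    static mpi_collector_api_t collector_api;  /* NULL if unavailable */
>
>    static void rendezvous(void)
>    {
>        /* Look for the collector symbol anywhere in the process
>         * address space; if it is absent, no state data is taken. */
>        mpi_collector_api_t *sym =
>            (mpi_collector_api_t *)dlsym(RTLD_DEFAULT,
>                                         "mpi_collector_api");
>        if (sym != NULL)
>            collector_api = *sym;
>    }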
>
>Using the API
>
>In the prototype, the Sun Studio Profiler invokes the API, if it is
>available, on every clock-profiling tick. MPI-Work time is recorded
>as such, unless the thread is not on a CPU, in which case it is
>converted to MPI-Stall time. On Linux, ticks do not occur when the
>thread is not on a CPU, so the technique is valid only for
>busy-waits.
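>
>As an illustration of the tick handling (not the Sun Studio
>collector's actual code), a clock-profiling signal handler might use
>the collector_api pointer from the rendezvous sketch above like this;
>a real profiler would also record the callstack and the on-CPU status
>at the same point:
>
>    #include <signal.h>
>
>    /* Per-thread tick counters, indexed by OMPI_state_t. */
>    static __thread unsigned long state_ticks[3];
>
>    static void profile_tick(int sig, siginfo_t *si, void *ctx)
>    {
>        (void)sig; (void)si; (void)ctx;
>        /* mpi_collector_api is async-signal-safe by contract, so it
>         * can be called directly from the SIGPROF handler. */
>        OMPI_state_t s = (collector_api != NULL)
>                             ? collector_api(NULL)
>                             : OMPI_STATE_NOT_IN_MPI;
>        state_ticks[s]++;
>    }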
>
>Overhead in maintaining the state is an important consideration.
>There is relatively little overhead in Sun's implementation for most
>MPI operations, but there is some performance cost. The worst case
>we have seen is in the latency for short messages, where the
>instrumented runtime shows an increase of about 100-200 ns, or about
>10%. For that reason, Sun chose to ship both instrumented and
>uninstrumented libraries. We also provide a flag to mpirun that
>automatically switches to the instrumented libraries.
>
>Scalability Differences between API Tracing and Clock-Profiling
>
>MPI can be used for very large scale jobs, using hundreds, if not
>thousands, of processors and MPI processes. In such cases,
>scalability of the performance measurements becomes significant.
>
>MPI API tracing records data proportional to the number of calls and
>messages, collected from all ranks, and suffers runtime dilation in
>the same proportion. If the data volume becomes too great, the time
>to match the sends and the receives becomes unacceptable.
>Rank-selective collection of MPI API trace data cannot be done
>without giving up on matching sends and receives: if data from some
>ranks is missing, that matching cannot be done. Likewise,
>time-selective API tracing presents problems at the boundaries, where
>part of a transaction may be present and part absent.
>
>The data volume and overhead from clock-profiling, on the other hand,
>depend on the total time the processes run and on the profiling
>frequency. Since the data is a statistical sample, rank-selective
>data collection is reasonable as long as the selected ranks are
>representative. Likewise, time-selective data collection over any
>representative interval is reasonable. Furthermore, data volume and
>distortion can be managed by changing the profiling frequency.
>
>_______________________________________________
>Mpi3-tools mailing list
>Mpi3-tools at lists.mpi-forum.org
>http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-tools
_______________________________________________________________________
Martin Schulz, schulzm at llnl.gov, http://people.llnl.gov/schulz6
CASC @ Lawrence Livermore National Laboratory, Livermore, USA