<HTML>

<HEAD>

<TITLE>MPI State Profiling</TITLE>

</HEAD>

<BODY BGCOLOR="lightgoldenrodyellow">

<center>

<H1>

MPI State Profiling

</H1>

<h2><i>Marty Itzkowitz</i></h2>

<h3><i>Sun Microsystems</i></h3>

</center>

<h3>Introduction</h3>

<blockquote>

Most current profiling for MPI applications is based on tracing

API calls into the MPI runtime library with, for example, VampirTrace.

The data collected shows API calls and the messages that are

sent and received.  In general, all trace data from all ranks

is needed in order to be able to match sends and receives,

and that causes high data volume and other scalability issues.

<p>

We propose here an additional means of profiling MPI

applications based on statistical sampling of the MPI runtime

state.  It scales better than MPI API tracing,

and supports collecting data selectively by rank or by time interval.

<p>

The proposal described here has been implemented in Sun's ClusterTools 8.1

product, based on Open MPI.  Sun will be happy to contribute the code back

to the Open MPI tree.

</blockquote>

<h3>MPI State Profiling</h3>

<blockquote>

<p>

The basic idea is to have the MPI runtime maintain a per-thread

state variable, and provide an asynchronous-signal-safe API

for a thread to read its own state.  In the prototype

implementation, there are three states:

<ul>

<li>Not in MPI

<li>MPI-Work (In the MPI runtime, and working)

<li>MPI-Stall (In the MPI runtime, but stalled for some reason)

</ul>

<p>

The interface, as prototyped, only has one state for MPI-Stall, no

matter why the runtime is stalled.  It would be a simple extension to define

multiple MPI-Stall states, one for each reason it is stalled.
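<p>

As a rough illustration, the three states could be represented by an enum like the one below.  Only the type name <tt>OMPI_state_t</tt> comes from this proposal; the enumerator names and the helper <tt>ompi_state_name</tt> are hypothetical and chosen for this sketch.

```c
/* Hypothetical enumerator names: the proposal names only the type OMPI_state_t. */
typedef enum {
    OMPI_STATE_NOT_IN_MPI = 0,  /* thread is running application code */
    OMPI_STATE_MPI_WORK,        /* in the MPI runtime, and working */
    OMPI_STATE_MPI_STALL        /* in the MPI runtime, but stalled */
    /* A finer-grained variant could add one value per stall reason here. */
} OMPI_state_t;

/* Map a state to the label used in this document. */
const char *ompi_state_name(OMPI_state_t state)
{
    switch (state) {
    case OMPI_STATE_NOT_IN_MPI: return "Not in MPI";
    case OMPI_STATE_MPI_WORK:   return "MPI-Work";
    case OMPI_STATE_MPI_STALL:  return "MPI-Stall";
    }
    return "unknown";
}
```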

<p>

MPI-Stall time is accumulated whenever the thread is stalled within

MPI, whether it is busy-waiting (when the process

is also accumulating User CPU Time)

or sleep-waiting (when the process is

<u>not</u> accumulating User CPU Time).  In other words,

MPI-Stall time is a statistical measure of the time the thread spends

blocked inside the MPI library.

<p>

Note that MPI-Stall time is not a measure of time spent in the <tt>MPI_Wait</tt> call.

Some of the time in that call is MPI-Work time, and some of the time spent

in other calls may be MPI-Stall time.

<p>

MPI-Work means that the runtime is doing something other than waiting.

It is not necessarily directly useful work from the point of

view of the application -- it may be what the user would think

of as overhead.

<p>

A profiler can then read the state with each profile tick, and accumulate

a statistical measure of time spent in MPI-Work and time spent in

MPI-Stall.  If callstacks are recorded at the same time as the state

is captured, that MPI state can be attributed to the frames

in the callstack, allowing the user to see which MPI calls are

spending the most time stalled, and how the user got to that call.
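<p>

A minimal sketch of the profiler side, assuming a POSIX system and a single-threaded process, might look like the following.  The reader signature follows the API declaration in this proposal; everything else (the enumerator names, <tt>on_profile_tick</tt>, <tt>start_profiling</tt>, the <tt>always_stalled</tt> stand-in, and passing <tt>NULL</tt> as the argument) is an assumption made for illustration.  Callstack capture is omitted.

```c
#define _XOPEN_SOURCE 700
#include <signal.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical enumerator names; only OMPI_state_t is named in the proposal. */
typedef enum {
    OMPI_STATE_NOT_IN_MPI = 0,
    OMPI_STATE_MPI_WORK   = 1,
    OMPI_STATE_MPI_STALL  = 2
} OMPI_state_t;
#define N_STATES 3

typedef OMPI_state_t (*state_reader_t)(void *);

static state_reader_t read_state;               /* NULL if the runtime has no API */
static volatile unsigned long ticks[N_STATES];  /* profile ticks seen per state */

/* Signal handler: read the state once per tick and count it. */
void on_profile_tick(int sig)
{
    OMPI_state_t s;
    (void)sig;
    if (read_state == NULL)
        return;
    s = read_state(NULL);   /* assumption: the argument is unused */
    if ((unsigned)s < N_STATES)
        ticks[s]++;
}

/* Install the handler and start a CPU-time profiling timer. */
int start_profiling(state_reader_t reader, long interval_usec)
{
    struct sigaction sa;
    struct itimerval tv;

    read_state = reader;
    sa.sa_handler = on_profile_tick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;

    tv.it_interval.tv_sec  = interval_usec / 1000000;
    tv.it_interval.tv_usec = interval_usec % 1000000;
    tv.it_value = tv.it_interval;
    return setitimer(ITIMER_PROF, &tv, NULL);
}

/* Stand-in for the MPI runtime, used only to exercise the handler. */
OMPI_state_t always_stalled(void *unused)
{
    (void)unused;
    return OMPI_STATE_MPI_STALL;
}
```

Multiplying each counter by the tick interval gives the statistical time estimate for that state.  A multi-threaded profiler would need per-thread counters rather than the single global array used here.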

</blockquote>

<h3>Runtime Implementation</h3>

<blockquote>

Implementation requires three things: the maintenance of the state, an API

to read it, and a rendezvous mechanism for the profiler to find the API.

<p>

Maintenance of the state is straightforward.

All threads start out in the Not-in-MPI state.

Whenever an API call is entered, the state for the calling thread

is changed to MPI-Work; whenever an API call returns, the state

is changed back to Not-in-MPI.  During the processing, whenever

the OMPI progress engine determines that no progress is being made

on any outstanding requests, the state is changed to MPI-Stall;

whenever the progress engine detects progress being made, the state

is changed back to MPI-Work.  The work-to-stall transition points

are very implementation-specific, and must be placed carefully

to avoid high overhead.

<p>

The API is quite simple.  There is only one function:

<blockquote>

<tt>OMPI_state_t (*mpi_collector_api)(void*)</tt>

</blockquote>

The return value is an enum representing the current state for the calling thread.

That function must be asynchronous-signal-safe, so that it can be called

from a signal handler processing a profile tick.
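<p>

The state maintenance and the read function described above could be sketched roughly as follows.  This is not the ClusterTools code: the enumerator names and the functions <tt>state_on_api_enter</tt>, <tt>state_on_api_return</tt>, <tt>state_on_progress_pass</tt> and <tt>read_thread_state</tt> are hypothetical, and the meaning of the <tt>void*</tt> argument is not stated in this proposal, so the sketch ignores it.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical enumerator names; only OMPI_state_t is named in the proposal. */
typedef enum {
    OMPI_STATE_NOT_IN_MPI = 0,
    OMPI_STATE_MPI_WORK   = 1,
    OMPI_STATE_MPI_STALL  = 2
} OMPI_state_t;

/* One state variable per thread.  sig_atomic_t lets a signal handler that
 * interrupts this thread read a complete value. */
static _Thread_local volatile sig_atomic_t thread_state = OMPI_STATE_NOT_IN_MPI;

/* Called on entry to, and return from, every MPI API call. */
void state_on_api_enter(void)  { thread_state = OMPI_STATE_MPI_WORK; }
void state_on_api_return(void) { thread_state = OMPI_STATE_NOT_IN_MPI; }

/* Called by the progress loop after each pass, with the number of
 * outstanding requests that advanced during that pass. */
void state_on_progress_pass(int requests_advanced)
{
    sig_atomic_t next = requests_advanced > 0 ? OMPI_STATE_MPI_WORK
                                              : OMPI_STATE_MPI_STALL;
    if (thread_state != next)   /* skip the store when nothing changes */
        thread_state = next;
}

/* Reader with the signature given above: a plain load, with no locks,
 * allocation, or I/O.  The argument is ignored in this sketch. */
OMPI_state_t read_thread_state(void *unused)
{
    (void)unused;
    return (OMPI_state_t)thread_state;
}
```

Whether a thread-local access is itself safe inside a signal handler depends on the platform's thread-local storage model, so a real implementation would need to verify that for its target systems.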

<p>

Rendezvous is also simple.  The profile code asks the runtime linker

if the API described above is present in the process address space.

If so, the functionality is available.  If not, the functionality

is not available.
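<p>

On a POSIX system, the rendezvous could be done with <tt>dlopen</tt> and <tt>dlsym</tt>, as in the sketch below.  It assumes that the exported symbol is named <tt>mpi_collector_api</tt> and refers to a function; the helper <tt>find_state_reader</tt>, the <tt>state_reader_t</tt> typedef, and the <tt>int</tt> stand-in for the state type are hypothetical.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <string.h>

typedef int OMPI_state_t;                        /* stand-in for the runtime's enum */
typedef OMPI_state_t (*state_reader_t)(void *);

/* Return the state reader if the MPI runtime in this process exports it,
 * or NULL if the functionality is not available. */
state_reader_t find_state_reader(void)
{
    state_reader_t reader = NULL;
    void *self;
    void *sym;

    /* A NULL filename yields a handle covering the main program and the
     * shared objects loaded with it. */
    self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL)
        return NULL;

    sym = dlsym(self, "mpi_collector_api");   /* assumed symbol name */
    if (sym != NULL)
        memcpy(&reader, &sym, sizeof reader); /* object-to-function pointer copy */

    dlclose(self);
    return reader;
}
```

A profiler would call this once at startup and skip the MPI state columns entirely when it returns <tt>NULL</tt>.  On older C libraries the program may need to be linked with <tt>-ldl</tt>.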

</blockquote>

<h3>Using the API</h3>

<blockquote>

In the prototype, the Sun Studio Profiler invokes the API, if it

is available, on every clock-profiling tick to read the state.

MPI-Work time is recorded as such, unless the thread is not on CPU,

in which case it is converted to MPI-Stall time.  On Linux, ticks

do not occur when the thread is not on CPU, so the technique is

valid only for busy-waits.
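<p>

The conversion rule above amounts to a small classification step applied to each tick.  In this sketch the enumerator names and <tt>classify_tick</tt> are hypothetical, and how the profiler learns whether the thread was on CPU is platform-specific and left to the caller.

```c
#include <stdbool.h>

/* Hypothetical enumerator names; only OMPI_state_t is named in the proposal. */
typedef enum {
    OMPI_STATE_NOT_IN_MPI = 0,
    OMPI_STATE_MPI_WORK   = 1,
    OMPI_STATE_MPI_STALL  = 2
} OMPI_state_t;

/* Decide how to record one tick.  A thread that reports MPI-Work but was
 * not on CPU is recorded as MPI-Stall; every other case is recorded as
 * reported. */
OMPI_state_t classify_tick(OMPI_state_t reported, bool thread_on_cpu)
{
    if (reported == OMPI_STATE_MPI_WORK && !thread_on_cpu)
        return OMPI_STATE_MPI_STALL;
    return reported;
}
```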

<p>

Overhead in maintaining state is an important consideration.

There is relatively little overhead in Sun's implementation for most

MPI operations, but there is some performance cost.

The worst case we've seen is in the latency for short messages,

where the instrumented runtime shows an increase of about 100-200 ns,

or about 10%.  For that reason, Sun chose to ship

both instrumented and uninstrumented libraries.  We also provide

a flag to mpirun that automatically switches to the instrumented

libraries.

</blockquote>

<h3>Scalability Differences between API Tracing and Clock-Profiling</h3>

<blockquote>

<p>

MPI can be used for very large scale jobs, using hundreds, if not thousands, of

processors and MPI processes.  In such cases, scalability of the performance

measurements becomes significant.

<p>

MPI API tracing records data proportional to the number of calls and messages,

collected from all ranks, and suffers dilation in the same proportion.

If the data volume becomes too great, the time to match the sends and

the receives becomes unacceptable.

Rank-selective collection of MPI API tracing data cannot be done without giving

up on matching sends and receives: if some data from some ranks are

missing, that matching cannot be done.  Likewise, time-selective

API tracing presents problems at the boundaries where part of a

transaction may be present, and part absent.

<p>

On the other hand, the data volume and overhead from clock-profiling depend on the

total time the processes run and the profiling frequency.

Since the data is a statistical sample, rank-selective data collection

is reasonable as long as the selected ranks are representative.

Likewise, time-selective data collection for any representative

interval is reasonable.  Furthermore, data volume and distortion

may also be managed by changing the profiling frequency.
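<p>

To make the scaling argument concrete, the number of clock-profiling samples can be estimated from the three quantities named above.  The helper <tt>profile_samples</tt> is hypothetical, and it assumes one sample per selected rank per tick, which is an upper bound when ticks are not delivered to off-CPU threads.

```c
/* Estimated number of profile samples: selected ranks x run time x tick rate.
 * The count does not depend on how many MPI calls or messages occur. */
unsigned long long profile_samples(unsigned ranks,
                                   unsigned run_seconds,
                                   unsigned ticks_per_second)
{
    return (unsigned long long)ranks * run_seconds * ticks_per_second;
}
```

With purely illustrative numbers, 1024 ranks profiled for 600 seconds at 100 ticks per second give 61,440,000 samples; selecting 64 representative ranks reduces that to 3,840,000, and halving the tick rate halves it again.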

</blockquote>

</BODY>

</HTML>