<html>

<body>

Hi Marty, all,<br><br>

I added this to the wiki - it is now at:<br><br>

<a href="https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/MPI3Tools/state" eudora="autourl">

https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/MPI3Tools/state<br><br>

</a>Martin<br><br>

<br>

At 02:16 PM 12/12/2008, you wrote:<br>

<blockquote type=cite class=cite cite="">I'm attaching a draft of the MPI

State Profiling description.  Can someone<br>

please add it in the right place to the website in time for the meeting

Monday?<br><br>

   Thanks,<br>

      Marty<br><br>

<br>

<div align="center"><br>

<h1><b>MPI State Profiling </b></h1><br><br>

</div>

<br>

<h2><b>Marty Itzkowitz</b></h2><br><br>

<br>

<h3><b>Sun Microsystems</b></h3><br><br>

<br>

<h3><b>Introduction</b></h3><br><br>

<dl>

<dd>Most current profiling for MPI applications is based on tracing API

calls into the MPI runtime library with, for example, VampirTrace. The

data collected shows API calls and the messages that are sent and

received. In general, all trace data from all ranks is needed in order to

be able to match sends and receives, and that causes high data volume and

other scalability issues. <br><br>

<dd>We propose an additional means of profiling MPI applications, based
on statistical sampling of the MPI runtime state. It scales better than
MPI API tracing, and supports rank-selective and time-selective data
collection. <br><br>

<dd>The proposal described here has been implemented in Sun's

ClusterTools 8.1 product, based on Open MPI. Sun will be happy to

contribute the code back to the Open MPI tree. <br><br>

</dl><br>

<h3><b>MPI State Profiling</b></h3><br><br>

<dl>

<dd>The basic idea is to have the MPI runtime maintain a per-thread state

variable, and provide an asynchronous-signal-safe API for a thread to

read its own state. In the prototype implementation, there are three

states: 

<dd>Not-in-MPI (Outside the MPI runtime) 

<dd>MPI-Work (In the MPI runtime, and working) 

<dd>MPI-Stall (In the MPI runtime, but stalled for some reason) <br>

<dd>The interface, as prototyped, only has one state for MPI-Stall, no

matter why the runtime is stalled. It would be a simple extension to

define multiple MPI-Stall states, one for each reason it is stalled.

<br><br>

<dd>MPI-Stall time is accumulated whenever the thread is stalled within
MPI, whether it is busy-waiting (in which case the process is also
accumulating User CPU Time) or sleep-waiting (in which case the process is
<u>not</u> accumulating User CPU Time). In other words, MPI-Stall time is
a statistical measure of the time the process is blocked inside the MPI
library. <br><br>

<dd>Note that MPI-Stall time is not a measure of time spent in the

<tt>MPI_Wait</tt> call. Some of the time in that call is MPI-Work time,

and some of the time spent in other calls may be MPI-Stall time.

<br><br>

<dd>MPI-Work means that the runtime is doing something other than

waiting. It is not necessarily directly-useful work from the point of

view of the application -- it may be what the user would think of as

overhead. <br><br>

<dd>A profiler can then read the state with each profile tick, and

accumulate a statistical measure of time spent in MPI-Work and time spent

in MPI-Stall. If callstacks are recorded at the same time as the state is

captured, that MPI state can be attributed to the frames in the

callstack, allowing the user to see which MPI calls are spending the most

time stalled, and how the user got to that call. <br><br>
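<dd>The accumulation described above can be sketched as follows. This is
only an illustration, not the Sun Studio Profiler's code: the names
<tt>sketch_profile_t</tt>, <tt>sketch_record_tick</tt> and
<tt>sketch_estimated_usec</tt> are hypothetical, as are the numeric values
given to the three states. The estimate is simply the number of ticks seen
in a state multiplied by the tick interval. <br><br>

```c
#include <stdint.h>

/* Hypothetical numbering of the three states named in the draft. */
enum {
    SKETCH_NOT_IN_MPI = 0,
    SKETCH_MPI_WORK   = 1,
    SKETCH_MPI_STALL  = 2,
    SKETCH_NUM_STATES = 3
};

/* Per-state tick counters kept by a (hypothetical) profiler. */
typedef struct {
    uint64_t ticks[SKETCH_NUM_STATES];
} sketch_profile_t;

/* Record one profile tick that observed the given state.
 * Values outside the known range are ignored. */
void sketch_record_tick(sketch_profile_t *p, int state)
{
    if (state >= 0 && state < SKETCH_NUM_STATES)
        p->ticks[state]++;
}

/* Statistical estimate of time spent in a state, in microseconds:
 * ticks observed in that state times the tick interval. */
uint64_t sketch_estimated_usec(const sketch_profile_t *p, int state,
                               uint64_t tick_usec)
{
    return p->ticks[state] * tick_usec;
}
```

<dd>A real profiler would also key these counters by the recorded
callstack, so that the time can be attributed to individual frames; that
part is omitted here. <br><br>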

</dl><br>

<h3><b>Runtime Implementation</b></h3><br><br>

<dl>

<dd>Implementation requires three things: the maintenance of the state,

an API to read it, and a rendezvous mechanism for the profiler to find

the API. <br><br>

<dd>Maintenance of the state is straightforward. Every thread starts out
in the Not-in-MPI state. Whenever an API call is entered, the state for
the calling thread is changed to MPI-Work; whenever an API call returns,
the state is changed back to Not-in-MPI. During processing, whenever the
OMPI progress engine determines that no progress is being made on any
outstanding requests, the state is changed to MPI-Stall; whenever the
progress engine detects progress being made, the state is changed back to
MPI-Work. The work-to-stall transition points are very
implementation-specific, and the transitions must be implemented carefully
to avoid high overhead. <br><br>
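<dd>A minimal sketch of these transitions follows. It is not the Open MPI
or ClusterTools code: the hook names, the enum values, and the choice of a
C11 thread-local <tt>sig_atomic_t</tt> variable are all assumptions made
for illustration, and nested API calls are ignored. The progress hook
writes the variable only when the state actually changes, which is one
possible way to keep the cost of the transition points low. <br><br>

```c
#include <signal.h>

/* Hypothetical encoding of the three states named in the draft. */
typedef enum {
    STATE_NOT_IN_MPI = 0,
    STATE_MPI_WORK   = 1,
    STATE_MPI_STALL  = 2
} sketch_state_t;

/* One state variable per thread (C11 thread-local storage).
 * Every thread starts out in the Not-in-MPI state. */
static _Thread_local volatile sig_atomic_t sketch_state = STATE_NOT_IN_MPI;

/* Hypothetical hook: called on entry to any MPI API function. */
void sketch_api_enter(void)
{
    sketch_state = STATE_MPI_WORK;
}

/* Hypothetical hook: called just before an MPI API function returns. */
void sketch_api_exit(void)
{
    sketch_state = STATE_NOT_IN_MPI;
}

/* Hypothetical hook: called by the progress engine after each pass.
 * made_progress is nonzero if any outstanding request advanced. */
void sketch_progress_result(int made_progress)
{
    sig_atomic_t next = made_progress ? STATE_MPI_WORK : STATE_MPI_STALL;
    if (sketch_state != next)   /* write only on an actual transition */
        sketch_state = next;
}

/* Read the calling thread's own state with a single load. */
sketch_state_t sketch_read_state(void)
{
    return (sketch_state_t)sketch_state;
}
```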

<dd>The API is quite simple. There is only one function: <br>

<dl>

<dd><tt>OMPI_state_t (*mpi_collector_api)(void*)</tt> <br>

<br>

</dl>

<dd>The return value is an enum representing the current state for the

calling thread. That function must be asynchronous-signal-safe, so that

it can be called from a signal handler processing a profile tick.

<br><br>

<dd>Rendezvous is also simple. The profile code asks the runtime linker

if the API described above is present in the process address space. If

so, the functionality is available. If not, the functionality is not

available. <br><br>
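<dd>The rendezvous might look like the sketch below, which uses the POSIX
<tt>dlopen</tt>/<tt>dlsym</tt> interface to ask the runtime linker for a
symbol by name. Several things here are assumptions: that the symbol is
exported as a function (a profiler would pass the name
<tt>"mpi_collector_api"</tt>), that <tt>int</tt> can stand in for
<tt>OMPI_state_t</tt>, whose definition is not given in this draft, and
that passing <tt>NULL</tt> as the argument is acceptable. The helper names
are hypothetical. <br><br>

```c
#include <dlfcn.h>
#include <stddef.h>

/* Stand-in for OMPI_state_t, whose definition is not given in the draft. */
typedef int sketch_state_t;

/* Assumed shape of the entry point: opaque pointer in, state out. */
typedef sketch_state_t (*sketch_state_fn)(void *);

/* Ask the runtime linker whether `symbol` is present in this process.
 * Returns NULL when the symbol, and so the functionality, is absent. */
sketch_state_fn sketch_find_state_api(const char *symbol)
{
    sketch_state_fn fn = NULL;
    /* Handle for the main program and the libraries loaded with it. */
    void *self = dlopen(NULL, RTLD_LAZY);
    if (self != NULL) {
        /* POSIX idiom for storing dlsym's void* in a function pointer. */
        *(void **)(&fn) = dlsym(self, symbol);
        dlclose(self);
    }
    return fn;
}

/* Read the state if the API was found; otherwise report `fallback`.
 * Passing NULL as the argument is an assumption of this sketch. */
sketch_state_t sketch_read_state_or(sketch_state_fn fn,
                                    sketch_state_t fallback)
{
    return fn != NULL ? fn(NULL) : fallback;
}
```

<dd>The lookup would be done once, outside the signal handler; only the
call through the returned pointer needs to be asynchronous-signal-safe.
On some systems the program must be linked with <tt>-ldl</tt>. <br><br>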

</dl><br>

<h3><b>Using the API</b></h3><br><br>

<dl>

<dd>In the prototype, the Sun Studio Profiler invokes the API, if it is
available, on every clock-profiling tick to read the state. MPI-Work

time is recorded as such, unless the thread is not on CPU, in which case

it is converted to MPI-Stall time. On Linux, ticks do not occur when the

thread is not on CPU, so the technique is valid only for busy-waits.

<br><br>
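<dd>The conversion rule above amounts to a small classification step per
tick. The sketch below is illustrative only; the type and function names
are hypothetical, and how a profiler learns whether the thread was on CPU
is platform-specific and not shown. <br><br>

```c
/* Hypothetical encoding of the three states named in the draft. */
typedef enum {
    TICK_NOT_IN_MPI = 0,
    TICK_MPI_WORK   = 1,
    TICK_MPI_STALL  = 2
} tick_state_t;

/* Decide how to record one tick.
 * reported: the state returned by the runtime's API.
 * on_cpu:   nonzero if the thread was on CPU when the tick occurred.
 * MPI-Work observed while the thread is off CPU is recorded as
 * MPI-Stall; every other combination is recorded as reported. */
tick_state_t sketch_classify_tick(tick_state_t reported, int on_cpu)
{
    if (reported == TICK_MPI_WORK && !on_cpu)
        return TICK_MPI_STALL;
    return reported;
}
```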

<dd>Overhead in maintaining state is an important consideration. There is
relatively little overhead in Sun's implementation for most MPI
operations, but there is some performance cost. The worst case we've seen
is in the latency for short messages, where the instrumented runtime
shows an increase of about 100-200 ns, or about 10%. For that reason, Sun

chose to ship both instrumented and uninstrumented libraries. We also

provide a flag to mpirun that automatically switches to the instrumented

libraries. <br><br>

</dl><br>

<h3><b>Scalability Differences between API Tracing and

Clock-Profiling</b></h3><br><br>

<dl>

<dd>MPI can be used for very large scale jobs, using hundreds, if not

thousands, of processors and MPI processes. In such cases, scalability of

the performance measurements becomes significant. <br><br>

<dd>MPI API tracing records data proportional to the number of calls and

messages, collected from all ranks, and suffers dilation in the same

proportion. If the data volume becomes too great, the time to match the

sends and the receives becomes unacceptable. Rank-selective profiling of

MPI API tracing data cannot be done without giving up on matching sends

and receives: if some data from some ranks are missing, that matching

cannot be done. Likewise, time-selective API tracing presents problems at

the boundaries where part of a transaction may be present, and part

absent. <br><br>

<dd>On the other hand, the data volume and overhead from clock-profiling
depend on the total time the processes run and the profiling frequency.
Since the data is a statistical sample, rank-selective data collection is
reasonable as long as the selected ranks are representative. Likewise,
time-selective data collection for any representative interval is

reasonable. Furthermore, data volume and distortion may also be managed

by changing the profiling frequency. <br><br>
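<dd>As a rough illustration of that dependence, the sketch below computes
an upper-bound estimate of the number of samples and the resulting data
size. The function names are hypothetical, and the record size is a free
parameter, since this draft does not say how large a sample record is.
<br><br>

```c
#include <stdint.h>

/* Upper-bound sample count: one record per tick, per thread, per
 * profiled rank. Fewer records result if ticks are not delivered
 * while a thread is off CPU. */
uint64_t sketch_sample_count(uint64_t ranks, uint64_t threads_per_rank,
                             uint64_t run_seconds,
                             uint64_t ticks_per_second)
{
    return ranks * threads_per_rank * run_seconds * ticks_per_second;
}

/* Data volume in bytes for a given (assumed) record size. */
uint64_t sketch_data_bytes(uint64_t samples, uint64_t bytes_per_record)
{
    return samples * bytes_per_record;
}
```

<dd>With purely hypothetical numbers, 1024 single-threaded ranks profiled
for 600 seconds at 100 ticks per second give at most 61,440,000 samples.
Profiling only 64 representative ranks reduces that to 3,840,000, and
halving the profiling frequency halves it again. <br><br>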

</dl>_______________________________________________<br>

Mpi3-tools mailing list<br>

Mpi3-tools@lists.mpi-forum.org<br>

<a href="http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-tools" eudora="autourl">http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-tools</a></blockquote>

<x-sigsep><p></x-sigsep>

_______________________________________________________________________<br>

Martin Schulz, schulzm@llnl.gov,

<a href="http://people.llnl.gov/schulz6" eudora="autourl">

http://people.llnl.gov/schulz6<br>

</a>CASC @ Lawrence Livermore National Laboratory, Livermore, USA

</body>

</html>