<big>'''Linux kernel profiling with <tt>perf</tt>'''</big><br />
<br />
__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple command-line interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on an Ubuntu 11.04<br />
system with kernel 2.6.38-8-generic, running on an HP 6710b with a dual-core Intel Core2 T7100 CPU.<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage build-id cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some events are pure kernel counters; in this case they are<br />
called '''software events'''. Examples include context-switches and minor-faults.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events covering many micro-architectural aspects,<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
event monikers. On each processor, those monikers get mapped<br />
onto actual events provided by the CPU, if they exist; otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
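<br />
Tracepoint events can be counted like any other event. For instance, a minimal sketch counting context switches via the scheduler tracepoint, system-wide, for one second (the <tt>sched:sched_switch</tt> tracepoint is assumed to be present on this kernel):<br />
<pre><br />
perf stat -e sched:sched_switch -a sleep 1<br />
</pre><br />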
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware<br />
event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides some short description of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and presented on standard<br />
output at the end<br />
of an application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle).<br />
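<br />
For instance, the IPC value shown above is simply the ratio of the two hardware counts and can be reproduced by hand:<br />
<pre><br />
echo "scale=3; 1403561257 / 2066201729" | bc<br />
.679<br />
</pre><br />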
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
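<br />
Similarly, to restrict counting to the kernel level only:<br />
<pre><br />
perf stat -e cycles:k dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />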
<br />
==== Modifiers ====<br />
<br />
Modifiers have a type, indicated in parentheses in the table below.<br />
The type determines the valid values. The value is passed after the equal sign (no space).<br />
Booleans accept <tt>0, 1, y, n</tt>. To set a boolean modifier to true, it is possible to use <tt>u=1</tt> or<br />
simply <tt>u</tt>. Integers may have range restrictions; see the <tt>c</tt> modifier in the example below.<br />
Note: When using '''hardware''' events, e.g., <tt>cycles</tt>, only the <tt>u</tt> and <tt>k</tt> modifiers<br />
are accepted. To measure at both user and kernel level use <tt>cycles:uk</tt>. In other words, there<br />
is no colon separator between the modifiers.<br />
<br />
To measure a PMU event and pass unit masks and modifiers:<br />
<pre><br />
perf stat -e inst_retired:any_p:u:c=1:i dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
In this example, we are measuring the number of cycles at the user level in which<br />
fewer (<tt>i</tt>) than 1 (<tt>c=1</tt>) instruction is retired per cycle. Note that for actual events, the available modifiers depend on the underlying PMU model.<br />
All modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{| border="1"<br />
! Modifiers<br />
! Type<br />
! Description<br />
! Example<br />
|- <br />
|u || boolean || monitor at priv level 3, 2, 1 (user)|| event:u=1 or event:u<br />
|- <br />
|k || boolean || monitor at priv level 0 (kernel) || event:k=1 or event:k<br />
|- <br />
|c || integer || threshold monitoring: number of cycles when n or more occurrences of event occur || event:c=2<br />
|- <br />
|i || boolean || invert the test of threshold: number of cycles in which less than n occurrences of the event occur|| event:c=2:i<br />
|- <br />
|e || boolean || edge detect, increment the counter only when the condition goes from false -> true || event:e or event:e=1<br />
|}<br />
<br />
==== Hardware events ====<br />
<br />
To measure an actual PMU event as described in the hardware vendor's documentation, pass its hexadecimal event code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
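<br />
On x86 processors, the raw descriptor is generally the unit mask concatenated with the event select code, i.e., (umask << 8) | event_select; treat the following arithmetic as a sketch and consult 'perf list --help' and the vendor manuals cited above for the authoritative encoding:<br />
<pre><br />
# event select 0xA8 with unit mask 0x01 yields the r1a8 descriptor used above<br />
printf "r%x\n" $(( (0x01 << 8) | 0xa8 ))<br />
r1a8<br />
</pre><br />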
<br />
==== Multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit on the number of events that can be provided. If there are more<br />
events than there are actual hw counters, the kernel will automatically multiplex them. There<br />
is no limit on the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
<br />
However, given that there is one file descriptor used per event, either per-thread (per-thread mode)<br />
or per-cpu (system-wide), it is possible to reach the maximum number of open file descriptors per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== Multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
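<br />
As a worked example, the earlier <tt>perf stat</tt> output reported '5,099 cache-misses (scaled from 66.58%)': the event was live for roughly 66.58% of the run, so the reported value is the raw count divided by that fraction (the raw count of about 3,395 is inferred here, not printed by the tool):<br />
<pre><br />
echo "scale=0; 3395 / 0.6658" | bc<br />
5099<br />
</pre><br />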
<br />
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{| border="1"<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|-<br />
|Intel Core || 2 || 3<br />
|- <br />
|Intel Nehalem|| 4 || 3<br />
|}<br />
<br />
Generic counters can measure any event. Fixed counters can each only measure one specific event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,946,289 cycles (scaled from 74.98%)<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
There was multiplexing and thus scaling.<br />
It can be interesting to pack events in a way that<br />
guarantees that events A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the current <tt>perf</tt> tool does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and obtain, for each count,<br />
the mean and the standard deviation across runs.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with the ratio of standard deviation to mean, is printed.<br />
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In ''per-thread'' mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrates from one<br />
processor to another, counters are saved on the current processor and are restored<br />
on the new one.<br />
<br />
The ''per-process'' mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In ''per-cpu'' mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event only monitors one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
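<br />
For instance, to count cycles for a build driver while excluding the child processes it spawns (the <tt>make</tt> workload is just an illustration):<br />
<pre><br />
perf stat -i -e cycles make<br />
</pre><br />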
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
that runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
<tt>/usr/bin/sleep</tt> command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUs using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, but we can limit to CPU 1.<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the number of counted<br />
cycles and instructions are both halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires the permission<br />
to attach along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be passed<br />
the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
It is possible to attach to a specific thread within a process. By thread, we mean a kernel-visible thread,<br />
in other words, a thread visible to the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, the thread 932 did not run during the 2s of the measurement. Otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended. Given that kernel threads tend<br />
to be pinned to a specific CPU, it is best to use the cpu-wide mode.<br />
<br />
<br />
=== Options controlling output ===<br />
<tt>perf stat</tt> can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
For most people, it is hard to read large numbers. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that, the <tt>-B</tt><br />
option must be used and the correct locale for <tt>LC_NUMERIC</tt> must be set. As the above example showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes it easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
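<br />
Such output is easy to post-process. For example, a sketch extracting the cycle count with standard tools (note that <tt>perf stat</tt> writes its counts to stderr):<br />
<pre><br />
perf stat -x, -e cycles date 2>&1 | awk -F, '/cycles/ { print $1 }'<br />
</pre><br />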
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on a per-thread, per-process or per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly on another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
The way perf_events emulates a 64-bit counter is limited to expressing the sampling period<br />
using the number of bits of the actual hardware counter. If the period does not fit, the kernel '''silently''' truncates<br />
it. Therefore, it is best if the period is always smaller than 2^31 when running<br />
on 32-bit systems.<br />
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e. where was the program when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e.,<br />
where it was at the end of the sampling period. In some cases, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. When the program cannot<br />
make forward progress, those two locations are indeed identical. ''For this reason, care must be taken<br />
when interpreting profiles''.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt> but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycles</tt> event does not count when the processor is idle, i.e.,<br />
when it calls <tt>mwait()</tt>.<br />
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/sec (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
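<br />
For example, to save samples to a custom file and analyze that file explicitly (the <tt>-i</tt> option of <tt>perf report</tt> selects the input file, as also shown further below):<br />
<pre><br />
perf record -o noploop.data ./noploop 1<br />
perf report -i noploop.data<br />
</pre><br />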
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period, instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>instructions</tt>, at the user level<br />
only:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are collected for all threads executing on the monitored<br />
CPU. To switch <tt>perf record</tt> to per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict to a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUs with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... .............................. .....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e., the level at which the program was running when it was interrupted:<br />
<br />
* [.]: user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest OS user space<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
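<br />
Sort keys can also be combined in a comma-separated list; for instance, to break samples down by command and then shared object (a hedged example):<br />
<pre><br />
perf report --sort=comm,dso<br />
</pre><br />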
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character:<br />
<pre><br />
perf report -t<br />
</pre><br />
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When the symbol is printed as a hexadecimal address, this is because the ELF image does not<br />
have a symbol table. This happens when binaries are stripped.<br />
We can sort by cpu as well. This could be useful to determine if the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples for the annotated function captured at that instruction.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate source code level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
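<br />
For reference, producing the annotated source above only requires building the workload with debugging information, along these lines (the source file name is hypothetical):<br />
<pre><br />
gcc -ggdb -o noploop noploop.c<br />
</pre><br />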
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
 PerfTop: 260 irqs/sec kernel:61.5% exact: 0.0% [1000Hz cycles], (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycles</tt> event counts CPU cycles when the<br />
processor is not in halted state, i.e. not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The design of the perf_events kernel interface, which is used by the perf tool, is such that it uses one file descriptor<br />
per event, per-thread or per-cpu.<br />
<br />
On a 16-way system, when you count in system-wide mode:<br />
<pre><br />
perf stat -e cycles -a sleep 1<br />
</pre><br />
You are effectively creating 16 events, one per CPU, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
<br />
In case perf aborts with a 'too many open files' error, there are a few solutions:<br />
<br />
* increase the number of per-process open files using <tt>ulimit -n</tt>. Caveat: you must be root<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring<br />
<br />
==== increasing open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> file unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes those of all processes running on the system. Those unique identifiers are generated by the linker if<br />
the <tt>-Wl,--build-id</tt> option is used. Thus, they are called <tt>build-id</tt>s.<br />
The <tt>build-id</tt> is a helpful tool when correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
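<br />
When a <tt>perf.data</tt> file is to be analyzed on another machine, the <tt>archive</tt> command listed earlier can bundle the ELF images recorded by <tt>build-id</tt>; a sketch of the round trip (host name and paths are illustrative):<br />
<pre><br />
perf archive<br />
scp perf.data perf.data.tar.bz2 otherhost:<br />
# on otherhost:<br />
tar xvf perf.data.tar.bz2 -C ~/.debug<br />
perf report<br />
</pre><br />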
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
<br />
Given that <tt>build-id</tt>s are immutable, they uniquely identify a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory which is by default $HOME/.debug. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which can be used by sysadmins to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
<br />
In certain situations it may be beneficial to turn off the <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt><br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
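<br />
On recent kernels, unprivileged access is governed by the <tt>/proc/sys/kernel/perf_event_paranoid</tt> sysctl. As root, the restriction can be relaxed; the exact value semantics vary by kernel version, so treat this as a sketch:<br />
<pre><br />
cat /proc/sys/kernel/perf_event_paranoid<br />
1<br />
echo -1 > /proc/sys/kernel/perf_event_paranoid<br />
</pre><br />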
<br />
== Other Resources ==<br />
<br />
=== Linux sourcecode ===<br />
The <tt>perf tools</tt> source code lives in the Linux kernel tree under [http://lxr.linux.no/linux+v2.6.39/tools/perf/ <tt>tools/perf</tt>]. You will find much more documentation in [http://lxr.linux.no/linux+v2.6.39/tools/perf/Documentation/ <tt>tools/perf/Documentation</tt>]. To build manpages, info pages and more, install these tools:<br />
<br />
* asciidoc<br />
* tetex-fonts<br />
* tetex-dvips<br />
* dialog<br />
* tetex<br />
* tetex-latex<br />
* xmltex<br />
* passivetex<br />
* w3m<br />
* xmlto<br />
<br />
and issue a <tt>make install-man</tt> from <tt>/tools/perf</tt>. This step is also required to <br />
be able to run <tt>perf help <command></tt>.<br />
<br />
----<br />
<br />
This guide is adapted from a tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ CreativeCommons attribution sharealike 3.0 license].<br />
----<br />
<br />
== <tt>perf:</tt> Linux profiling with performance counters ==<br />
''...More than just counters...''<br />
<br />
=== Introduction ===<br />
<br />
This is the wiki page for the <tt>perf</tt> performance counters subsystem in Linux.<br />
Performance counters are CPU hardware registers that <br />
count hardware events such <br />
as instructions executed, cache-misses suffered, or branches mispredicted. They form<br />
a basis for profiling applications to trace dynamic control flow and identify hotspots. <br />
<br />
<tt>perf</tt> provides rich generalized abstractions over hardware specific<br />
capabilities. Among others, it provides per task, per CPU and per-workload counters,<br />
sampling on top of these and source code event annotation.<br />
<br />
The userspace <tt>perf tools</tt> present a simple to use interface with commands like<br />
<br />
* [[Tutorial#Counting_with_perf_stat|<tt>perf stat</tt>]]: obtain event counts<br />
* [[Tutorial#Sampling_with_perf_record|<tt>perf record</tt>]]: record events for later reporting<br />
* [[Tutorial#Sample_analysis_with_perf_report|<tt>perf report</tt>]]: break down events by process, function, etc.<br />
* [[Tutorial#Source_level_analysis_with_perf_annotate|<tt>perf annotate</tt>]]: annotate assembly or source code with event counts<br />
* [[Tutorial#Live_analysis_with_perf_top|<tt>perf top</tt>]]: see live event counts<br />
<br />
To learn more, see the examples in the <tt>[[Tutorial]]</tt>.<br />
<br />
=== Wiki Contents ===<br />
<br />
* [[Tutorial]]<br />
* [[Todo]]<br />
* [[HardwareReference]]<br />
<br />
----<br />
<br />
<big>'''HardwareReference'''</big><br />
<br />
The capabilities of perf depend on the features of the CPU. The following lists the features of common modern architectures.<br />
<br />
* Performance Monitoring Units (PMUs)<br />
** [[Nehalem | Intel(TM) x86 Nehalem PMU]]<br />
** [[Montecito | Intel(TM) Itanium(TM) 2 PMU]]<br />
* Performance Counters for Linux<br />
** [[PCLstruct| PCL core kernel data structures]]<br />
** [[PCL internals | PCL core kernel internals]]<br />
** [[perf internals | perf tool internals]]<br />
<hr />
<div><big>'''Linux kernel profiling with <tt>perf</tt>'''</big><br />
<br />
__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple commandline interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on a Ubuntu 11.04<br />
system with<br />
kernel 2.6.38-8-generic results running on an HP 6710b with dual-core Intel Core2 T7100 CPU).<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage build-id cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some events are pure kernel counters; in this case they are<br />
called '''software events'''. Examples include context-switches and minor-faults.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events to measure micro-architectural events<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
event monikers. On each processor, those events get mapped<br />
onto actual events provided by the CPU, if they exist; otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware<br />
event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
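<br />
As a concrete illustration, tracepoint events can be counted like any other event. The following sketch counts scheduler context switches system-wide for one second (assuming the <tt>sched:sched_switch</tt> tracepoint is available and the tool is run with sufficient privileges):<br />
<pre><br />
perf stat -e sched:sched_switch -a sleep 1<br />
</pre><br />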
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides some short description of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and presented on standard<br />
output at the end<br />
of an application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle).<br />
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
==== Modifiers ====<br />
<br />
Modifiers have a type, indicated in the table below.<br />
The type determines the valid values. The value is passed after the equal sign (no space).<br />
Booleans accept <tt>0, 1, y, n</tt>. To set a boolean modifier to true, it is possible to use <tt>u=1</tt> or<br />
simply <tt>u</tt>. Integers may have range restrictions; see the <tt>c</tt> modifier in the example below.<br />
Note: When using '''hardware''' events, e.g., <tt>cycles</tt>, only the <tt>u</tt> and <tt>k</tt> modifiers<br />
are accepted. To measure at both user and kernel level use <tt>cycles:uk</tt>. In other words, there<br />
is no colon separator between the modifiers.<br />
<br />
To measure a PMU event and pass unit masks and modifiers:<br />
<pre><br />
perf stat -e inst_retired:any_p:u:c=1:i dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
In this example, we are measuring the number of cycles at the user level in which<br />
fewer (i) than 1 (c=1) instruction is retired per cycle. Note that for actual events, the modifiers depend on the underlying PMU model.<br />
All modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{| border="1"<br />
! Modifiers<br />
! Type<br />
! Description<br />
! Example<br />
|- <br />
|u || boolean || monitor at priv level 3, 2, 1 (user)|| event:u=1 or event:u<br />
|- <br />
|k || boolean || monitor at priv level 0 (kernel) || event:k=1 or event:k<br />
|- <br />
|c || integer || threshold monitoring: number of cycles when n or more occurrences of event occur || event:c=2<br />
|- <br />
|i || boolean || invert the test of threshold: number of cycles in which less than n occurrences of the event occur|| event:c=2:i<br />
|- <br />
|e || boolean || edge detect, increment the counter only when the condition goes from false -> true || event:e or event:e=1<br />
|}<br />
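<br />
For instance, using the <tt>k</tt> modifier from the table above, a sketch that counts only the cycles spent in the kernel on behalf of the command:<br />
<pre><br />
perf stat -e cycles:k dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />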
<br />
==== Hardware events ====<br />
<br />
To measure an actual PMU event as provided by the HW vendor documentation, pass the hexadecimal event code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
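<br />
Raw events are selected with <tt>-e</tt> like symbolic events, so they can be mixed in a single run. A sketch (the <tt>r1a8</tt> encoding is CPU specific, as above):<br />
<pre><br />
perf stat -e cycles,r1a8 -a sleep 1<br />
</pre><br />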
<br />
==== multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit on the number of events that can be provided. If there are more<br />
events than there are actual hw counters, the kernel will automatically multiplex them. There<br />
is no limit on the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
<br />
However, given that one file descriptor is used per event, either per-thread (per-thread mode)<br />
or per-cpu (system-wide), it is possible to reach the maximum number of open file descriptors per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
<br />
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{| border="1"<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|-<br />
|Intel Core || 2 || 3<br />
|- <br />
|Intel Nehalem|| 4 || 3<br />
|}<br />
<br />
Generic counters can measure any events. Fixed counters can only measure one event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
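<br />
The <tt>noploop</tt> program used in the examples below is not part of <tt>perf</tt>. A minimal sketch, reconstructed from the source lines visible in the <tt>perf annotate</tt> output later in this tutorial (<tt>stdlib.h</tt> is added here for <tt>strtol()</tt>; compile without optimization so the busy loop is not optimized away):<br />
<pre><br />
#include <stdlib.h><br />
#include <string.h><br />
#include <unistd.h><br />
#include <sys/time.h><br />
<br />
/* Busy-loop for the number of seconds given as argv[1]. */<br />
int main(int argc, char **argv)<br />
{<br />
        unsigned long count;<br />
        struct timeval tv_now, tv_end;<br />
<br />
        gettimeofday(&tv_now, NULL);<br />
        memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
        tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
        while (tv_now.tv_sec < tv_end.tv_sec ||<br />
               tv_now.tv_usec < tv_end.tv_usec) {<br />
                count = 0;<br />
                while (count < 100000000UL)<br />
                        count++;<br />
                gettimeofday(&tv_now, NULL);<br />
        }<br />
        return 0;<br />
}<br />
</pre><br />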
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,946,289 cycles (scaled from 74.98%)<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
There was multiplexing and thus scaling.<br />
It can be interesting to try and pack events in a way that<br />
guarantees that event A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the current <tt>perf</tt> tool does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and obtain, for each count,<br />
the mean and its standard deviation across runs.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with the ratio of std-dev to mean, is printed.<br />
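<br />
The <tt>-r</tt> option combines with the usual event selection. For instance, a sketch that repeats a run ten times while counting user-level cycles only:<br />
<pre><br />
perf stat -r 10 -e cycles:u ./noploop 1<br />
</pre><br />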
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In ''per-thread'' mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrates from one<br />
processor to another, counters are saved on the current processor and are restored<br />
on the new one.<br />
<br />
The ''per-process'' mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In ''per-cpu'' mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event only monitors one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
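<br />
As an illustration of <tt>-i</tt>, the following sketch counts instructions for a shell but not for the <tt>dd</tt> child it spawns (the trailing <tt>true</tt> forces the shell to fork for <tt>dd</tt> rather than exec it):<br />
<pre><br />
perf stat -i -e instructions sh -c 'dd if=/dev/zero of=/dev/null count=100000; true'<br />
</pre><br />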
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
that runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
<tt>/usr/bin/sleep</tt> command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUs using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, but we can limit to CPU 1.<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the numbers of counted<br />
cycles and instructions are both roughly halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires the permission<br />
to attach along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be used with<br />
the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
It is possible to attach to a specific thread within a process. By thread, we mean a kernel-visible thread,<br />
in other words, a thread visible to the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, the thread 932 did not run during the 2s of the measurement. Otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended. Given that kernel threads tend<br />
to be pinned to a specific CPU, it is best to use the cpu-wide mode.<br />
<br />
<br />
=== Options controlling output ===<br />
<tt>perf stat</tt> can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
For most people, it is hard to read large numbers. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that, the <tt>-B</tt><br />
option must be used and the correct locale for <tt>LC_NUMERIC</tt> must be set. As the above example showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes it easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
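<br />
The delimiter and the events can be chosen freely. For example, a sketch producing semicolon-separated counts for two explicit events:<br />
<pre><br />
perf stat -x ';' -e cycles:u,instructions:u date<br />
</pre><br />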
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on a per-thread, per-process or per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly on another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
The way perf_events emulates a 64-bit counter limits the sampling period to what can be expressed<br />
in the number of bits of the actual hardware counter. If that width is smaller than 64 bits, the kernel '''silently''' truncates<br />
the period. Therefore, it is best to keep the period smaller than 2^31 when running<br />
on 32-bit systems.<br />
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e. where was the program when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e.,<br />
where it was at the end of the sampling period. In some cases, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. When the program cannot<br />
make forward progress, those two locations are indeed identical. ''For this reason, care must be taken<br />
when interpreting profiles''.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt> but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycles</tt> event does not count when the processor is idle, i.e.,<br />
when it calls <tt>mwait()</tt>.<br />
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/sec (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>retired_instructions</tt> at the user level<br />
only:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are collected for all threads executing on the monitored<br />
CPU. To switch <tt>perf record</tt> to per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict to a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUs with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
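<br />
As with <tt>perf stat</tt>, the <tt>-C</tt> option narrows system-wide sampling to specific CPUs. A sketch that samples only CPU 0 for 5s:<br />
<pre><br />
perf record -a -C 0 -F 1000 sleep 5<br />
</pre><br />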
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... .............................. .....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e., where the program was running when it was interrupted:<br />
<br />
* [.]: user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest OS user space<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
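<br />
Sort keys can also be combined with commas. For instance, a sketch that groups samples first by process and then by shared object:<br />
<pre><br />
perf report --sort=comm,dso<br />
</pre><br />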
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character, passed as the argument of <tt>-t</tt>:<br />
<pre><br />
perf report -t ,<br />
</pre><br />
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When a symbol is printed as a hexadecimal address, this is because the ELF image does not<br />
have a symbol table. This happens when binaries are stripped.<br />
We can sort by cpu as well. This could be useful to determine if the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples captured at that instruction, here for the <tt>main()</tt> function of <tt>noploop</tt>.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate source-code-level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
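<br />
For reference, a binary like the one above can be produced by recompiling with debugging information, e.g. (assuming gcc):<br />
<pre><br />
gcc -ggdb -O0 -o noploop noploop.c<br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
</pre><br />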
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 260 irqs/sec kernel:61.5% exact: 0.0% [1000Hz cycles], (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycles</tt> event counts CPU cycles when the<br />
processor is not in a halted state, i.e., not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
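<br />
<tt>perf top</tt> is not limited to <tt>cycles</tt>; as elsewhere, the <tt>-e</tt> option selects the sampling event. A sketch that watches cache misses live:<br />
<pre><br />
perf top -e cache-misses<br />
</pre><br />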
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The design of the perf_event kernel interface, which is used by the perf tool, is such that it uses one file descriptor<br />
per event, per-thread or per-cpu.<br />
<br />
On a 16-way system, when you do:<br />
<pre><br />
perf stat -e cycles sleep 1<br />
</pre><br />
You are effectively creating 16 events, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
<br />
In case perf aborts with 'too many open files' error, there are a few solutions:<br />
<br />
* increase the number of per-process open files using <tt>ulimit -n</tt> (caveat: you must be root)<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring<br />
<br />
==== increasing open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes all processes running on the system. Those unique identifiers are generated by the linker if<br />
the <tt>-Wl,--build-id</tt> option is used. Thus, they are called <tt>build-id</tt>s.<br />
<tt>build-id</tt>s are helpful when correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
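<br />
The <tt>perf archive</tt> command listed earlier builds on these identifiers: it bundles the object files with <tt>build-id</tt>s referenced by a <tt>perf.data</tt> file so the data can be analyzed on another machine. A sketch:<br />
<pre><br />
perf archive perf.data<br />
</pre><br />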
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
<br />
Given that <tt>build-id</tt>s are immutable, they uniquely identify a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory which is by default <tt>$HOME/.debug</tt>. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which can be used by sysadmins to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
<br />
In certain situations it may be beneficial to turn off the <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt>:<br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
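<br />
In that case, rerun the measurement with sufficient privileges, for example:<br />
<pre><br />
sudo perf stat -a -e cycles sleep 1<br />
</pre><br />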
<br />
== Other Resources ==<br />
<br />
=== Linux sourcecode ===<br />
The <tt>perf tools</tt> source code lives in the Linux kernel tree under [http://lxr.linux.no/linux+v2.6.39/tools/perf/ <tt>/tools/perf</tt>]. You will find much more documentation in [http://lxr.linux.no/linux+v2.6.39/tools/perf/Documentation/ <tt>/tools/perf/Documentation</tt>]. To build manpages, info pages and more, install these tools:<br />
<br />
* asciidoc<br />
* tetex-fonts<br />
* tetex-dvips<br />
* dialog<br />
* tetex<br />
* tetex-latex<br />
* xmltex<br />
* passivetex<br />
* w3m<br />
* xmlto<br />
<br />
and issue a <tt>make install-man</tt> from <tt>/tools/perf</tt>. This step is also required to <br />
be able to run <tt>perf help <command></tt>.<br />
<br />
----<br />
<br />
This guide is adapted from an earlier tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ Creative Commons Attribution-ShareAlike 3.0 license].</div>
<hr />
<div><big>'''Todo'''</big><br />
<br />
=== Perf tools ===<br />
<br />
* Factorize the multidimensional sorting between perf report and annotate (will be used by perf trace)<br />
* Implement a perf cmp (profile comparison between two perf.data)<br />
* Implement a perf view (GUI)<br />
* Enhance perf trace:<br />
** Handle the cpu field<br />
** Handle the timestamp<br />
** Use the in-perf ip -> symbol resolving<br />
** Use the in-perf pid -> cmdline resolving<br />
** Implement multidimensional sorting by field name</div>Willembhttps://perf.wiki.kernel.org/index.php/TutorialTutorial2011-06-29T15:12:20Z<p>Willemb: new title</p>
<hr />
<div><big>'''Linux kernel profiling with <tt>perf</tt>'''</big><br />
<br />
__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple commandline interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on a Ubuntu 11.04<br />
system with<br />
kernel 2.6.38-8-generic results running on an HP 6710b with dual-core Intel Core2 T7100 CPU).<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage <tt>build-id</tt> cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some event are pure kernel counters, in this case they are<br />
called '''software events'''. Examples include: context-switches, minor-fault.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events to measure micro-architectural events<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
events monikers. On each processor, those events get mapped<br />
onto an actual events provided by the CPU, if they exists, otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides short descriptions of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and presented on standard<br />
output at the end<br />
of an application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle).<br />
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
==== Modifiers ====<br />
<br />
Each modifier has a type, shown in the table below.<br />
The type determines the valid values. The value is passed after the equal sign (no space).<br />
Booleans accept <tt>0, 1, y, n</tt>. To set a boolean modifier to true, it is possible to use <tt>u=1</tt> or<br />
simply <tt>u</tt>. Integers may have range restrictions; see the <tt>c</tt> modifier in the example below.<br />
Note: When using '''hardware''' events, e.g., <tt>cycles</tt>, only the <tt>u</tt> and <tt>k</tt> modifiers<br />
are accepted. To measure at both user and kernel level use <tt>cycles:uk</tt>. In other words, there<br />
is no colon separator between the modifiers.<br />
<br />
To measure a PMU event and pass unit masks and modifiers:<br />
<pre><br />
perf stat -e inst_retired:any_p:u:c=1:i dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
In this example, we are measuring the number of cycles at the user level in which<br />
fewer (i) than 1 (c=1) instruction is retired per cycle. Note that the available modifiers depend on the underlying PMU model.<br />
All modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{| border="1"<br />
! Modifiers<br />
! Type<br />
! Description<br />
! Example<br />
|- <br />
|u || boolean || monitor at priv level 3, 2, 1 (user)|| event:u=1 or event:u<br />
|- <br />
|k || boolean || monitor at priv level 0 (kernel) || event:k=1 or event:k<br />
|- <br />
|c || integer || threshold monitoring: number of cycles when n or more occurrences of event occur || event:c=2<br />
|- <br />
|i || boolean || invert the test of threshold: number of cycles in which less than n occurrences of the event occur|| event:c=2:i<br />
|- <br />
|e || boolean || edge detect, increment the counter only when the condition goes from false -> true || event:e or event:e=1<br />
|}<br />
<br />
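For instance, modifiers from the table can be combined (a sketch, reusing the <tt>inst_retired:any_p</tt> PMU event from the example above; the exact event names and supported modifiers depend on the PMU):<br />
<pre><br />
perf stat -e inst_retired:any_p:u:c=2:e dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />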
==== Raw hardware events ====<br />
<br />
To measure an actual PMU event as specified in the HW vendor documentation, pass its hexadecimal event code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
<br />
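On Intel x86 processors, the raw code is typically formed as (unit mask << 8) | event select. As a sketch, assuming the architectural last-level cache miss event (event select 0x2E, unit mask 0x41), the encoding would be:<br />
<pre><br />
perf stat -e r412e -a sleep 1<br />
</pre><br />
<br />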
==== Multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit on the number of events that can be provided. If there are more<br />
events than there are actual hw counters, the kernel will automatically multiplex them. There<br />
is no limit on the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
<br />
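For instance, a hardware event and two software events can be measured in a single run (all three names appear in the <tt>perf list</tt> output above):<br />
<pre><br />
perf stat -e cycles,page-faults,context-switches dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />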
However, given that there is one file descriptor used per event, either per-thread (per-thread mode)<br />
or per-cpu (system-wide), it is possible to reach the maximum number of open file descriptors per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== Multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
<br />
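As a worked example of the formula: if an event was enabled for the entire run but scheduled on a counter only 25% of that time, a raw count of 1,000,000 is reported as<br />
<br />
<tt>final_count = 1,000,000 * 1.0/0.25 = 4,000,000</tt><br />
<br />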
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{| border="1"<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|-<br />
|Intel Core || 2 || 3<br />
|- <br />
|Intel Nehalem|| 4 || 3<br />
|}<br />
<br />
Generic counters can measure any events. Fixed counters can only measure one event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,946,289 cycles (scaled from 74.98%)<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
Here there was multiplexing, and thus scaling.<br />
It can be useful to pack events in a way that<br />
guarantees that events A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the current <tt>perf</tt> tool does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and get, for each count,<br />
the standard deviation from the mean.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with the ratio of std-dev to mean, is printed.<br />
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In ''per-thread'' mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrates from one<br />
processor to another, counters are saved on the current processor and restored<br />
on the new one.<br />
<br />
The ''per-process'' mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In ''per-cpu'' mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event only monitors one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
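<br />
For instance (a sketch), disabling inheritance counts only the shell itself, not the <tt>dd</tt> command it spawns:<br />
<pre><br />
perf stat -i -e cycles sh -c 'dd if=/dev/zero of=/dev/null count=100000'<br />
</pre><br />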
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis, pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
that runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
<tt>/usr/bin/sleep</tt> command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUs using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, so we instead limit monitoring to CPU 1:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the counted<br />
cycles and instructions are both roughly halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires permission<br />
to attach along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be passed<br />
the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
It is possible to attach to a specific thread within a process. By thread, we mean a kernel-visible thread,<br />
i.e., a thread visible to the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, thread 932 did not run during the 2s of the measurement; otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended. Given that kernel threads tend<br />
to be pinned to a specific CPU, it is better to use cpu-wide mode.<br />
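<br />
For instance (a sketch), if the kernel thread of interest is pinned to CPU 1, counting on that CPU captures its activity:<br />
<pre><br />
perf stat -e cycles -a -C 1 sleep 5<br />
</pre><br />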
<br />
<br />
=== Options controlling output ===<br />
<tt>perf stat</tt> can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
Large numbers are hard to read. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that, the <tt>-B</tt><br />
option must be passed and the correct <tt>LC_NUMERIC</tt> locale must be set. As the examples above showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes it easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
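<br />
Since the counts are printed on standard error, the CSV output can be captured for post-processing (a minimal sketch; <tt>counts.csv</tt> is an arbitrary name):<br />
<pre><br />
perf stat -x, -e cycles,instructions sleep 1 2> counts.csv<br />
</pre><br />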
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on a per-thread, per-process or per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly on another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
The emulation of 64-bit counters has one limitation: the sampling period must fit in<br />
the number of bits of the actual hardware counter. If it does not, the kernel '''silently''' truncates<br />
the period. Therefore, it is best to keep the period smaller than 2^31 when running<br />
on 32-bit systems.<br />
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e. where was the program when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skid on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflowed, i.e.,<br />
where it was at the end of the sampling period. In some cases, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. When the program cannot<br />
make forward progress, those two locations are indeed identical. ''For this reason, care must be taken<br />
when interpreting profiles''.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt>, but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycles</tt> event does not count when the processor is idle, i.e.,<br />
when the kernel idles the processor with the <tt>mwait</tt> instruction.<br />
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/sec (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
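<br />
For instance (a sketch), to write the samples to a file of our choosing and analyze it afterwards:<br />
<pre><br />
perf record -o noploop.data ./noploop 1<br />
perf report -i noploop.data<br />
</pre><br />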
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>retired_instructions</tt> at the user level<br />
only:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are collected for all threads executing on the monitored<br />
CPUs. To switch <tt>perf record</tt> to per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict monitoring to a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUs with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... .............................. .....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e., the level at which the program was running when it was interrupted:<br />
<br />
* [.] : user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest os user space<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
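<br />
Sort keys can also be combined (a sketch; the comma-separated form is assumed to be supported by this perf version):<br />
<pre><br />
perf report --sort=comm,dso<br />
</pre><br />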
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character, passed as the argument to the <tt>-t</tt> option:<br />
<pre><br />
perf report -t ,<br />
</pre><br />
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When the symbol is printed as a hexadecimal address, this is because the ELF image does not<br />
have a symbol table. This happens when binaries are stripped.<br />
We can sort by cpu as well. This could be useful to determine if the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples for function <tt>noploop()</tt> captured at that instruction.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate source-level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
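<br />
A build along these lines produces the annotated output below (a sketch; the <tt>noploop.c</tt> source itself is not shown in this tutorial):<br />
<pre><br />
gcc -ggdb -o noploop noploop.c<br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
</pre><br />
<br />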
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 260 irqs/sec kernel:61.5% exact: 0.0% [1000Hz cycles], (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycles</tt> event counts CPU cycles when the<br />
processor is not in halted state, i.e. not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The perf_events kernel interface used by the perf tool uses one file descriptor<br />
per event, per thread or per CPU.<br />
<br />
On a 16-way system, when you do:<br />
<pre><br />
perf stat -e cycles sleep 1<br />
</pre><br />
You are effectively creating 16 events, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
<br />
In case perf aborts with a 'too many open files' error, there are a few solutions:<br />
<br />
* increase the number of per-process open files using <tt>ulimit -n</tt>. Caveat: you must be root<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring (see the sketch below)<br />
<br />
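As a sketch of the last option, restricting a system-wide session to two of the CPUs cuts the per-event file descriptor count accordingly:<br />
<pre><br />
perf record -e cycles -a -C 0-1 sleep 5<br />
</pre><br />
<br />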
==== Increasing the open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> file unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes those of all processes running on the system. Those unique identifiers are generated by the linker if<br />
the <tt>-Wl,--build-id</tt> option is used. Thus, they are called <tt>build-id</tt>.<br />
The <tt>build-id</tt> is a helpful tool when correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
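<br />
The build-ids also make cross-machine analysis possible: the <tt>perf archive</tt> command (listed earlier) bundles the ELF images referenced by <tt>perf.data</tt>. A sketch, assuming the default archive name and cache location:<br />
<pre><br />
perf archive<br />
# copy perf.data and perf.data.tar.bz2 to the analysis machine, then:<br />
tar xf perf.data.tar.bz2 -C ~/.debug<br />
perf report -i perf.data<br />
</pre><br />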
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
<br />
Given that a <tt>build-id</tt> is immutable, it uniquely identifies a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory which is by default <tt>$HOME/.debug</tt>. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which can be used by sysadmins to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
<br />
In certain situations it may be beneficial to turn off the <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt>:<br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report:<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
<br />
----<br />
<br />
This guide is adapted from an earlier tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ CreativeCommons attribution sharealike 3.0 license].
<hr />
<div>__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple commandline interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on a Ubuntu 11.04<br />
system with<br />
kernel 2.6.38-8-generic results running on an HP 6710b with dual-core Intel Core2 T7100 CPU).<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage <tt>build-id</tt> cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some event are pure kernel counters, in this case they are<br />
called '''software events'''. Examples include: context-switches, minor-fault.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events to measure micro-architectural events<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
events monikers. On each processor, those events get mapped<br />
onto an actual events provided by the CPU, if they exists, otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware<br />
event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides some short description of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and presented on standard<br />
output at the end<br />
of an application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle).<br />
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
==== Modifiers ====<br />
<br />
Modifiers have a type indicated in parenthesis.<br />
The type determines the valid values. The value is passed after the equal sign (no space).<br />
Booleans accept <tt>0, 1, y, n</tt>. To set a boolean modifier to true, it is possible to use <tt>u=1</tt> or<br />
simply <tt>u</tt>. Integer may have range restrictions, see <tt>c</tt> modifier in the example above.<br />
Note: When using '''hardware''' events, e.g., <tt>cycles</tt>, only the <tt>u</tt> and <tt>k</tt> modifiers<br />
are accepted. To measure at both user and kernel level use <tt>cycles:uk</tt>. In other words, there<br />
is no colon separator between the modifiers.<br />
<br />
To measure a PMU event and pass unit masks and modifiers:<br />
<pre><br />
perf stat -e inst_retired:any_p:u:c=1:i dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
In this example, we are measuring the number of cycles at the user level in which<br />
less (i) than 1 (c=1) instruction is retired per cycles. Note that for actual events, the modifiers depends on the underlying PMU model.<br />
All modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{| border="1"<br />
! Modifiers<br />
! Type<br />
! Description<br />
! Example<br />
|- <br />
|u || boolean || monitor at priv level 3, 2, 1 (user)|| event:u=1 or event:u<br />
|- <br />
|k || boolean || monitor at priv level 0 (kernel) || event:k=1 or event:k<br />
|- <br />
|c || integer || threshold monitoring: number of cycles when n or more occurrences of event occur || event:c=2<br />
|- <br />
|i || boolean || invert the test of threshold: number of cycles in which less than n occurrences of the event occur|| event:c=2:i<br />
|- <br />
|e || boolean || edge detect, increment the counter only when the condition goes from false -> true || event:e or event:e=1<br />
|}<br />
<br />
==== Hardware events ====<br />
<br />
To measure an actual PMU as provided by the HW vendor documentation, pass the hexadecimal parameter code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
<br />
==== multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit in terms of the number of events that can be provided. If there are more<br />
events than there are actual hw counters, the kernel will automatically multiplex them. There<br />
is no limit of the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
<br />
However, given that there is one file descriptor used per event and either per-thread (per-thread mode)<br />
or per-cpu (system-wide), it is possible to reach the maximum number of open file descriptor per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== Multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
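<br />
As a purely illustrative calculation (the numbers are hypothetical): if an event was scheduled onto the PMU for only a quarter of the run, the raw count is scaled up by a factor of four:<br />
<pre><br />
raw_count    = 500,000,000<br />
time_enabled = 1.00 s<br />
time_running = 0.25 s<br />
final_count  = 500,000,000 * 1.00/0.25 = 2,000,000,000<br />
</pre><br />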
<br />
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{| border="1"<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|-<br />
|Intel Core || 2 || 3<br />
|- <br />
|Intel Nehalem|| 4 || 3<br />
|}<br />
<br />
Generic counters can measure any events. Fixed counters can only measure one event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,946,289 cycles (scaled from 74.98%)<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
There was multiplexing and thus scaling.<br />
It can be interesting to try and pack events in a way that<br />
guarantees that event A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the current <tt>perf</tt> tool does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and get, for each count,<br />
the standard deviation from the mean.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with the ratio of standard deviation to mean, is printed.<br />
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In ''per-thread'' mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrates from one<br />
processor to another, counters are saved on the current processor and are restored<br />
on the new one.<br />
<br />
The ''per-process'' mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In ''per-cpu'' mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event is only monitoring one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
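<br />
For instance, to count cycles for a command while excluding any child processes it spawns (the <tt>make</tt> invocation is a hypothetical workload, used only for illustration):<br />
<pre><br />
perf stat -i -e cycles make -j4<br />
</pre><br />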
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
that runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
<tt>/usr/bin/sleep</tt> command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUs using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, but we can limit to CPU 1.<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the counted<br />
cycles and instructions are both roughly halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires permission<br />
to attach, along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be<br />
used with the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
It is possible to attach to a specific thread within a process. By thread, we mean a kernel-visible thread,<br />
i.e., a thread visible to the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, the thread 932 did not run during the 2s of the measurement. Otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended. Given that kernel threads tend<br />
to be pinned to a specific CPU, it is best to use the cpu-wide mode.<br />
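<br />
For instance, to observe a kernel thread pinned to CPU 0, one could monitor that CPU directly for a fixed duration (the CPU number is an illustrative assumption):<br />
<pre><br />
perf stat -e cycles -a -C 0 sleep 2<br />
</pre><br />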
<br />
<br />
=== Options controlling output ===<br />
<tt>perf stat</tt> can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
For most people, it is hard to read large numbers. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that, the <tt>-B</tt><br />
option must be used and <tt>LC_NUMERIC</tt> must be set to a suitable locale. As the examples above showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes it easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
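<br />
Since <tt>perf stat</tt> prints its counts on standard error, one simple way to capture the CSV-style output in a file is shell redirection (a sketch, assuming the default output behavior):<br />
<pre><br />
perf stat -x, -e cycles,instructions date 2> counts.csv<br />
</pre><br />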
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on per-thread, per-process and per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly on another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
The way perf_events emulates a 64-bit counter limits sampling periods to what can be expressed<br />
in the number of bits of the actual hardware counter. If that width is smaller than 64 bits, the kernel '''silently''' truncates<br />
the period to fit. Therefore, it is best to keep the period smaller than 2^31 when running<br />
on 32-bit systems.<br />
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e. where was the program when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skid on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflowed, i.e.,<br />
where it was at the end of the sampling period. In some cases, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. When the program cannot<br />
make forward progress, those two locations are indeed identical. ''For this reason, care must be taken<br />
when interpreting profiles''.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt> but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycles</tt> event does not count when the processor is idle, i.e.,<br />
when it executes the <tt>mwait</tt> instruction.<br />
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/sec (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
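<br />
For instance, to record into a custom file and analyze it afterwards (a minimal sketch):<br />
<pre><br />
perf record -o noploop.data ./noploop 1<br />
perf report -i noploop.data<br />
</pre><br />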
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>retired_instructions</tt>, at the user level<br />
only:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are collected for all threads executing on the monitored<br />
CPU. To switch <tt>perf record</tt> to per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict to a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUs with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
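<br />
Similarly, to restrict the sampling to a single CPU (CPU 0 is chosen here purely for illustration):<br />
<pre><br />
perf record -a -C 0 -F 1000 sleep 5<br />
</pre><br />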
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead         Command                  Shared Object  Symbol<br />
# ........  ...............  .............................  .....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e., the level at which the program was running when it was interrupted:<br />
<br />
* [.]: user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest os user space<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
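<br />
Sort keys can also be combined. For instance, to break samples down first by command and then by shared object (a sketch):<br />
<pre><br />
perf report --sort=comm,dso<br />
</pre><br />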
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character, passed as argument to the <tt>-t</tt> option:<br />
<pre><br />
perf report -t ,<br />
</pre><br />
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When a symbol is printed as a hexadecimal address, it is because the ELF image does not<br />
have a symbol table. This happens when binaries are stripped.<br />
We can sort by cpu as well. This could be useful to determine if the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples for function <tt>noploop()</tt> captured at that instruction.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate source code level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
   PerfTop:     260 irqs/sec  kernel:61.5%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycles</tt> event counts CPU cycles when the<br />
processor is not in halted state, i.e., not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
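<br />
For instance, to watch only user-level instructions on a single CPU (the event and CPU choices are purely illustrative):<br />
<pre><br />
perf top -e instructions:u -C 1<br />
</pre><br />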
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The perf_events kernel interface used by the perf tool is designed such that it uses one file descriptor<br />
per event, per thread or per CPU.<br />
<br />
On a 16-way system, when you do:<br />
<pre><br />
perf stat -e cycles sleep 1<br />
</pre><br />
You are effectively creating 16 events, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
<br />
In case perf aborts with a 'too many open files' error, there are a few solutions:<br />
<br />
* increase the number of per-process open files using <tt>ulimit -n</tt>. Caveat: you must be root<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring<br />
<br />
==== increasing open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> file unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes all processes running on the system. Those unique identifiers are generated by the linker if<br />
the <tt>-Wl,--build-id</tt> option is used. Thus, they are called <tt>build-id</tt>.<br />
The <tt>build-id</tt> is a helpful tool when correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
<br />
Given that a <tt>build-id</tt> is immutable, it uniquely identifies a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory which is by default <tt>$HOME/.debug</tt>. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which can be used by the sysadmin to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
<br />
In certain situations it may be beneficial to turn off the <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt>:<br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report:<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
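<br />
On kernels of this vintage, the degree of access granted to unprivileged users is typically governed by the <tt>perf_event_paranoid</tt> sysctl (assuming your kernel exposes it); inspecting it can help diagnose permission errors:<br />
<pre><br />
cat /proc/sys/kernel/perf_event_paranoid<br />
</pre><br />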
<br />
----<br />
<br />
This guide is adapted from an earlier tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ CreativeCommons attribution sharealike 3.0 license].</div>
<hr />
<div>= Linux profiling with Perf =<br />
<br />
__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple commandline interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on a Ubuntu 11.04<br />
system with<br />
kernel 2.6.38-8-generic results running on an HP 6710b with dual-core Intel Core2 T7100 CPU).<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage <tt>build-id</tt> cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some event are pure kernel counters, in this case they are<br />
called '''software events'''. Examples include: context-switches, minor-fault.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events to measure micro-architectural events<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
events monikers. On each processor, those events get mapped<br />
onto an actual events provided by the CPU, if they exists, otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware<br />
event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides some short description of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and presented on standard<br />
output at the end<br />
of an application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle).<br />
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
==== Modifiers ====<br />
<br />
Modifiers have a type indicated in parenthesis.<br />
The type determines the valid values. The value is passed after the equal sign (no space).<br />
Booleans accept <tt>0, 1, y, n</tt>. To set a boolean modifier to true, it is possible to use <tt>u=1</tt> or<br />
simply <tt>u</tt>. Integer may have range restrictions, see <tt>c</tt> modifier in the example above.<br />
Note: When using '''hardware''' events, e.g., <tt>cycles</tt>, only the <tt>u</tt> and <tt>k</tt> modifiers<br />
are accepted. To measure at both user and kernel level use <tt>cycles:uk</tt>. In other words, there<br />
is no colon separator between the modifiers.<br />
<br />
To measure a PMU event and pass unit masks and modifiers:<br />
<pre><br />
perf stat -e inst_retired:any_p:u:c=1:i dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
In this example, we are measuring the number of cycles at the user level in which<br />
less (i) than 1 (c=1) instruction is retired per cycles. Note that for actual events, the modifiers depends on the underlying PMU model.<br />
All modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{| border="1"<br />
! Modifiers<br />
! Type<br />
! Description<br />
! Example<br />
|- <br />
|u || boolean || monitor at priv level 3, 2, 1 (user)|| event:u=1 or event:u<br />
|- <br />
|k || boolean || monitor at priv level 0 (kernel) || event:k=1 or event:k<br />
|- <br />
|c || integer || threshold monitoring: number of cycles when n or more occurrences of event occur || event:c=2<br />
|- <br />
|i || boolean || invert the test of threshold: number of cycles in which less than n occurrences of the event occur|| event:c=2:i<br />
|- <br />
|e || boolean || edge detect, increment the counter only when the condition goes from false -> true || event:e or event:e=1<br />
|}<br />
<br />
==== Hardware events ====<br />
<br />
To measure an actual PMU as provided by the HW vendor documentation, pass the hexadecimal parameter code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
<br />
==== multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit in terms of the number of events that can be provided. If there are more<br />
events than there are actual hw counters, the kernel will automatically multiplex them. There<br />
is no limit of the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
<br />
However, given that there is one file descriptor used per event and either per-thread (per-thread mode)<br />
or per-cpu (system-wide), it is possible to reach the maximum number of open file descriptor per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
<br />
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{| border="1"<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|-<br />
|Intel Core || 2 || 3<br />
|- <br />
|Intel Nehalem|| 4 || 3<br />
|}<br />
<br />
Generic counters can measure any events. Fixed counters can only measure one event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,946,289 cycles (scaled from 74.98%)<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
There was multiplexing and thus scaling.<br />
It can be interesting to try and pack events in a way that<br />
guarantees that event A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the current <tt>perf</tt> tool does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and get for each count,<br />
the standard deviation from the mean.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with ratio of std-dev/mean is printed.<br />
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In ''per-thread'' mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrated from one<br />
processor to another, counters are saved on the current processor and are restored<br />
on the new one.<br />
<br />
The ''per-process'' mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In ''per-cpu'' mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event is only monitoring one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
that runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
<tt>/usr/bin/sleep</tt> command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUs using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, but we can limit to CPU 1.<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the number of counted<br />
cycles and instructions are both halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires the permission<br />
to attach along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be<br />
used with the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
It is possible to attach to a specific thread within a process. By thread, we mean a kernel-visible thread,<br />
in other words, a thread visible to the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, thread 932 did not run during the 2s of the measurement. Otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended: given that kernel threads tend<br />
to be pinned to a specific CPU, it is best to use the cpu-wide mode.<br />
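As noted above, the timing command is optional. A minimal sketch of an open-ended attach, reusing the sshd process id from the earlier example:<br />
<pre><br />
# monitor until interrupted; press Ctrl-C to stop and print the counts<br />
perf stat -e cycles -p 2262<br />
</pre><br />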
<br />
<br />
=== Options controlling output ===<br />
<tt>perf stat</tt> can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
For most people, it is hard to read large numbers. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that the <tt>-B</tt><br />
option and the correct locale for <tt>LC_NUMERIC</tt> must be set. As the above example showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes it easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
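The counts are printed on standard error, so a sketch for saving the CSV output to a file might look like:<br />
<pre><br />
# keep the command's own output on the terminal, save the counts<br />
perf stat -x, -e cycles,instructions date 2> counts.csv<br />
</pre><br />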
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on per-thread, per-process and per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly on another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
Even though perf_events emulates 64-bit counters, sampling periods must still be expressed<br />
using the number of bits in the actual hardware counters. If this width is smaller than 64, the kernel '''silently''' truncates<br />
the period. Therefore, it is best to keep the period smaller than 2^31 when running<br />
on 32-bit systems.<br />
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e. where was the program when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e.,<br />
where it was at the end of the sampling period. In some cases, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. Only when the program cannot<br />
make forward progress are those two locations identical. ''For this reason, care must be taken<br />
when interpreting profiles''.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt> but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycle</tt> event does not count when the processor is idle, i.e.,<br />
when it calls <tt>mwait()</tt>.<br />
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/sec (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
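Combining the two options just mentioned, a sketch that overwrites any previous data and writes to a custom file:<br />
<pre><br />
perf record -f -o noploop.data ./noploop 1<br />
</pre><br />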
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>instructions</tt>, only at the user level:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are collected for all threads executing on the monitored<br />
CPU. To switch <tt>perf record</tt> to per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict to a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUs with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
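For instance, to restrict the same system-wide sampling to CPU0, a sketch:<br />
<pre><br />
perf record -a -C 0 -F 1000 sleep 5<br />
</pre><br />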
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead  Command          Shared Object                   Symbol<br />
# ........  ...............  ..............................  .....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e., the level at which the program was running when it was interrupted:<br />
<br />
* [.]: user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest os user space<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character:<br />
<pre><br />
perf report -t ,<br />
</pre><br />
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When the symbol is printed as a hexadecimal address, it is because the ELF image does not<br />
have a symbol table. This happens when binaries are stripped.<br />
We can sort by CPU as well. This can be useful to determine whether the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
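Sort keys can also be combined; as a sketch, assuming this version accepts comma-separated keys as it does for the default ordering:<br />
<pre><br />
perf report --sort=cpu,dso<br />
</pre><br />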
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples for function <tt>noploop()</tt> captured at that instruction.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate source code level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
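To obtain such annotated output, the test program must be rebuilt with debugging information; a sketch, assuming the source file is named <tt>noploop.c</tt>:<br />
<pre><br />
gcc -ggdb -o noploop noploop.c<br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
</pre><br />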
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
   PerfTop:     260 irqs/sec  kernel:61.5%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycles</tt> event counts CPU cycles when the<br />
processor is not in a halted state, i.e., not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
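For example, to watch only CPU0 with the <tt>-C</tt> option mentioned above, a sketch:<br />
<pre><br />
perf top -C 0<br />
</pre><br />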
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The perf_event kernel interface used by the perf tool allocates one file descriptor<br />
per event, per thread or per CPU.<br />
<br />
On a 16-way system, when you do:<br />
<pre><br />
perf stat -e cycles sleep 1<br />
</pre><br />
You are effectively creating 16 events, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
<br />
In case perf aborts with a 'too many open files' error, there are a few solutions:<br />
<br />
* increase the number of per-process open files using <tt>ulimit -n</tt>. Caveat: you must be root<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring (see the sketch after this list)<br />
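A sketch of the last option: restricting a system-wide count to two CPUs, so one event consumes only two descriptors instead of one per online CPU:<br />
<pre><br />
perf stat -e cycles -a -C 0-1 sleep 1<br />
</pre><br />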
<br />
==== increasing open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> file unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes all processes running on the system. Those unique identifiers are generated by the linker if<br />
the <tt>-Wl,--build-id</tt> option is used. Thus, they are called <tt>build-id</tt>s.<br />
The <tt>build-id</tt> is a helpful tool when correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
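To carry the referenced binaries along with <tt>perf.data</tt> to another machine, the <tt>perf archive</tt> command listed earlier can be used; a sketch, assuming the build-id cache holds the images:<br />
<pre><br />
perf archive perf.data<br />
# expected to produce perf.data.tar.bz2, to be unpacked into ~/.debug<br />
# on the analysis machine before running perf report there<br />
</pre><br />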
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
<br />
Given that a <tt>build-id</tt> is immutable, it uniquely identifies a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory which is by default $HOME/.debug. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which can be used by sysadmins to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
<br />
In certain situations it may be beneficial to turn off the <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt>:<br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report:<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
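On kernels of this vintage, unprivileged access is governed by the <tt>perf_event_paranoid</tt> setting; a quick sketch to inspect it (lower values are more permissive):<br />
<pre><br />
cat /proc/sys/kernel/perf_event_paranoid<br />
</pre><br />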
<br />
----<br />
<br />
This guide is adapted from an earlier tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ CreativeCommons attribution sharealike 3.0 license].</div><br />
<hr />
<div>= Linux profiling with Perf =<br />
<br />
__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple commandline interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on a Ubuntu 11.04<br />
system with<br />
kernel 2.6.38-8-generic results running on an HP 6710b with dual-core Intel Core2 T7100 CPU).<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage <tt>build-id</tt> cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some event are pure kernel counters, in this case they are<br />
called '''software events'''. Examples include: context-switches, minor-fault.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events to measure micro-architectural events<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
events monikers. On each processor, those events get mapped<br />
onto an actual events provided by the CPU, if they exists, otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware<br />
event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides some short description of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and presented on standard<br />
output at the end<br />
of an application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle).<br />
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
==== Modifiers ====<br />
<br />
Modifiers have a type indicated in parenthesis.<br />
The type determines the valid values. The value is passed after the equal sign (no space).<br />
Booleans accept <tt>0, 1, y, n</tt>. To set a boolean modifier to true, it is possible to use <tt>u=1</tt> or<br />
simply <tt>u</tt>. Integer may have range restrictions, see <tt>c</tt> modifier in the example above.<br />
Note: When using '''hardware''' events, e.g., <tt>cycles</tt>, only the <tt>u</tt> and <tt>k</tt> modifiers<br />
are accepted. To measure at both user and kernel level use <tt>cycles:uk</tt>. In other words, there<br />
is no colon separator between the modifiers.<br />
<br />
To measure a PMU event and pass unit masks and modifiers:<br />
<pre><br />
perf stat -e inst_retired:any_p:u:c=1:i dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
In this example, we are measuring the number of cycles at the user level in which<br />
less (i) than 1 (c=1) instruction is retired per cycles. Note that for actual events, the modifiers depends on the underlying PMU model.<br />
All modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{| border="1"<br />
! Modifiers<br />
! Type<br />
! Description<br />
! Example<br />
|- <br />
|u || boolean || monitor at priv level 3, 2, 1 (user)|| event:u=1 or event:u<br />
|- <br />
|k || boolean || monitor at priv level 0 (kernel) || event:k=1 or event:k<br />
|- <br />
|c || integer || threshold monitoring: number of cycles when n or more occurrences of event occur || event:c=2<br />
|- <br />
|i || boolean || invert the test of threshold: number of cycles in which less than n occurrences of the event occur|| event:c=2:i<br />
|- <br />
|e || boolean || edge detect, increment the counter only when the condition goes from false -> true || event:e or event:e=1<br />
|}<br />
<br />
==== Hardware events ====<br />
<br />
To measure an actual PMU as provided by the HW vendor documentation, pass the hexadecimal parameter code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
<br />
==== multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit in terms of the number of events that can be provided. If there are more<br />
events than there are actual hw counters, the kernel will automatically multiplex them. There<br />
is no limit of the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
<br />
However, given that there is one file descriptor used per event and either per-thread (per-thread mode)<br />
or per-cpu (system-wide), it is possible to reach the maximum number of open file descriptor per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
<br />
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{| border="1"<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|-<br />
|Intel Core || 2 || 3<br />
|- <br />
|Intel Nehalem|| 4 || 3<br />
|}<br />
<br />
Generic counters can measure any events. Fixed counters can only measure one event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,946,289 cycles (scaled from 74.98%)<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
There was multiplexing and thus scaling.<br />
It can be interesting to try and pack events in a way that<br />
guarantees that event A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the current <tt>perf</tt> tool does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and get for each count,<br />
the standard deviation from the mean.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with ratio of std-dev/mean is printed.<br />
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In ''per-thread'' mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrated from one<br />
processor to another, counters are saved on the current processor and are restored<br />
on the new one.<br />
<br />
The ''per-process'' mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In ''per-cpu'' mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event is only monitoring one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
than runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
=/usr/bin/sleep= command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUS using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, but we can limit to CPU 1.<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the number of counted<br />
cycles and instructions are both halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires the permission<br />
to attach along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be<br />
the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
It is possible to attach a specific thread within a process. By thread, we mean kernel visible thread.<br />
In other words, a thread visible by the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, the thread 932 did not run during the 2s of the measurement. Otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended. Given that kernel threads tend<br />
to be pinned to a specific CPU, it is best to use the cpu-wide mode.<br />
<br />
<br />
=== Options controlling output ===<br />
<tt>perf stat</tt> can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
For most people, it is hard to read large numbers. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that the <tt>-B</tt><br />
option and the correct locale for <tt>LC_NUMERIC</tt> must be set. As the above example showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=10000000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes is easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on per-thread, per-process and per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly an another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
The way perf_events emulates the 64-bit counter limits sampling periods to the number of bits<br />
in the actual hardware counter. If that width is smaller than 64 bits, the kernel '''silently''' truncates<br />
the period. Therefore, it is best to keep the period smaller than 2^31 when running<br />
on 32-bit systems.<br />
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e., where the program was when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e.,<br />
where it was at the end of the sampling period. In some cases, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. When the program cannot<br />
make forward progress, those two locations are indeed identical. ''For this reason, care must be taken<br />
when interpreting profiles''.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt>, but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycles</tt> event does not count when the processor is idle, i.e.,<br />
when it calls <tt>mwait()</tt>.<br />
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/sec (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
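For instance, to overwrite without prompting and write to a custom file, then inspect it (a sketch; the<br />
file name is arbitrary):<br />
<pre><br />
perf record -f -o noploop.data ./noploop 1<br />
perf report -i noploop.data<br />
</pre><br />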
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period, instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>instructions</tt> at the user<br />
level only:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are collected for all threads executing on the monitored<br />
CPUs. To switch <tt>perf record</tt> to per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict monitoring to a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUs with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
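And, as a sketch of restricting the same measurement to a subset of CPUs, sampling only CPU 1 for 5s:<br />
<pre><br />
perf record -a -C 1 -F 1000 sleep 5<br />
</pre><br />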
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead          Command                  Shared Object                                 Symbol<br />
# ........  ...............  ..............................  .....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e., the level at which the program was running when it was interrupted:<br />
<br />
* [.]: user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest os user space<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
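Sort keys can also be combined with commas. For instance, a sketch breaking the profile down first by<br />
process and then by shared object:<br />
<pre><br />
perf report --sort=comm,dso<br />
</pre><br />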
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character, for instance a comma:<br />
<pre><br />
perf report -t ,<br />
</pre><br />
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When a symbol is printed as a hexadecimal address, it is because the ELF image does not<br />
have a symbol table; this happens when binaries are stripped.<br />
We can sort by cpu as well. This could be useful to determine if the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples for function <tt>noploop()</tt> captured at that instruction.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate source-code-level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
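For reference, a binary with such debugging information can be produced by passing <tt>-ggdb</tt> at compile<br />
time. A hypothetical build-and-annotate sequence (assuming the source file is named <tt>noploop.c</tt>):<br />
<pre><br />
gcc -ggdb -o noploop noploop.c<br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
</pre><br />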
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 260 irqs/sec kernel:61.5% exact: 0.0% [1000Hz cycles], (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycles</tt> event counts CPU cycles when the<br />
processor is not in halted state, i.e. not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
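The sampling event and the monitored CPUs can be changed from the command line, as with the other<br />
commands. A sketch (the event choice is illustrative):<br />
<pre><br />
perf top -e cache-misses -C 0<br />
</pre><br />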
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The perf_events kernel interface used by the <tt>perf</tt> tool uses one file descriptor<br />
per event, per thread or per CPU.<br />
<br />
On a 16-way system, when you do:<br />
<pre><br />
perf stat -e cycles sleep 1<br />
</pre><br />
You are effectively creating 16 events, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
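To check ahead of time whether a run fits within the limit, compare the current limit against the<br />
product of threads, events and CPUs. A sketch mirroring the numbers above:<br />
<pre><br />
ulimit -n                # current per-process limit<br />
echo $((100 * 1 * 16))   # threads * events * CPUs = descriptors needed<br />
</pre><br />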
<br />
If perf aborts with a 'too many open files' error, there are a few solutions:<br />
<br />
* increase the number of per-process open files using <tt>ulimit -n</tt>. Caveat: you must be root<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring<br />
<br />
==== increasing open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> file unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes all processes running on the system. Those unique identifiers are generated by the linker if<br />
the <tt>-Wl,--build-id</tt> option is used; thus, they are called <tt>build-id</tt>.<br />
The <tt>build-id</tt> is a helpful tool when correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
<br />
Given that a <tt>build-id</tt> is immutable, it uniquely identifies a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory, by default <tt>$HOME/.debug</tt>. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which sysadmins can use to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
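The cache can also be inspected and populated by hand with the <tt>perf buildid-cache</tt> command. A sketch<br />
(the <tt>./noploop</tt> binary is from the earlier examples):<br />
<pre><br />
perf buildid-cache -a ./noploop      # add a binary to the cache<br />
find ~/.debug -type f | head -3      # peek at the cached copies<br />
</pre><br />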
<br />
In certain situations it may be beneficial to turn off the <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt>:<br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report:<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
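In that case, rerun the same command with elevated privileges, e.g. via <tt>sudo</tt> (a sketch):<br />
<pre><br />
sudo perf stat -a sleep 1<br />
</pre><br />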
<br />
----<br />
<br />
This guide is adapted from an earlier tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ CreativeCommons attribution sharealike 3.0 license].</div>
<hr />
<div>= Linux profiling with Perf =<br />
<br />
__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple commandline interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on a Ubuntu 11.04<br />
system with<br />
kernel 2.6.38-8-generic results running on an HP 6710b with dual-core Intel Core2 T7100 CPU).<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage <tt>build-id</tt> cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some event are pure kernel counters, in this case they are<br />
called '''software events'''. Examples include: context-switches, minor-fault.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events to measure micro-architectural events<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
events monikers. On each processor, those events get mapped<br />
onto an actual events provided by the CPU, if they exists, otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware<br />
event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides some short description of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and presented on standard<br />
output at the end<br />
of an application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle).<br />
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
==== Modifiers ====<br />
<br />
Modifiers have a type indicated in parenthesis.<br />
The type determines the valid values. The value is passed after the equal sign (no space).<br />
Booleans accept <tt>0, 1, y, n</tt>. To set a boolean modifier to true, it is possible to use <tt>u=1</tt> or<br />
simply <tt>u</tt>. Integer may have range restrictions, see <tt>c</tt> modifier in the example above.<br />
Note: When using '''hardware''' events, e.g., <tt>cycles</tt>, only the <tt>u</tt> and <tt>k</tt> modifiers<br />
are accepted. To measure at both user and kernel level use <tt>cycles:uk</tt>. In other words, there<br />
is no colon separator between the modifiers.<br />
<br />
To measure a PMU event and pass unit masks and modifiers:<br />
%RED%<br />
<pre><br />
perf stat -e inst_retired:any_p:u:c=1:i dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
%ENDCOLOR%<br />
In this example, we are measuring the number of cycles at the user level in which<br />
less (i) than 1 (c=1) instruction is retired per cycles. Note that for actual events, the modifiers depends on the underlying PMU model.<br />
All modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{|<br />
! Modifiers<br />
! Type<br />
! Description<br />
! Example<br />
|- |u | boolean | monitor at priv level 3, 2, 1 (user)| event:u=1 or event:u<br />
|- |k | boolean | monitor at priv level 0 (kernel) | event:k=1 or event:k<br />
|- |c | integer | threshold monitoring: number of cycles when n or more occurrences of event occur | event:c=2<br />
|- |i | boolean | invert the test of threshold: number of cycles in which less than n occurrences of the event occur| event:c=2:i<br />
|- |e | boolean | edge detect, increment the counter only when the condition goes from false -> true | event:e or event:e=1<br />
|}<br />
<br />
==== Hardware events ====<br />
<br />
To measure an actual PMU as provided by the HW vendor documentation, pass the hexadecimal parameter code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
<br />
==== multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit in terms of the number of events that can be provided. If there are more<br />
events than there are actual hw counters, the kernel will automatically multiplex them. There<br />
is no limit of the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
<br />
However, given that there is one file descriptor used per event and either per-thread (per-thread mode)<br />
or per-cpu (system-wide), it is possible to reach the maximum number of open file descriptor per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
<br />
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{|<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|+ |Intel Core | 2 | 3<br />
|+ |Intel Nehalem| 4 | 3<br />
|}<br />
<br />
Generic counters can measure any events. Fixed counters can only measure one event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,946,289 cycles (scaled from 74.98%)<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
There was multiplexing and thus scaling.<br />
It can be interesting to try and pack events in a way that<br />
guarantees that event A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the current <tt>perf</tt> tool does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and get for each count,<br />
the standard deviation from the mean.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with ratio of std-dev/mean is printed.<br />
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In *per-thread* mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrated from one<br />
processor to another, counters are saved on the current processor and are restored<br />
on the new one.<br />
<br />
The per-process mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In per-cpu mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event is only monitoring one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
than runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
=/usr/bin/sleep= command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUS using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, but we can limit to CPU 1.<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the number of counted<br />
cycles and instructions are both halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires the permission<br />
to attach along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be<br />
the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
<br />
It is possible to attach a specific thread within a process. By thread, we mean kernel visible thread.<br />
In other words, a thread visible by the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, the thread 932 did not run during the 2s of the measurement. Otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended. Given that kernel threads tend<br />
to be pinned to a specific CPU, it is best to use the cpu-wide mode.<br />
<br />
<br />
=== Options controlling output ===<br />
=perf stat= can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
For most people, it is hard to read large numbers. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that the <tt>-B</tt><br />
option and the correct locale for <tt>LC_NUMERIC</tt> must be set. As the above example showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=10000000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes is easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on per-thread, per-process and per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly an another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
The way perf_events emulates 64-bit counter is limited to expressing sampling periods<br />
using the number of bits in the actual hardware counters. If this is smaller than 64, the kernel '''silently''' truncates<br />
the period in this case. Therefore, it is best if the period is always smaller than 2^31 if running<br />
on 32-bit systems.<br />
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e. where was the program when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e.,<br />
where it was at the end of the sampling period. In some case, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. When the program cannot<br />
make forward progress, those two locations are indeed identical. *For this reason, care must be taken<br />
when interpreting profiles*.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt> but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycle</tt> event does not count when the processor is idle, i.e.,<br />
when it calls <tt>mwait()</tt>.<br />
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/secĀ (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period, instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>instructions</tt> only at the user level<br />
only:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode mode, samples are collected for all threads executing on the monitored<br />
CPU. To switch <tt>perf record</tt> in per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict to the a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUS with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Command Shared Object<br />
Symbol<br />
# ........ ............... ..............................<br />
.....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e., the mode the program was running in when it was interrupted:<br />
* [.]: user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest OS user level<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
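<br />
Sort keys can also be combined. For instance, to group samples first by process and then by shared object (output omitted):<br />
<pre><br />
perf report --sort=comm,dso<br />
</pre><br />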
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character, for instance a comma:<br />
<pre><br />
perf report -t ,<br />
</pre><br />
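The separated output is then easy to post-process with standard tools. A sketch, assuming a comma separator, that prints the overhead and symbol columns:<br />
<pre><br />
perf report -t , | grep -v '^#' | awk -F, '{ print $1, $4 }'<br />
</pre><br />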
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When a symbol is printed as a hexadecimal address, it is because the ELF image does not<br />
have a symbol table. This happens when binaries are stripped.<br />
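One way to verify whether a binary was stripped is the <tt>file</tt> command, which reports the presence of a symbol table (output abbreviated):<br />
<pre><br />
file ./noploop<br />
./noploop: ELF 32-bit LSB executable, Intel 80386 [...] not stripped<br />
</pre><br />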
We can sort by cpu as well. This could be useful to determine if the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples for the <tt>noploop</tt> program captured at that instruction.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate source code level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
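<br />
For reference, the binary above could have been built and profiled as follows (the source file name is illustrative):<br />
<pre><br />
gcc -ggdb -o noploop noploop.c<br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
</pre><br />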
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
   PerfTop:     260 irqs/sec  kernel:61.5%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycles</tt> event counts CPU cycles when the<br />
processor is not in halted state, i.e. not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
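<br />
For example, to monitor only CPU 1 (a sketch, assuming the same <tt>-C</tt> syntax as described above):<br />
<pre><br />
perf top -C 1<br />
</pre><br />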
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The design of the <tt>perf_events</tt> kernel interface, which is used by the perf tool, is such that it uses one file descriptor<br />
per event per-thread or per-cpu.<br />
<br />
On a 16-way system, when you do:<br />
<pre><br />
perf stat -e cycles sleep 1<br />
</pre><br />
You are effectively creating 16 events, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
<br />
If perf aborts with a 'too many open files' error, there are a few solutions:<br />
* increase the number of per-process open files using <tt>ulimit -n</tt>. Caveat: you must be root<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring<br />
<br />
==== increasing open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> file unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes all processes running on the system. Those unique identifiers are generated by the linker if<br />
the <tt>-Wl,--build-id</tt> option is used. Thus, they are called <tt>build-id</tt>.<br />
The <tt>build-id</tt> is a helpful tool for correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
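<br />
To generate and inspect a <tt>build-id</tt> for your own binaries (a sketch; file names are illustrative):<br />
<pre><br />
gcc -Wl,--build-id -o noploop noploop.c<br />
readelf -n ./noploop<br />
</pre><br />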
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
Given that a <tt>build-id</tt> is immutable, it uniquely identifies a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory which is by default $HOME/.debug. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which can be used by sysadmins to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
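<br />
The cache can also be populated manually with the <tt>perf buildid-cache</tt> command (a sketch; the path is illustrative):<br />
<pre><br />
perf buildid-cache -a ./noploop<br />
ls $HOME/.debug<br />
</pre><br />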
<br />
In certain situations it may be beneficial to turn off <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt>:<br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report:<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
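<br />
For example, system-wide collection typically fails for unprivileged users but succeeds as root (a sketch):<br />
<pre><br />
$ perf stat -e cycles -a sleep 2<br />
No permission to collect system-wide stats.<br />
$ sudo perf stat -e cycles -a sleep 2<br />
[...]<br />
</pre><br />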
<br />
----<br />
<br />
This guide is adapted from an earlier tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ CreativeCommons attribution sharealike 3.0 license].</div>Willemb