<big>'''Linux kernel profiling with <tt>perf</tt>'''</big><br />
<br />
__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple commandline interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on an Ubuntu 11.04<br />
system with kernel 2.6.38-8-generic, running on an HP 6710b with a dual-core Intel Core2 T7100 CPU.<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage <tt>build-id</tt> cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some events are pure kernel counters; in this case they are<br />
called '''software events'''. Examples include context-switches and minor-faults.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events to measure micro-architectural events<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
event monikers. On each processor, those events get mapped<br />
onto actual events provided by the CPU, if they exist; otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides some short description of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and presented on standard<br />
output at the end<br />
of an application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle).<br />
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
==== Modifiers ====<br />
<br />
Events can optionally have a modifier by appending a colon and one or more modifiers.<br />
Modifiers allow the user to restrict when events are counted.<br />
<br />
To measure a PMU event and pass modifiers:<br />
<pre><br />
perf stat -e instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
In this example, we are measuring the number of instructions at the user level.<br />
Note that for actual events, the modifiers depend on the underlying PMU model.<br />
All modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{| border="1"<br />
! Modifiers<br />
! Description<br />
! Example<br />
|- <br />
|u || monitor at priv level 3, 2, 1 (user)|| event:u<br />
|- <br />
|k || monitor at priv level 0 (kernel) || event:k<br />
|- <br />
|h || monitor hypervisor events on a virtualization environment || event:h<br />
|-<br />
|H || monitor host machine on a virtualization environment || event:H<br />
|- <br />
|G || monitor guest machine on a virtualization environment || event:G<br />
|}<br />
<br />
All modifiers above are treated as booleans (flags).<br />
<br />
==== Hardware events ====<br />
<br />
To measure an actual PMU event as documented by the hardware vendor, pass its hexadecimal event code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
<br />
==== multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit on the number of events that can be provided. If there are more<br />
events than there are actual hardware counters, the kernel will automatically multiplex them. There<br />
is no limit on the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
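<br />
For instance, software and hardware events can be mixed in a single run. A minimal sketch, reusing the <tt>dd</tt> workload from earlier (the exact event list is just an illustration):<br />
<pre><br />
perf stat -e cycles,instructions,faults,context-switches dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />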
<br />
However, given that there is one file descriptor used per event and either per-thread (per-thread mode)<br />
or per-cpu (system-wide), it is possible to reach the maximum number of open file descriptors per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
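<br />
As a purely hypothetical illustration of the formula, suppose an event counted 1,000,000 occurrences while it was scheduled on the PMU for only a quarter of the run:<br />
<pre><br />
# hypothetical numbers:<br />
#   raw_count    = 1,000,000<br />
#   time_enabled = 1.00 s, time_running = 0.25 s<br />
#   final_count  = 1,000,000 * 1.00/0.25 = 4,000,000  (an estimate, not an actual count)<br />
</pre><br />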
<br />
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{| border="1"<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|-<br />
|Intel Core || 2 || 3<br />
|- <br />
|Intel Nehalem|| 4 || 3<br />
|}<br />
<br />
Generic counters can measure any events. Fixed counters can only measure one event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
There was multiplexing and thus scaling.<br />
It can be interesting to try and pack events in a way that<br />
guarantees that event A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the current <tt>perf</tt> tool does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and get, for each count,<br />
the mean and the ratio of standard deviation to mean.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with the ratio of std-dev to mean, is printed.<br />
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In ''per-thread'' mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrates from one<br />
processor to another, counters are saved on the current processor and are restored<br />
on the new one.<br />
<br />
The ''per-process'' mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In ''per-cpu'' mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event only monitors one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
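<br />
A minimal sketch of disabling inheritance, so that only the <tt>dd</tt> process itself is counted and not any children it might spawn:<br />
<pre><br />
perf stat -i -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />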
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
that runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
<tt>/usr/bin/sleep</tt> command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUs using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, but we can limit to CPU 1.<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the number of counted<br />
cycles and instructions are both halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires the permission<br />
to attach along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be<br />
passed the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
It is possible to attach to a specific thread within a process. By thread, we mean a kernel-visible thread.<br />
In other words, a thread visible by the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, the thread 932 did not run during the 2s of the measurement. Otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended. Given that kernel threads tend<br />
to be pinned to a specific CPU, it is best to use the cpu-wide mode.<br />
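<br />
For instance, if a kernel thread of interest is known to be pinned to CPU 1 (an assumption for this sketch), a cpu-wide measurement restricted to that CPU can be used instead of attaching to it:<br />
<pre><br />
perf stat -e cycles -a -C 1 sleep 5<br />
</pre><br />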
<br />
<br />
=== Options controlling output ===<br />
<tt>perf stat</tt> can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
For most people, it is hard to read large numbers. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that, the <tt>-B</tt><br />
option must be passed and the correct locale for <tt>LC_NUMERIC</tt> must be set. As the above example showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes it easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on a per-thread, per-process and per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly on another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
The way perf_events emulates a 64-bit counter limits sampling periods to what can be expressed<br />
in the number of bits of the actual hardware counter. If this is smaller than 64 bits, the kernel '''silently''' truncates<br />
the period. Therefore, it is best if the period is always smaller than 2^31 when running<br />
on 32-bit systems.<br />
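<br />
For example, a fixed sampling period well below 2^31 can be requested with the <tt>-c</tt> option described later in this document (a sketch; <tt>./noploop</tt> is the toy busy-loop program used throughout this tutorial):<br />
<pre><br />
perf record -e cycles -c 100000 ./noploop 1<br />
</pre><br />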
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e., where the program was when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e.,<br />
where it was at the end of the sampling period. In some cases, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. When the program cannot<br />
make forward progress, those two locations are indeed identical. ''For this reason, care must be taken<br />
when interpreting profiles''.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt> but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycle</tt> event does not count when the processor is idle, i.e.,<br />
when it calls <tt>mwait()</tt>.<br />
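<br />
Selecting the event explicitly is equivalent to the default behavior; a minimal sketch:<br />
<pre><br />
# equivalent to 'perf record ./noploop 1', which samples on cycles by default<br />
perf record -e cycles ./noploop 1<br />
</pre><br />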
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/sec (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
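<br />
A minimal sketch of directing the samples to a custom file and then reading them back (the file name is arbitrary):<br />
<pre><br />
perf record -o noploop.perf.data ./noploop 1<br />
perf report -i noploop.perf.data<br />
</pre><br />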
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period, instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>instructions</tt>, at the user level<br />
only:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are collected for all threads executing on the monitored<br />
CPUs. To switch <tt>perf record</tt> to per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict to a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUs with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... .............................. .....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e., where the program was running when it was interrupted:<br />
<br />
* [.] : user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest OS user space<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character:<br />
<pre><br />
perf report -t<br />
</pre><br />
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When the symbol is printed as a hexadecimal address, this is because the ELF image does not<br />
have a symbol table. This happens when binaries are stripped.<br />
We can sort by cpu as well. This could be useful to determine if the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples for function <tt>noploop()</tt> captured at that instruction.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate source-code-level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
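<br />
A possible sequence to reproduce output like the above, assuming the source file is named <tt>noploop.c</tt> (an assumption; the key point is compiling with <tt>-ggdb</tt>):<br />
<pre><br />
gcc -ggdb -o noploop noploop.c   # keep debug information so perf annotate can show source lines<br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
</pre><br />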
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
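<br />
For instance, a sketch of restricting live profiling to CPU 0, assuming <tt>perf top</tt> accepts the same <tt>-C</tt> CPU list syntax as the other commands:<br />
<pre><br />
perf top -C 0<br />
</pre><br />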
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 260 irqs/sec kernel:61.5% exact: 0.0% [1000Hz<br />
cycles], (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycle</tt> event counts CPU cycles when the<br />
processor is not in halted state, i.e. not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
<br />
== Benchmarking with <tt>perf bench</tt> ==<br />
<br />
The <tt>perf bench</tt> command includes a number of multi-threaded microbenchmarks<br />
to exercise different subsystems in the Linux kernel and system calls. This allows<br />
hackers to easily stress and measure the impact of changes, and therefore help mitigate<br />
performance regressions.<br />
<br />
It also serves as a general benchmark framework, enabling developers to easily create<br />
test cases and transparently integrate and make use of the rich perf tool subsystem.<br />
<br />
=== sched: Scheduler benchmarks ===<br />
<br />
Measures <tt>pipe(2)</tt> and <tt>socketpair(2)</tt> operations between multiple tasks.<br />
Allows the measurement of thread versus process context switch performance.<br />
<br />
<pre><br />
$ perf bench sched messaging -g 64<br />
# Running 'sched/messaging' benchmark:<br />
# 20 sender and receiver processes per group<br />
# 64 groups == 2560 processes run<br />
<br />
Total time: 1.549 [sec]<br />
</pre><br />
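<br />
A pipe-based variant that measures context switches between a single pair of tasks is also available; a minimal sketch:<br />
<pre><br />
$ perf bench sched pipe<br />
</pre><br />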
<br />
=== mem: Memory access benchmarks ===<br />
<br />
=== numa: NUMA scheduling and MM benchmarks ===<br />
<br />
=== futex: Futex stressing benchmarks ===<br />
<br />
Deals with finer grained aspects of the kernel's implementation of futexes. It is mostly <br />
useful for kernel hacking. It currently supports wakeup and requeue/wait operations, as <br />
well as stressing the hashing scheme for both private and shared futexes. An example run<br />
with one thread per CPU, each handling 1024 futexes, measuring the hashing logic:<br />
<br />
<pre><br />
$ perf bench futex hash<br />
# Running 'futex/hash' benchmark:<br />
Run summary [PID 17428]: 4 threads, each operating on 1024 [private] futexes for 10 secs.<br />
<br />
[thread 0] futexes: 0x2775700 ... 0x27766fc [ 3343462 ops/sec ]<br />
[thread 1] futexes: 0x2776920 ... 0x277791c [ 3679539 ops/sec ]<br />
[thread 2] futexes: 0x2777ab0 ... 0x2778aac [ 3589836 ops/sec ]<br />
[thread 3] futexes: 0x2778c40 ... 0x2779c3c [ 3563827 ops/sec ]<br />
<br />
Averaged 3544166 operations/sec (+- 2.01%), total secs = 10<br />
</pre><br />
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The design of the perf_events kernel interface, which is used by the perf tool, is such that it uses one file descriptor<br />
per event per-thread or per-cpu.<br />
<br />
On a 16-way system, when you do:<br />
<pre><br />
perf stat -e cycles sleep 1<br />
</pre><br />
You are effectively creating 16 events, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
<br />
In case perf aborts with 'too many open files' error, there are a few solutions:<br />
<br />
* increase the number of per-process open files using ulimit -n. Caveat: you must be root<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring (see the sketch after this list)<br />
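<br />
A sketch of the last option: restricting a system-wide count to a couple of CPUs so that fewer per-cpu events, and therefore fewer file descriptors, are created:<br />
<pre><br />
perf stat -e cycles -a -C 0,1 sleep 1<br />
</pre><br />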
<br />
==== increasing open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> file unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes all processes running on the system. Those unique identifiers are generated by the linker if<br />
the <tt>-Wl,--build-id</tt> option is used. Thus, they are called <tt>build-id</tt>.<br />
The <tt>build-id</tt> is a helpful tool when correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
<br />
Given that <tt>build-id</tt>s are immutable, they uniquely identify a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory which is by default <tt>$HOME/.debug</tt>. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which can be used by sysadmins to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
<br />
In certain situations it may be beneficial to turn off the <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt>:<br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
<br />
== Other Scenarios ==<br />
<br />
===Profiling sleep times===<br />
<br />
This feature shows where and for how long a program is sleeping or waiting for something.<br />
<br />
The first step is collecting data. We need to collect sched_stat and sched_switch events. Sched_stat events are not enough, because they are generated in the context of the task which wakes up the target task (e.g. releases a lock). We need the same event but with a call-chain of the target task. This call-chain can be extracted from a previous sched_switch event.<br />
<br />
The second step is merging the sched_stat and sched_switch events. This can be done with the help of "perf inject -s".<br />
<br />
$ ./perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ~/foo<br />
$ ./perf inject -v -s -i ~/perf.data.raw -o ~/perf.data<br />
$ ./perf report --stdio --show-total-period -i ~/perf.data<br />
# Overhead Period Command Shared Object Symbol<br />
# ........ ............ ....... ................. ..............<br />
#<br />
100.00% 502408738 foo [kernel.kallsyms] [k] __schedule<br />
|<br />
--- __schedule<br />
schedule<br />
| <br />
|--79.85%-- schedule_hrtimeout_range_clock<br />
| schedule_hrtimeout_range<br />
| poll_schedule_timeout<br />
| do_select<br />
| core_sys_select<br />
| sys_select<br />
| system_call_fastpath<br />
| __select<br />
| __libc_start_main<br />
| <br />
--20.15%-- do_nanosleep<br />
hrtimer_nanosleep<br />
sys_nanosleep<br />
system_call_fastpath<br />
__GI___libc_nanosleep<br />
__libc_start_main<br />
<br />
$cat foo.c<br />
...<br />
for (i = 0; i < 10; i++) {<br />
ts1.tv_sec = 0;<br />
ts1.tv_nsec = 10000000;<br />
nanosleep(&ts1, NULL);<br />
<br />
tv1.tv_sec = 0;<br />
tv1.tv_usec = 40000;<br />
select(0, NULL, NULL, NULL,&tv1);<br />
}<br />
...<br />
<br />
== Other Resources ==<br />
<br />
=== Linux sourcecode ===<br />
The <tt>perf tools</tt> source code lives in the Linux kernel tree under [http://lxr.linux.no/linux+v2.6.39/tools/perf/ <tt>tools/perf</tt>]. You will find much more documentation in [http://lxr.linux.no/linux+v2.6.39/tools/perf/Documentation/ <tt>tools/perf/Documentation</tt>]. To build manpages, info pages and more, install these tools:<br />
<br />
* asciidoc<br />
* tetex-fonts<br />
* tetex-dvips<br />
* dialog<br />
* tetex<br />
* tetex-latex<br />
* xmltex<br />
* passivetex<br />
* w3m<br />
* xmlto<br />
<br />
and issue a <tt>make install-man</tt> from <tt>/tools/perf</tt>. This step is also required to <br />
be able to run <tt>perf help <command></tt>.<br />
<br />
----<br />
<br />
This guide is adapted from a tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ CreativeCommons attribution sharealike 3.0 license].</div>Davidlohr Buesohttps://perf.wiki.kernel.org/index.php/TutorialTutorial2014-05-30T20:51:39Z<p>Davidlohr Bueso: add initial perf-bench data</p>
<hr />
<div><big>'''Linux kernel profiling with <tt>perf</tt>'''</big><br />
<br />
__TOC__<br />
<br />
== Introduction ==<br />
<br />
Perf is a profiler tool for Linux 2.6+ based systems that abstracts away CPU hardware differences<br />
in Linux performance measurements and presents a simple commandline interface.<br />
Perf is based on the <tt>perf_events</tt> interface exported by recent versions of the Linux kernel. This article<br />
demonstrates the <tt>perf</tt> tool through example runs. Output was obtained on a Ubuntu 11.04<br />
system with<br />
kernel 2.6.38-8-generic results running on an HP 6710b with dual-core Intel Core2 T7100 CPU).<br />
For readability, some output is abbreviated using ellipsis (<tt>[...]</tt>).<br />
<br />
=== Commands ===<br />
<br />
The perf tool offers a rich set of commands to collect and analyze performance and trace data. The command line<br />
usage is reminiscent of <tt>git</tt> in that there is a generic tool, <tt>perf</tt>, which implements a set of commands:<br />
<tt>stat</tt>, <tt>record</tt>, <tt>report</tt>, [...]<br />
<br />
The list of supported commands:<br />
<pre><br />
perf<br />
<br />
usage: perf [--version] [--help] COMMAND [ARGS]<br />
<br />
The most commonly used perf commands are:<br />
annotate Read perf.data (created by perf record) and display annotated code<br />
archive Create archive with object files with build-ids found in perf.data file<br />
bench General framework for benchmark suites<br />
buildid-cache Manage <tt>build-id</tt> cache.<br />
buildid-list List the buildids in a perf.data file<br />
diff Read two perf.data files and display the differential profile<br />
inject Filter to augment the events stream with additional information<br />
kmem Tool to trace/measure kernel memory(slab) properties<br />
kvm Tool to trace/measure kvm guest os<br />
list List all symbolic event types<br />
lock Analyze lock events<br />
probe Define new dynamic tracepoints<br />
record Run a command and record its profile into perf.data<br />
report Read perf.data (created by perf record) and display the profile<br />
sched Tool to trace/measure scheduler properties (latencies)<br />
script Read perf.data (created by perf record) and display trace output<br />
stat Run a command and gather performance counter statistics<br />
test Runs sanity tests.<br />
timechart Tool to visualize total system behavior during a workload<br />
top System profiling tool.<br />
<br />
See 'perf help COMMAND' for more information on a specific command.<br />
</pre><br />
<br />
Certain commands require special support in the kernel and may not be<br />
available.<br />
To obtain the list of options for each command, simply type the command name followed by <tt>-h</tt>:<br />
<pre><br />
perf stat -h<br />
<br />
usage: perf stat [<options>] [<command>]<br />
<br />
-e, --event <event> event selector. use 'perf list' to list available events<br />
-i, --no-inherit child tasks do not inherit counters<br />
-p, --pid <n> stat events on existing process id<br />
-t, --tid <n> stat events on existing thread id<br />
-a, --all-cpus system-wide collection from all CPUs<br />
-c, --scale scale/normalize counters<br />
-v, --verbose be more verbose (show counter open errors, etc)<br />
-r, --repeat <n> repeat command and print average + stddev (max: 100)<br />
-n, --null null run - dont start any counters<br />
-B, --big-num print large numbers with thousands' separators<br />
</pre><br />
<br />
=== Events ===<br />
<br />
The <tt>perf</tt> tool supports a list of measurable events. The tool<br />
and underlying kernel interface can measure events coming from different<br />
sources. For instance, some events are pure kernel counters; in this case they are<br />
called '''software events'''. Examples include: context-switches, minor-faults.<br />
<br />
Another source of events is the processor itself and its Performance Monitoring<br />
Unit (PMU). It provides a list of events to measure micro-architectural events<br />
such as the number of cycles, instructions retired, L1 cache misses and so on.<br />
Those events are called '''PMU hardware events''' or '''hardware events''' for short.<br />
They vary with each processor type and model.<br />
<br />
The perf_events interface also provides a small set of common hardware<br />
event monikers. On each processor, those events get mapped<br />
onto actual events provided by the CPU, if they exist; otherwise the event<br />
cannot be used. Somewhat confusingly, these are also called '''hardware events'''<br />
and '''hardware cache events'''.<br />
<br />
Finally, there are also '''tracepoint events''' which are implemented by the kernel <tt>ftrace</tt><br />
infrastructure. Those are '''only''' available with the 2.6.3x and newer kernels.<br />
<br />
To obtain a list of supported events:<br />
<pre><br />
perf list<br />
<br />
List of pre-defined events (to be used in -e):<br />
<br />
cpu-cycles OR cycles [Hardware event]<br />
instructions [Hardware event]<br />
cache-references [Hardware event]<br />
cache-misses [Hardware event]<br />
branch-instructions OR branches [Hardware event]<br />
branch-misses [Hardware event]<br />
bus-cycles [Hardware event]<br />
<br />
cpu-clock [Software event]<br />
task-clock [Software event]<br />
page-faults OR faults [Software event]<br />
minor-faults [Software event]<br />
major-faults [Software event]<br />
context-switches OR cs [Software event]<br />
cpu-migrations OR migrations [Software event]<br />
alignment-faults [Software event]<br />
emulation-faults [Software event]<br />
<br />
L1-dcache-loads [Hardware cache event]<br />
L1-dcache-load-misses [Hardware cache event]<br />
L1-dcache-stores [Hardware cache event]<br />
L1-dcache-store-misses [Hardware cache event]<br />
L1-dcache-prefetches [Hardware cache event]<br />
L1-dcache-prefetch-misses [Hardware cache event]<br />
L1-icache-loads [Hardware cache event]<br />
L1-icache-load-misses [Hardware cache event]<br />
L1-icache-prefetches [Hardware cache event]<br />
L1-icache-prefetch-misses [Hardware cache event]<br />
LLC-loads [Hardware cache event]<br />
LLC-load-misses [Hardware cache event]<br />
LLC-stores [Hardware cache event]<br />
LLC-store-misses [Hardware cache event]<br />
<br />
LLC-prefetch-misses [Hardware cache event]<br />
dTLB-loads [Hardware cache event]<br />
dTLB-load-misses [Hardware cache event]<br />
dTLB-stores [Hardware cache event]<br />
dTLB-store-misses [Hardware cache event]<br />
dTLB-prefetches [Hardware cache event]<br />
dTLB-prefetch-misses [Hardware cache event]<br />
iTLB-loads [Hardware cache event]<br />
iTLB-load-misses [Hardware cache event]<br />
branch-loads [Hardware cache event]<br />
branch-load-misses [Hardware cache event]<br />
<br />
rNNN (see 'perf list --help' on how to encode it) [Raw hardware event descriptor]<br />
<br />
mem:<addr>[:access] [Hardware breakpoint]<br />
<br />
kvmmmu:kvm_mmu_pagetable_walk [Tracepoint event]<br />
<br />
[...]<br />
<br />
sched:sched_stat_runtime [Tracepoint event]<br />
sched:sched_pi_setprio [Tracepoint event]<br />
syscalls:sys_enter_socket [Tracepoint event]<br />
syscalls:sys_exit_socket [Tracepoint event]<br />
<br />
[...]<br />
<br />
</pre><br />
<br />
An event can have sub-events (or unit masks). On some processors and for some events,<br />
it may be possible to combine unit masks and measure when either sub-event occurs.<br />
Finally, an event can have modifiers, i.e., filters which alter when or how the event is<br />
counted.<br />
<br />
==== Hardware events ====<br />
<br />
PMU hardware events are CPU specific and documented by the CPU vendor. The <tt>perf</tt> tool, if linked against the <tt>libpfm4</tt><br />
library, provides some short description of the events. For a listing of PMU hardware events for Intel and AMD<br />
processors, see<br />
<br />
* Intel PMU event tables: Appendix A of manual [http://www.intel.com/Assets/PDF/manual/253669.pdf here]<br />
* AMD PMU event table: section 3.14 of manual [http://support.amd.com/us/Processor_TechDocs/31116.pdf here]<br />
<br />
== Counting with <tt>perf stat</tt> ==<br />
For any of the supported events, perf can keep a running count during process execution.<br />
In counting modes, the occurrences of events are simply aggregated and printed at the end<br />
of the application run.<br />
To generate these statistics, use the <tt>stat</tt> command of <tt>perf</tt>. For instance:<br />
<pre><br />
perf stat -B dd if=/dev/zero of=/dev/null count=1000000<br />
<br />
1000000+0 records in<br />
1000000+0 records out<br />
512000000 bytes (512 MB) copied, 0.956217 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':<br />
<br />
5,099 cache-misses # 0.005 M/sec (scaled from 66.58%)<br />
235,384 cache-references # 0.246 M/sec (scaled from 66.56%)<br />
9,281,660 branch-misses # 3.858 % (scaled from 33.50%)<br />
240,609,766 branches # 251.559 M/sec (scaled from 33.66%)<br />
1,403,561,257 instructions # 0.679 IPC (scaled from 50.23%)<br />
2,066,201,729 cycles # 2160.227 M/sec (scaled from 66.67%)<br />
217 page-faults # 0.000 M/sec<br />
3 CPU-migrations # 0.000 M/sec<br />
83 context-switches # 0.000 M/sec<br />
956.474238 task-clock-msecs # 0.999 CPUs<br />
<br />
0.957617512 seconds time elapsed<br />
<br />
</pre><br />
With no events specified, <tt>perf stat</tt> collects the common events listed above. Some are software<br />
events, such as <tt>context-switches</tt>, others are generic hardware events, such as <tt>cycles</tt>.<br />
After the hash sign, derived metrics may be presented, such as 'IPC' (instructions per cycle). In the run above,<br />
the 0.679 IPC is simply 1,403,561,257 instructions divided by 2,066,201,729 cycles.<br />
<br />
=== Options controlling event selection ===<br />
<br />
It is possible to measure one or more events per run of the <tt>perf</tt> tool. Events are designated<br />
using their symbolic names followed by optional unit masks and modifiers. Event names, unit masks,<br />
and modifiers are case insensitive.<br />
<br />
By default, events are measured at '''both''' user and kernel levels:<br />
<pre><br />
perf stat -e cycles dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure only at the user level, it is necessary to pass a modifier:<br />
<pre><br />
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
To measure both user and kernel (explicitly):<br />
<pre><br />
perf stat -e cycles:uk dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
==== Modifiers ====<br />
<br />
Events can optionally have a modifier by appending a colon and one or more modifiers.<br />
Modifiers allow the user to restrict when events are counted.<br />
<br />
To measure a PMU event and pass modifiers:<br />
<pre><br />
perf stat -e instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
In this example, we are measuring the number of instructions at the user level.<br />
Note that for actual hardware events, the available modifiers depend on the underlying PMU model.<br />
Modifiers can be combined at will.<br />
Here is a simple table to summarize the most common modifiers for Intel and<br />
AMD x86 processors.<br />
<br />
{| border="1"<br />
! Modifiers<br />
! Description<br />
! Example<br />
|- <br />
|u || monitor at priv level 3, 2, 1 (user)|| event:u<br />
|- <br />
|k || monitor at priv level 0 (kernel) || event:k<br />
|- <br />
|h || monitor hypervisor events on a virtualization environment || event:h<br />
|-<br />
|H || monitor host machine on a virtualization environment || event:H<br />
|- <br />
|G || monitor guest machine on a virtualization environment || event:G<br />
|}<br />
<br />
All the modifiers above are boolean flags.<br />
<br />
==== Hardware events ====<br />
<br />
To measure an actual PMU event as listed in the HW vendor documentation, pass its hexadecimal event code:<br />
<pre><br />
perf stat -e r1a8 -a sleep 1<br />
<br />
Performance counter stats for 'sleep 1':<br />
<br />
210,140 raw 0x1a8<br />
1.001213705 seconds time elapsed<br />
</pre><br />
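<br />
On Intel x86 processors, the raw code follows the layout of the PMU event-select register: the low byte is the event select and the next byte is the unit mask, so the <tt>0x1a8</tt> above encodes event <tt>0xa8</tt> with unit mask <tt>0x01</tt>. As a hedged illustration (the mapping is CPU specific; check the vendor manuals listed earlier), unhalted core cycles on most Intel processors is event <tt>0x3c</tt> with unit mask <tt>0x00</tt>:<br />
<pre><br />
perf stat -e r3c -a sleep 1<br />
</pre><br />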
<br />
==== multiple events ====<br />
<br />
To measure more than one event, simply provide a comma-separated list with no space:<br />
<pre><br />
perf stat -e cycles,instructions,cache-misses [...]<br />
</pre><br />
<br />
There is no theoretical limit on the number of events that can be provided. If there are more<br />
events than there are actual hardware counters, the kernel will automatically multiplex them. There<br />
is no limit on the number of software events. It is possible to simultaneously measure<br />
events coming from different sources.<br />
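<br />
For instance, hardware, software and tracepoint events can all be mixed in a single run (a hedged illustration; measuring the tracepoint typically requires root privileges and a kernel with tracepoints enabled):<br />
<pre><br />
perf stat -e cycles,faults,context-switches,sched:sched_switch dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />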
<br />
However, given that one file descriptor is used per event, either per thread (per-thread mode)<br />
or per CPU (system-wide mode), it is possible to reach the maximum number of open file descriptors per process<br />
as imposed by the kernel. In that case, perf will report an error. See the troubleshooting section for<br />
help with this matter.<br />
<br />
==== multiplexing and scaling events ====<br />
<br />
If there are more events than counters, the kernel uses time multiplexing (switch frequency = <tt>HZ</tt>, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies<br />
to PMU events.<br />
With multiplexing, an event is '''not''' measured all the time. At the end of the run, the tool '''scales'''<br />
the count based on total time enabled vs time running. The actual formula is:<br />
<br />
<tt>final_count = raw_count * time_enabled/time_running</tt><br />
<br />
This provides an '''estimate''' of what the count would have been, had the event been measured during the<br />
entire run. It is '''very''' important to understand this is an '''estimate''' not an actual count.<br />
Depending on the workload, there will be blind spots which can introduce errors during<br />
scaling.<br />
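<br />
For illustration, with made-up numbers: an event that was actually running for only 250ms out of a 1000ms enabled window, and accumulated a raw count of 1,000,000, would be reported as 1,000,000 * 1000/250 = 4,000,000.<br />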
<br />
Events are currently managed in round-robin fashion. Therefore each event will eventually get a chance<br />
to run. If there are N counters, then up to the first N events on the round-robin list are programmed into<br />
the PMU. In certain situations it may be less than that because some events may not be measured together<br />
or they compete for the same counter.<br />
Furthermore, the perf_events interface allows multiple tools to measure the same thread or CPU at the<br />
same time. Each event is added to the same round-robin list. There is no guarantee that all events of<br />
a tool are stored sequentially in the list.<br />
<br />
To avoid scaling (in the presence of only one active perf_event user), one can try and reduce the number of<br />
events. The following table provides the number of counters for a few common processors:<br />
<br />
{| border="1"<br />
!Processor<br />
!Generic counters<br />
!Fixed counters<br />
|-<br />
|Intel Core || 2 || 3<br />
|- <br />
|Intel Nehalem|| 4 || 3<br />
|}<br />
<br />
Generic counters can measure any event. Fixed counters can each measure only one specific event. Some counters<br />
may be reserved for special purposes, such as a watchdog timer.<br />
<br />
The following examples show the effect of scaling:<br />
<pre><br />
perf stat -B -e cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,812,305,464 cycles<br />
2,812,304,340 cycles<br />
<br />
1.302481065 seconds time elapsed<br />
<br />
</pre><br />
<br />
Here, there is no multiplexing and thus no scaling. Let's add one more event:<br />
<pre><br />
perf stat -B -e cycles,cycles,cycles ./noploop 1<br />
<br />
Performance counter stats for './noploop 1':<br />
<br />
2,809,725,593 cycles (scaled from 74.98%)<br />
2,810,797,044 cycles (scaled from 74.97%)<br />
2,809,315,647 cycles (scaled from 75.09%)<br />
<br />
1.295007067 seconds time elapsed<br />
<br />
</pre><br />
There was multiplexing and thus scaling.<br />
It can be interesting to try and pack events in a way that<br />
guarantees that events A and B are always measured together. Although the perf_events kernel interface<br />
provides support for event grouping, the version of the <tt>perf</tt> tool described here does '''not'''.<br />
<br />
==== Repeated measurement ====<br />
<br />
It is possible to use <tt>perf stat</tt> to run the same test workload multiple times and get, for each count,<br />
the standard deviation from the mean.<br />
<br />
<pre><br />
perf stat -r 5 sleep 1<br />
<br />
Performance counter stats for 'sleep 1' (5 runs):<br />
<br />
<not counted> cache-misses<br />
20,676 cache-references # 13.046 M/sec ( +- 0.658% )<br />
6,229 branch-misses # 0.000 % ( +- 40.825% )<br />
<not counted> branches<br />
<not counted> instructions<br />
<not counted> cycles<br />
144 page-faults # 0.091 M/sec ( +- 0.139% )<br />
0 CPU-migrations # 0.000 M/sec ( +- -nan% )<br />
1 context-switches # 0.001 M/sec ( +- 0.000% )<br />
1.584872 task-clock-msecs # 0.002 CPUs ( +- 12.480% )<br />
<br />
1.002251432 seconds time elapsed ( +- 0.025% )<br />
<br />
</pre><br />
Here, <tt>sleep</tt> is run 5 times and the mean count for each event, along<br />
with ratio of std-dev/mean is printed.<br />
<br />
=== Options controlling environment selection ===<br />
<br />
The <tt>perf</tt> tool can be used to count events on a per-thread, per-process, per-cpu<br />
or system-wide basis.<br />
In ''per-thread'' mode, the counter only monitors the execution of a designated thread.<br />
When the thread is scheduled out, monitoring stops. When a thread migrates from one<br />
processor to another, counters are saved on the current processor and are restored<br />
on the new one.<br />
<br />
The ''per-process'' mode is a variant of per-thread where '''all''' threads of the process<br />
are monitored. Counts and samples are aggregated at the process level.<br />
The perf_events interface allows for automatic inheritance on <tt>fork()</tt> and <tt>pthread_create()</tt>.<br />
By default, the perf tool '''activates''' inheritance.<br />
<br />
In ''per-cpu'' mode, all threads running on the designated processors are monitored. Counts and<br />
samples are thus aggregated per CPU. An event only monitors one CPU at a time. To monitor<br />
across multiple processors, it is necessary to create multiple events. The perf tool can aggregate<br />
counts and samples across multiple processors. It can also monitor only a subset of the processors.<br />
<br />
==== Counting and inheritance ====<br />
<br />
By default, <tt>perf stat</tt> counts for all threads of the process and subsequent child processes and<br />
threads. This can be altered using the <tt>-i</tt> option. It is not possible to obtain a count breakdown per-thread or per-process.<br />
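<br />
For instance, to count only a top-level command itself and none of the children it spawns (a hedged illustration using <tt>make</tt>, which forks compiler processes that would otherwise be included):<br />
<pre><br />
perf stat -i -e cycles,faults make<br />
</pre><br />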
<br />
==== Processor-wide mode ====<br />
<br />
By default, <tt>perf stat</tt> counts in per-thread mode. To count on a per-cpu basis pass<br />
the <tt>-a</tt> option. When it is specified by itself, all online processors are monitored and counts are<br />
aggregated. For instance:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a dd if=/dev/zero of=/dev/null count=2000000<br />
<br />
2000000+0 records in<br />
2000000+0 records out<br />
1024000000 bytes (1.0 GB) copied, 1.91559 s, 535 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=2000000':<br />
<br />
1,993,541,603 cycles<br />
764,086,803 instructions # 0.383 IPC<br />
<br />
1.916930613 seconds time elapsed<br />
</pre><br />
This measurement collects events <tt>cycles</tt> and <tt>instructions</tt> across all CPUs.<br />
The duration of the measurement is determined by the execution of <tt>dd</tt>.<br />
In other words, this measurement captures execution of the <tt>dd</tt> process '''and''' anything else<br />
that runs at the user level on all CPUs.<br />
<br />
To time the duration of the measurement without actively consuming cycles, it is possible to use the<br />
<tt>/usr/bin/sleep</tt> command:<br />
<pre><br />
perf stat -B -ecycles:u,instructions:u -a sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
766,271,289 cycles<br />
596,796,091 instructions # 0.779 IPC<br />
<br />
5.001191353 seconds time elapsed<br />
<br />
</pre><br />
<br />
It is possible to restrict monitoring to a subset of the CPUs using the <tt>-C</tt> option. A list of CPUs<br />
to monitor can be passed. For instance, to measure on CPU0, CPU2 and CPU3:<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 0,2-3 sleep 5<br />
</pre><br />
The demonstration machine has only two CPUs, but we can limit to CPU 1.<br />
<pre><br />
perf stat -B -e cycles:u,instructions:u -a -C 1 sleep 5<br />
<br />
Performance counter stats for 'sleep 5':<br />
<br />
301,141,166 cycles<br />
225,595,284 instructions # 0.749 IPC<br />
<br />
5.002125198 seconds time elapsed<br />
<br />
</pre><br />
Counts are aggregated across all the monitored CPUs. Notice how the counted<br />
cycles and instructions are both roughly halved when measuring a single CPU.<br />
<br />
==== Attaching to a running process ====<br />
<br />
It is possible to use perf to attach to an already running thread or process. This requires permission<br />
to attach, along with the thread or process ID. To attach to a process, the <tt>-p</tt> option must be<br />
passed the process ID. To attach to the sshd service that is commonly running on many Linux machines, issue:<br />
<pre><br />
ps ax | fgrep sshd<br />
<br />
2262 ? Ss 0:00 /usr/sbin/sshd -D<br />
2787 pts/0 S+ 0:00 fgrep --color=auto sshd<br />
<br />
perf stat -e cycles -p 2262 sleep 2<br />
<br />
Performance counter stats for process id '2262':<br />
<br />
<not counted> cycles<br />
<br />
2.001263149 seconds time elapsed<br />
<br />
</pre><br />
What determines the duration of the measurement is the command to execute. Even though we are<br />
attaching to a process, we can still pass the name of a command. It is used to time the measurement.<br />
Without it, <tt>perf</tt> monitors until it is killed.<br />
Also note that when attaching to a process, all threads of the process are monitored. Furthermore,<br />
given that inheritance is on by default, child processes or threads will also be monitored. To turn<br />
this off, you must use the <tt>-i</tt> option.<br />
It is possible to attach to a specific thread within a process. By thread, we mean a kernel-visible thread,<br />
in other words, a thread visible to the <tt>ps</tt> or <tt>top</tt> commands. To attach to a thread, the <tt>-t</tt><br />
option must be used. We look at <tt>rsyslogd</tt>, because it always runs on Ubuntu 11.04, with<br />
multiple threads.<br />
<br />
<pre><br />
ps -L ax | fgrep rsyslogd | head -5<br />
<br />
889 889 ? Sl 0:00 rsyslogd -c4<br />
889 932 ? Sl 0:00 rsyslogd -c4<br />
889 933 ? Sl 0:00 rsyslogd -c4<br />
2796 2796 pts/0 S+ 0:00 fgrep --color=auto rsyslogd<br />
<br />
perf stat -e cycles -t 932 sleep 2<br />
<br />
Performance counter stats for thread id '932':<br />
<br />
<not counted> cycles<br />
<br />
2.001037289 seconds time elapsed<br />
<br />
</pre><br />
In this example, the thread 932 did not run during the 2s of the measurement. Otherwise, we would<br />
see a count value. Attaching to kernel threads is possible, though not really recommended. Given that kernel threads tend<br />
to be pinned to a specific CPU, it is best to use the cpu-wide mode.<br />
<br />
<br />
=== Options controlling output ===<br />
<tt>perf stat</tt> can modify output to suit different needs.<br />
<br />
==== Pretty printing large numbers ====<br />
<br />
For most people, large numbers are hard to read. With <tt>perf stat</tt>, it is possible to print<br />
large numbers using the comma separator for thousands (US-style). For that, the <tt>-B</tt><br />
option must be passed and the correct locale for <tt>LC_NUMERIC</tt> must be set. As the above examples showed, Ubuntu<br />
already sets the locale information correctly. An explicit call looks as follows:<br />
<br />
<pre><br />
LC_NUMERIC=en_US.UTF8 perf stat -B -e cycles:u,instructions:u dd if=/dev/zero of=/dev/null count=100000<br />
<br />
100000+0 records in<br />
100000+0 records out<br />
51200000 bytes (51 MB) copied, 0.0971547 s, 527 MB/s<br />
<br />
Performance counter stats for 'dd if=/dev/zero of=/dev/null count=100000':<br />
<br />
96,551,461 cycles<br />
38,176,009 instructions # 0.395 IPC<br />
<br />
0.098556460 seconds time elapsed<br />
<br />
</pre><br />
<br />
==== Machine readable output ====<br />
<tt>perf stat</tt> can also print counts in a format that can easily be imported<br />
into a spreadsheet or parsed by scripts. The <tt>-x</tt> option alters the format of the output and allows users to pass a field<br />
delimiter. This makes it easy to produce CSV-style output:<br />
<pre><br />
perf stat -x, date<br />
<br />
Thu May 26 21:11:07 EDT 2011<br />
884,cache-misses<br />
32559,cache-references<br />
<not counted>,branch-misses<br />
<not counted>,branches<br />
<not counted>,instructions<br />
<not counted>,cycles<br />
188,page-faults<br />
2,CPU-migrations<br />
0,context-switches<br />
2.350642,task-clock-msecs<br />
</pre><br />
<br />
Note that the <tt>-x</tt> option is not compatible with <tt>-B</tt>.<br />
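<br />
Since <tt>perf stat</tt> writes its counts to stderr, the CSV-style output can be redirected into a file for later import into a spreadsheet (a small sketch; the exact set of fields depends on the perf version):<br />
<pre><br />
perf stat -x, -e cycles,instructions dd if=/dev/zero of=/dev/null count=100000 2> counts.csv<br />
cat counts.csv<br />
</pre><br />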
<br />
== Sampling with <tt>perf record</tt> ==<br />
<br />
The <tt>perf</tt> tool can be used to collect profiles on a per-thread, per-process or per-cpu basis.<br />
<br />
There are several commands associated with sampling: <tt>record</tt>, <tt>report</tt>, <tt>annotate</tt>.<br />
You must first collect the samples using <tt>perf record</tt>. This generates an output<br />
file called <tt>perf.data</tt>. That file can then be analyzed, possibly on another machine, using<br />
the <tt>perf report</tt> and <tt>perf annotate</tt> commands. The model is fairly similar to that of<br />
OProfile.<br />
<br />
=== Event-based sampling overview ===<br />
<br />
Perf_events is based on event-based sampling. The period is expressed as the number of occurrences<br />
of an event, not the number of timer ticks.<br />
A sample is recorded when the sampling counter overflows, i.e., wraps from 2^64 back to 0.<br />
No PMU implements 64-bit hardware counters, but perf_events emulates such counters in software.<br />
<br />
However, the sampling period itself is still limited to the width of the actual hardware counter.<br />
If the requested period does not fit in that width, the kernel '''silently''' truncates<br />
it. Therefore, it is best to keep the period smaller than 2^31 when running<br />
on 32-bit systems.<br />
<br />
On counter overflow, the kernel records information, i.e., a sample, about the execution of the<br />
program. What gets recorded depends on the type of measurement. This is all specified by the<br />
user and the tool. But the key information that is common in all samples is the instruction pointer,<br />
i.e. where was the program when it was interrupted.<br />
<br />
Interrupt-based sampling introduces skids on modern processors. That means that the instruction pointer<br />
stored in each sample designates the place where the program was<br />
interrupted to process the PMU interrupt, not the place where the counter actually overflows, i.e.,<br />
where it was at the end of the sampling period. In some cases, the distance between those two points<br />
may be several dozen instructions or more if there were taken branches. When the program cannot<br />
make forward progress, those two locations are indeed identical. ''For this reason, care must be taken<br />
when interpreting profiles''.<br />
<br />
==== Default event: cycle counting ====<br />
<br />
By default, <tt>perf record</tt> uses the <tt>cycles</tt> event as the sampling event.<br />
This is a generic hardware event that is mapped to a hardware-specific<br />
PMU event by the kernel. For Intel, it is mapped to <tt>UNHALTED_CORE_CYCLES</tt>. This event<br />
does not maintain a constant correlation to time in the presence of CPU frequency scaling.<br />
Intel provides another event, called <tt>UNHALTED_REFERENCE_CYCLES</tt> but this event is NOT<br />
currently available with perf_events.<br />
<br />
On AMD systems, the event is mapped to <tt>CPU_CLK_UNHALTED</tt><br />
and this event is also subject to frequency scaling.<br />
On any Intel or AMD processor, the <tt>cycle</tt> event does not count when the processor is idle, i.e.,<br />
when it calls <tt>mwait()</tt>.<br />
<br />
==== Period and rate ====<br />
<br />
The perf_events interface allows two modes to express the sampling period:<br />
<br />
* the number of occurrences of the event (period)<br />
* the average rate of samples/sec (frequency)<br />
<br />
The <tt>perf</tt> tool defaults to the average rate. It is set to 1000Hz, or 1000 samples/sec. That means<br />
that the kernel is dynamically adjusting the sampling period to achieve the target average rate.<br />
The adjustment in period is reported in the raw profile data.<br />
In contrast, with the other mode, the sampling period is set by the user and does not vary<br />
between samples.<br />
There is currently no support for sampling period randomization.<br />
<br />
=== Collecting samples ===<br />
<br />
By default, <tt>perf record</tt> operates in per-thread mode, with inherit mode enabled.<br />
The simplest mode looks as follows, when executing a simple program that busy loops:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.002 MB perf.data (~89 samples) ]<br />
</pre><br />
<br />
The example above collects samples for event <tt>cycles</tt> at an average target rate of 1000Hz.<br />
The resulting samples are saved into the <tt>perf.data</tt> file. If the file already existed, you may be prompted<br />
to pass <tt>-f</tt> to overwrite it. To put the results in a specific file, use the <tt>-o</tt> option.<br />
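<br />
For example, to keep several recordings around, each run can be written to its own file and later examined with <tt>perf report -i</tt> (a simple sketch):<br />
<pre><br />
perf record -o noploop.data ./noploop 1<br />
perf report -i noploop.data<br />
</pre><br />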
<br />
WARNING: The number of reported samples is only an '''estimate'''. It does not<br />
reflect the actual number of samples collected. The estimate is based on<br />
the number of bytes written to the <tt>perf.data</tt> file and the minimal sample size. But<br />
the size of each sample depends on the type of measurement. Some samples are generated<br />
by the counters themselves but others are recorded to support symbol correlation during<br />
post-processing, e.g., <tt>mmap()</tt> information.<br />
<br />
To get an accurate number of samples for the <tt>perf.data</tt> file, it is possible to use the <tt>perf report</tt><br />
command:<br />
<pre><br />
perf record ./noploop 1<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.058 MB perf.data (~2526 samples) ]<br />
perf report -D -i perf.data | fgrep RECORD_SAMPLE | wc -l<br />
<br />
1280<br />
<br />
</pre><br />
<br />
To specify a custom rate, it is necessary to use the <tt>-F</tt> option. For instance,<br />
to sample on event <tt>instructions</tt> only at the user level and<br />
at an average rate of 250 samples/sec:<br />
<pre><br />
perf record -e instructions:u -F 250 ./noploop 4<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.049 MB perf.data (~2160 samples) ]<br />
<br />
</pre><br />
<br />
To specify a sampling period instead, the <tt>-c</tt> option must be used. For instance,<br />
to collect a sample every 2000 occurrences of event <tt>instructions</tt>, only at the user level:<br />
<pre><br />
perf record -e retired_instructions:u -c 2000 ./noploop 4<br />
<br />
[ perf record: Woken up 55 times to write data ]<br />
[ perf record: Captured and wrote 13.514 MB perf.data (~590431 samples) ]<br />
<br />
</pre><br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are collected for all threads executing on the monitored<br />
CPUs. To switch <tt>perf record</tt> to per-cpu mode, the <tt>-a</tt> option must be used. By default<br />
in this mode, '''ALL''' online CPUs are monitored. It is possible to restrict to a subset<br />
of CPUs using the <tt>-C</tt> option, as explained with <tt>perf stat</tt> above.<br />
<br />
To sample on <tt>cycles</tt> at both user and kernel levels for 5s on all CPUS with an average<br />
target rate of 1000 samples/sec:<br />
<pre><br />
perf record -a -F 1000 sleep 5<br />
<br />
[ perf record: Woken up 1 times to write data ]<br />
[ perf record: Captured and wrote 0.523 MB perf.data (~22870 samples) ]<br />
<br />
</pre><br />
<br />
== Sample analysis with <tt>perf report</tt> ==<br />
<br />
Samples collected by <tt>perf record</tt> are saved into a binary file called, by default, <tt>perf.data</tt>.<br />
The <tt>perf report</tt> command reads this file and generates<br />
a concise execution profile. By default, samples are sorted by functions with the most samples first.<br />
It is possible to customize the sorting order and therefore to view the data differently.<br />
<br />
<pre><br />
perf report<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead         Command                  Shared Object  Symbol<br />
# ........  ..............  ..............................  .....................................<br />
#<br />
28.15% firefox-bin libxul.so [.] 0xd10b45<br />
4.45% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.26% swapper [kernel.kallsyms] [k] read_hpet<br />
2.13% firefox-bin firefox-bin [.] 0x1e3d<br />
1.40% unity-panel-ser libglib-2.0.so.0.2800.6 [.] 0x886f1<br />
[...]<br />
</pre><br />
<br />
The column 'Overhead' indicates the percentage of the overall samples collected in the corresponding function.<br />
The second column reports the process from which the samples were collected. In per-thread/per-process<br />
mode, this is always the name of the monitored command. But in cpu-wide mode, the command can vary.<br />
The third column shows the name of the ELF image where the samples came from. If a program is dynamically<br />
linked, then this may show the name of a shared library. When the samples come from the kernel, then<br />
the pseudo ELF image name <tt>[kernel.kallsyms]</tt> is used. The fourth column indicates the privilege level<br />
at which the sample was taken, i.e. the level at which the program was running when it was interrupted:<br />
<br />
* [.] : user level<br />
* [k]: kernel level<br />
* [g]: guest kernel level (virtualization)<br />
* [u]: guest os user space<br />
* [H]: hypervisor<br />
<br />
The final column shows the symbol name.<br />
<br />
There are many different ways samples can be presented, i.e., sorted.<br />
To sort by shared objects, i.e., dsos:<br />
<pre><br />
perf report --sort=dso<br />
<br />
# Events: 1K cycles<br />
#<br />
# Overhead Shared Object<br />
# ........ ..............................<br />
#<br />
38.08% [kernel.kallsyms]<br />
28.23% libxul.so<br />
3.97% libglib-2.0.so.0.2800.6<br />
3.72% libc-2.13.so<br />
3.46% libpthread-2.13.so<br />
2.13% firefox-bin<br />
1.51% libdrm_intel.so.1.0.0<br />
1.38% dbus-daemon<br />
1.36% [drm]<br />
[...]<br />
</pre><br />
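<br />
Several sort keys can also be combined, separated by commas. For example, to break down samples first by process and then by shared object (output omitted):<br />
<pre><br />
perf report --sort=comm,dso<br />
</pre><br />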
<br />
<br />
=== Options controlling output ===<br />
<br />
To make the output easier to parse, it is possible to change the column separator<br />
to a single character:<br />
<pre><br />
perf report -t<br />
</pre><br />
<br />
=== Options controlling kernel reporting ===<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz). Therefore, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf report -k /tmp/vmlinux<br />
</pre><br />
Of course, this works only if the kernel is compiled with debug symbols.<br />
<br />
=== Processor-wide mode ===<br />
<br />
In per-cpu mode, samples are recorded from all threads running on the monitored<br />
CPUs. As a result, samples from many different processes may be collected.<br />
For instance, if we monitor across all CPUs for 5s:<br />
<pre><br />
perf record -a sleep 5<br />
perf report<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead Command Shared Object Symbol<br />
# ........ ............... ....................................................................<br />
#<br />
13.20% swapper [kernel.kallsyms] [k] read_hpet<br />
7.53% swapper [kernel.kallsyms] [k] mwait_idle_with_hints<br />
4.40% perf_2.6.38-8 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore<br />
4.07% perf_2.6.38-8 perf_2.6.38-8 [.] 0x34e1b<br />
3.88% perf_2.6.38-8 [kernel.kallsyms] [k] format_decode<br />
[...]<br />
</pre><br />
<br />
When a symbol is printed as a hexadecimal address, it is because the ELF image does not<br />
have a symbol table. This happens when binaries are stripped.<br />
We can sort by cpu as well. This could be useful to determine if the workload is well balanced:<br />
<pre><br />
perf report --sort=cpu<br />
<br />
# Events: 354 cycles<br />
#<br />
# Overhead CPU<br />
# ........ ...<br />
#<br />
65.85% 1<br />
34.15% 0<br />
</pre><br />
<br />
== Source level analysis with <tt>perf annotate</tt> ==<br />
<br />
It is possible to drill down to the instruction level with <tt>perf annotate</tt>.<br />
For that, you need to invoke <tt>perf annotate</tt> with the name of the command to annotate.<br />
All the functions with samples will be disassembled and each instruction will have its relative<br />
percentage of samples reported:<br />
<pre><br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
<br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop.noggdb<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
15.08 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.52 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
14.27 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.13 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
The first column reports the percentage of samples for the <tt>noploop</tt> function captured at that instruction.<br />
As explained earlier, you should interpret this information carefully.<br />
<br />
<tt>perf annotate</tt> can generate sourcecode level information if the application is compiled with <tt>-ggdb</tt>. The following<br />
snippet shows the much more informative output for the same execution of <tt>noploop</tt> when compiled with this debugging<br />
information.<br />
<pre><br />
------------------------------------------------<br />
Percent | Source code & Disassembly of noploop<br />
------------------------------------------------<br />
:<br />
:<br />
:<br />
: Disassembly of section .text:<br />
:<br />
: 08048484 <main>:<br />
: #include <string.h><br />
: #include <unistd.h><br />
: #include <sys/time.h><br />
:<br />
: int main(int argc, char **argv)<br />
: {<br />
0.00 : 8048484: 55 push %ebp<br />
0.00 : 8048485: 89 e5 mov %esp,%ebp<br />
[...]<br />
0.00 : 8048530: eb 0b jmp 804853d <main+0xb9><br />
: count++;<br />
14.22 : 8048532: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
0.00 : 8048536: 83 c0 01 add $0x1,%eax<br />
14.78 : 8048539: 89 44 24 2c mov %eax,0x2c(%esp)<br />
: memcpy(&tv_end, &tv_now, sizeof(tv_now));<br />
: tv_end.tv_sec += strtol(argv[1], NULL, 10);<br />
: while (tv_now.tv_sec < tv_end.tv_sec ||<br />
: tv_now.tv_usec < tv_end.tv_usec) {<br />
: count = 0;<br />
: while (count < 100000000UL)<br />
14.78 : 804853d: 8b 44 24 2c mov 0x2c(%esp),%eax<br />
56.23 : 8048541: 3d ff e0 f5 05 cmp $0x5f5e0ff,%eax<br />
0.00 : 8048546: 76 ea jbe 8048532 <main+0xae><br />
[...]<br />
</pre><br />
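<br />
For reference, a build and annotation sequence producing this kind of source-level output could look as follows (assuming the <tt>noploop.c</tt> source used throughout this tutorial):<br />
<pre><br />
gcc -ggdb -o noploop noploop.c<br />
perf record ./noploop 5<br />
perf annotate -d ./noploop<br />
</pre><br />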
<br />
=== Using <tt>perf annotate</tt> on kernel code ===<br />
<br />
The <tt>perf</tt> tool does not know how to extract symbols from compressed kernel images (vmlinuz).<br />
As in the case of <tt>perf report</tt>, users<br />
must pass the path of the uncompressed kernel using the <tt>-k</tt> option:<br />
<pre><br />
perf annotate -k /tmp/vmlinux -d symbol<br />
</pre><br />
Again, this only works if the kernel is compiled with debug symbols.<br />
<br />
== Live analysis with <tt>perf top</tt> ==<br />
<br />
The perf tool can operate in a mode similar to the Linux <tt>top</tt> tool,<br />
printing sampled functions in real time.<br />
The default sampling event is <tt>cycles</tt> and default order<br />
is descending number of samples per symbol, thus <tt>perf top</tt> shows the functions<br />
where most of the time is spent.<br />
By default, <tt>perf top</tt> operates in processor-wide mode, monitoring<br />
all online CPUs at both user and kernel levels. It is possible to monitor only<br />
a subset of the CPUs using the <tt>-C</tt> option.<br />
<br />
<pre><br />
perf top<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
   PerfTop:     260 irqs/sec  kernel:61.5%  exact:  0.0% [1000Hz cycles],  (all, 2 CPUs)<br />
-------------------------------------------------------------------------------------------------------------------------------------------------------<br />
<br />
samples pcnt function DSO<br />
_______ _____ ______________________________ ___________________________________________________________<br />
<br />
80.00 23.7% read_hpet [kernel.kallsyms]<br />
14.00 4.2% system_call [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_lock [kernel.kallsyms]<br />
14.00 4.2% __ticket_spin_unlock [kernel.kallsyms]<br />
8.00 2.4% hpet_legacy_next_event [kernel.kallsyms]<br />
7.00 2.1% i8042_interrupt [kernel.kallsyms]<br />
7.00 2.1% strcmp [kernel.kallsyms]<br />
6.00 1.8% _raw_spin_unlock_irqrestore [kernel.kallsyms]<br />
6.00 1.8% pthread_mutex_lock /lib/i386-linux-gnu/libpthread-2.13.so<br />
6.00 1.8% fget_light [kernel.kallsyms]<br />
6.00 1.8% __pthread_mutex_unlock_usercnt /lib/i386-linux-gnu/libpthread-2.13.so<br />
5.00 1.5% native_sched_clock [kernel.kallsyms]<br />
5.00 1.5% drm_addbufs_sg /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
</pre><br />
By default, the first column shows the aggregated number of samples since the beginning of the<br />
run. By pressing the 'Z' key, this can be changed to print the number of samples since the last<br />
refresh. Recall that the <tt>cycle</tt> event counts CPU cycles when the<br />
processor is not in halted state, i.e. not idle. Therefore this is '''not''' equivalent to<br />
wall clock time. Furthermore, the event is also subject to frequency scaling.<br />
<br />
It is also possible to drill down into single functions to see which instructions<br />
have the most samples.<br />
To drill down into a specific function, press the 's' key and enter the name of the function.<br />
Here we selected the top function <tt>noploop</tt> (not shown above):<br />
<pre><br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
PerfTop: 2090 irqs/sec kernel:50.4% exact: 0.0% [1000Hz cycles], (all, 16 CPUs)<br />
------------------------------------------------------------------------------------------------------------------------------------------<br />
Showing cycles for noploop<br />
Events Pcnt (>=5%)<br />
0 0.0% 00000000004003a1 <noploop>:<br />
0 0.0% 4003a1: 55 push %rbp<br />
0 0.0% 4003a2: 48 89 e5 mov %rsp,%rbp<br />
3550 100.0% 4003a5: eb fe jmp 4003a5 <noploop+0x4><br />
<br />
</pre><br />
<br />
== Benchmarking with <tt>perf bench</tt> ==<br />
<br />
The <tt>perf bench</tt> command includes a number of multi-threaded microbenchmarks<br />
that exercise different kernel subsystems and system calls. This allows<br />
hackers to easily stress and measure the impact of changes, and therefore helps mitigate<br />
performance regressions.<br />
<br />
It also serves as a general benchmark framework, enabling developers to easily create<br />
test cases that integrate transparently with, and make use of, the rest of the perf tooling.<br />
<br />
=== sched: Scheduler and IPC benchmarks ===<br />
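<br />
Stresses the scheduler and inter-process communication paths. For illustration (the available sub-benchmarks depend on the perf version), a hackbench-style messaging benchmark and a pipe ping-pong test can be run with:<br />
<pre><br />
perf bench sched messaging<br />
perf bench sched pipe<br />
</pre><br />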
<br />
=== mem: Memory access benchmarks ===<br />
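<br />
Measures the performance of basic memory operations such as <tt>memcpy()</tt> and <tt>memset()</tt>. For illustration (again, the available variants depend on the perf version):<br />
<pre><br />
perf bench mem memcpy<br />
perf bench mem memset<br />
</pre><br />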
<br />
=== numa: NUMA scheduling and MM benchmarks ===<br />
<br />
=== futex: Futex stressing benchmarks ===<br />
<br />
Deals with finer-grained aspects of the kernel's futex implementation, and is mostly<br />
useful for kernel hacking. It currently supports wakeup and requeue/wait operations, as<br />
well as stressing the hashing scheme for both private and shared futexes. An example run,<br />
with one thread per CPU, each operating on 1024 futexes while exercising the hashing logic:<br />
<br />
<pre><br />
$ perf bench futex hash<br />
# Running 'futex/hash' benchmark:<br />
Run summary [PID 17428]: 4 threads, each operating on 1024 [private] futexes for 10 secs.<br />
<br />
[thread 0] futexes: 0x2775700 ... 0x27766fc [ 3343462 ops/sec ]<br />
[thread 1] futexes: 0x2776920 ... 0x277791c [ 3679539 ops/sec ]<br />
[thread 2] futexes: 0x2777ab0 ... 0x2778aac [ 3589836 ops/sec ]<br />
[thread 3] futexes: 0x2778c40 ... 0x2779c3c [ 3563827 ops/sec ]<br />
<br />
Averaged 3544166 operations/sec (+- 2.01%), total secs = 10<br />
</pre><br />
<br />
== Troubleshooting and Tips ==<br />
<br />
This section lists a number of tips to avoid common pitfalls when using perf.<br />
<br />
=== Open file limits ===<br />
<br />
The design of the perf_events kernel interface, which is used by the perf tool, is such that it uses one file descriptor<br />
per event, per thread or per CPU.<br />
<br />
On a 16-way system, when you do:<br />
<pre><br />
perf stat -e cycles sleep 1<br />
</pre><br />
You are effectively creating 16 events, and thus consuming 16 file descriptors.<br />
<br />
In per-thread mode, when you are sampling a process with 100 threads on<br />
the same 16-way system:<br />
<pre><br />
perf record -e cycles my_hundred_thread_process<br />
</pre><br />
Then, once all the threads are created, you end up with 100 * 1 (event) * 16 (cpus) = 1600 file descriptors.<br />
Perf creates one instance of the event on each CPU. Only when the thread executes<br />
on that CPU does the event effectively measure. This approach enforces sampling buffer locality and thus<br />
mitigates sampling overhead. At the end of the run, the tool aggregates all the samples into a single output file.<br />
<br />
In case perf aborts with 'too many open files' error, there are a few solutions:<br />
<br />
* increase the number of per-process open files using ulimit -n. Caveat: you must be root<br />
* limit the number of events you measure in one run<br />
* limit the number of CPUs you are measuring<br />
<br />
==== increasing open file limit ====<br />
<br />
The superuser can override the per-process open file limit using the <tt>ulimit</tt> shell builtin command:<br />
<pre><br />
ulimit -a<br />
[...]<br />
open files (-n) 1024<br />
[...]<br />
<br />
ulimit -n 2048<br />
ulimit -a<br />
[...]<br />
open files (-n) 2048<br />
[...]<br />
</pre><br />
<br />
<br />
=== Binary identification with <tt>build-id</tt> ===<br />
<br />
The <tt>perf record</tt> command saves in the <tt>perf.data</tt> file unique identifiers for all ELF images relevant to the<br />
measurement. In per-thread mode, this includes all the ELF images of the monitored processes. In cpu-wide<br />
mode, it includes all processes running on the system. Those unique identifiers are generated by the linker when<br />
the <tt>-Wl,--build-id</tt> option is used; thus, they are called <tt>build-id</tt>s.<br />
A <tt>build-id</tt> is a helpful tool when correlating instruction addresses to ELF images.<br />
To extract all <tt>build-id</tt> entries used in a <tt>perf.data</tt> file, issue:<br />
<pre><br />
perf buildid-list -i perf.data<br />
<br />
06cb68e95cceef1ff4e80a3663ad339d9d6f0e43 [kernel.kallsyms]<br />
e445a2c74bc98ac0c355180a8d770cd35deb7674 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/i915/i915.ko<br />
83c362c95642c3013196739902b0360d5cbb13c6 /lib/modules/2.6.38-8-generic/kernel/drivers/net/wireless/iwlwifi/iwlcore.ko<br />
1b71b1dd65a7734e7aa960efbde449c430bc4478 /lib/modules/2.6.38-8-generic/kernel/net/mac80211/mac80211.ko<br />
ae4d6ec2977472f40b6871fb641e45efd408fa85 /lib/modules/2.6.38-8-generic/kernel/drivers/gpu/drm/drm.ko<br />
fafad827c43e34b538aea792cc98ecfd8d387e2f /lib/i386-linux-gnu/ld-2.13.so<br />
0776add23cf3b95b4681e4e875ba17d62d30c7ae /lib/i386-linux-gnu/libdbus-1.so.3.5.4<br />
f22f8e683907b95384c5799b40daa455e44e4076 /lib/i386-linux-gnu/libc-2.13.so<br />
[...]<br />
</pre><br />
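<br />
To check whether a particular binary was linked with a <tt>build-id</tt> in the first place, its ELF notes can be inspected directly (an illustration using <tt>readelf</tt> from binutils):<br />
<pre><br />
readelf -n /lib/i386-linux-gnu/libc-2.13.so | grep "Build ID"<br />
</pre><br />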
<br />
==== The <tt>build-id</tt> cache ====<br />
<br />
At the end of each run, the <tt>perf record</tt> command updates a <tt>build-id</tt> cache, with new entries for ELF images with samples.<br />
The cache contains:<br />
<br />
* <tt>build-id</tt> for ELF images with samples<br />
* copies of the ELF images with samples<br />
<br />
Given that <tt>build-id</tt>s are immutable, they uniquely identify a binary. If a binary is recompiled, a new <tt>build-id</tt> is generated<br />
and a new copy of the ELF image is saved in the cache.<br />
The cache is saved on disk in a directory which is by default <tt>$HOME/.debug</tt>. There is a global configuration file <tt>/etc/perfconfig</tt><br />
which can be used by sysadmins to specify an alternate global directory for the cache:<br />
<pre><br />
$ cat /etc/perfconfig<br />
[buildid]<br />
dir = /var/tmp/.debug<br />
</pre><br />
<br />
In certain situations it may be beneficial to turn off the <tt>build-id</tt> cache updates altogether. For that, you must pass the <tt>-N</tt> option to <tt>perf record</tt><br />
<pre><br />
perf record -N dd if=/dev/zero of=/dev/null count=100000<br />
</pre><br />
<br />
=== Access Control ===<br />
<br />
For some events, it is necessary to be <tt>root</tt> to invoke the <tt>perf</tt> tool. This document assumes<br />
that the user has root privileges. If you try to run perf with insufficient privileges, it will<br />
report<br />
<pre><br />
No permission to collect system-wide stats.<br />
</pre><br />
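<br />
On recent kernels, the amount of access granted to unprivileged users is controlled by the <tt>kernel.perf_event_paranoid</tt> sysctl. A hedged illustration (the exact meaning of each level depends on the kernel version; lowering it requires root):<br />
<pre><br />
cat /proc/sys/kernel/perf_event_paranoid<br />
sysctl -w kernel.perf_event_paranoid=0<br />
</pre><br />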
<br />
== Other Scenarios ==<br />
<br />
===Profiling sleep times===<br />
<br />
This feature shows where and for how long a program is sleeping or waiting for something.<br />
<br />
The first step is collecting data. We need to collect sched_stat and sched_switch events. Sched_stat events are not enough, because they are generated in the context of the task that wakes up the target task (e.g. releases a lock). We need the same event, but with the call-chain of the target task. This call-chain can be extracted from a previous sched_switch event.<br />
<br />
The second step is merging the sched_stat and sched_switch events. This can be done with the help of "perf inject -s".<br />
<br />
$ ./perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o ~/perf.data.raw ~/foo<br />
$ ./perf inject -v -s -i ~/perf.data.raw -o ~/perf.data<br />
$ ./perf report --stdio --show-total-period -i ~/perf.data<br />
# Overhead Period Command Shared Object Symbol<br />
# ........ ............ ....... ................. ..............<br />
#<br />
100.00% 502408738 foo [kernel.kallsyms] [k] __schedule<br />
|<br />
--- __schedule<br />
schedule<br />
| <br />
|--79.85%-- schedule_hrtimeout_range_clock<br />
| schedule_hrtimeout_range<br />
| poll_schedule_timeout<br />
| do_select<br />
| core_sys_select<br />
| sys_select<br />
| system_call_fastpath<br />
| __select<br />
| __libc_start_main<br />
| <br />
--20.15%-- do_nanosleep<br />
hrtimer_nanosleep<br />
sys_nanosleep<br />
system_call_fastpath<br />
__GI___libc_nanosleep<br />
__libc_start_main<br />
<br />
$cat foo.c<br />
...<br />
for (i = 0; i < 10; i++) {<br />
ts1.tv_sec = 0;<br />
ts1.tv_nsec = 10000000;<br />
nanosleep(&ts1, NULL);<br />
<br />
tv1.tv_sec = 0;<br />
tv1.tv_usec = 40000;<br />
select(0, NULL, NULL, NULL,&tv1);<br />
}<br />
...<br />
<br />
== Other Resources ==<br />
<br />
=== Linux sourcecode ===<br />
The <tt>perf tools</tt> sourcecode lives in the Linux kernel tree under [http://lxr.linux.no/linux+v2.6.39/tools/perf/| <tt>/tools/perf</tt>]. You will find much more documentation in [http://lxr.linux.no/linux+v2.6.39/tools/perf/Documentation/ | <tt>/tools/perf/Documentation</tt>]. To build manpages, info pages and more, install these tools:<br />
<br />
* asciidoc<br />
* tetex-fonts<br />
* tetex-dvips<br />
* dialog<br />
* tetex<br />
* tetex-latex<br />
* xmltex<br />
* passivetex<br />
* w3m<br />
* xmlto<br />
<br />
and issue a <tt>make install-man</tt> from <tt>/tools/perf</tt>. This step is also required to <br />
be able to run <tt>perf help <command></tt>.<br />
<br />
----<br />
<br />
This guide is adapted from a tutorial by Stephane Eranian at Google, with contributions from Eric Gouriou, Tipp Moseley and Willem de Bruijn. The original content imported into wiki.perf.google.com is made available under the [http://creativecommons.org/licenses/by-sa/3.0/ CreativeCommons attribution sharealike 3.0 license].</div>Davidlohr Buesohttps://perf.wiki.kernel.org/index.php/Main_PageMain Page2014-05-30T18:27:44Z<p>Davidlohr Bueso: add perf-bench entry</p>
<hr />
<div>== <tt>perf:</tt> Linux profiling with performance counters ==<br />
''...More than just counters...''<br />
<br />
=== Introduction ===<br />
<br />
This is the wiki page for the <tt>perf</tt> performance counters subsystem in Linux.<br />
Performance counters are CPU hardware registers that <br />
count hardware events such <br />
as instructions executed, cache-misses suffered, or branches mispredicted. They form<br />
a basis for profiling applications to trace dynamic control flow and identify hotspots. <br />
<br />
<tt>perf</tt> provides rich generalized abstractions over hardware specific<br />
capabilities. Among others, it provides per task, per CPU and per-workload counters,<br />
sampling on top of these and source code event annotation.<br />
<br />
The userspace <tt>perf tools</tt> present a simple to use interface with commands like<br />
<br />
* <tt>[[Tutorial#Counting_with_perf_stat| perf stat</tt>]]: obtain event counts<br />
* <tt>[[Tutorial#Sampling_with_perf_record | perf record</tt>]]: record events for later reporting<br />
* <tt>[[Tutorial#Sample_analysis_with_perf_report | perf report</tt>]]: break down events by process, function, etc.<br />
* <tt>[[Tutorial#Source_level_analysis_with_perf_annotate | perf annotate</tt>]]: annotate assembly or source code with event counts<br />
* <tt>[[Tutorial#Live_analysis_with_perf_top | perf top</tt>]]: see live event counts<br />
* <tt>[[Tutorial#Benchmarking_with_perf_bench | perf bench</tt>]]: run different kernel microbenchmarks<br />
<br />
To learn more, see the examples in the <tt>[[Tutorial]]</tt>.<br />
<br />
=== Wiki Contents ===<br />
<br />
* [[Tutorial]]<br />
* [[Todo]]<br />
* [[HardwareReference]]<br />
* [[perf_events kernel ABI]]<br />
<br />
=== References/Useful links ===<br />
* <tt>[http://indico.cern.ch/materialDisplay.py?contribId=20&sessionId=4&materialId=slides&confId=141309 Roberto Vitillo's presentation on Perf events]</div>Davidlohr Bueso