系统事件:
perf tool 支持一系列可计算的事件类型。该工具和底层内核接口能够监測来自不同来源的事件。
比如,一些事件是来源于纯粹的内核计数器,这些event在这样的情况下被称为软件事件。样例包含:context-switch、minor-fault等。
事件的还有一个来源是处理器本身及其性能监控装置(PMU)。
它提供了一系列事件来监測 micro-architectural 事件,如cycle的数量、指令退出、L1缓存不命中等等。
这些事件被称为PMU硬件事件或硬件事件。每种处理器型号都有各种不同的硬件事件类型。perf_events接口还提供了一组通用的硬件事件名称。
在每一个处理器,这些事件映射到实际的CPU提供的实际存在的事件类型,假设不存在则不能使用。
并且这些也被称为硬件事件(hardware events)和硬件缓存事件(hardware cache events)。
最后,还有各种 tracepoint事件是由内核ftrace框架实现的。并且这些事件都是2.6.3x或更加新的内核才支持。
Tracepoints:
Tracepoint 是散落在内核源码中的一些 hook。一旦使能。它们便能够在特定的代码被执行到时被触发,这一特性能够被各种 trace/debug 工具所使用。Perf 就是该特性的用户之中的一个。
假如您想知道在应用程序执行期间,内核内存管理模块的行为。便能够利用潜伏在 slab 分配器中的 tracepoint。当内核执行到这些 tracepoint 时。便会通知 perf。
Perf 将 tracepoint 产生的事件记录下来。生成报告,通过分析这些报告,调优人员便能够了解程序执行时期内核的种种细节,对性能症状作出更准确的诊断。
经常使用perf工具:
perf list:列出全部事件类型
perf top:类似top的全局监控工具
perf stat:对某个程序或者特定进程进行监控
perf record & perf report:先记录再通过具体报告的形式打印事件信息
- perf list
[root@localhost test]# perf list
List of pre-defined events (to be used in -e): cpu-cycles OR cycles [Hardware event] instructions [Hardware event] cache-references [Hardware event] cache-misses [Hardware event] branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] ref-cycles [Hardware event] cpu-clock [Software event] task-clock [Software event] page-faults OR faults [Software event] context-switches OR cs [Software event] cpu-migrations OR migrations [Software event] minor-faults [Software event] major-faults [Software event] alignment-faults [Software event] emulation-faults [Software event] L1-dcache-loads [Hardware cache event] L1-dcache-load-misses [Hardware cache event] L1-dcache-stores [Hardware cache event] L1-dcache-store-misses [Hardware cache event] L1-dcache-prefetches [Hardware cache event] L1-dcache-prefetch-misses [Hardware cache event] L1-icache-loads [Hardware cache event] L1-icache-load-misses [Hardware cache event] L1-icache-prefetches [Hardware cache event] L1-icache-prefetch-misses [Hardware cache event] LLC-loads [Hardware cache event] LLC-load-misses [Hardware cache event] LLC-stores [Hardware cache event] LLC-store-misses [Hardware cache event] LLC-prefetches [Hardware cache event] LLC-prefetch-misses [Hardware cache event] dTLB-loads [Hardware cache event] dTLB-load-misses [Hardware cache event] dTLB-stores [Hardware cache event] dTLB-store-misses [Hardware cache event] dTLB-prefetches [Hardware cache event] dTLB-prefetch-misses [Hardware cache event] iTLB-loads [Hardware cache event] iTLB-load-misses [Hardware cache event] branch-loads [Hardware cache event] branch-load-misses [Hardware cache event] rNNN [Raw hardware event descriptor] cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor] sunrpc:rpc_call_status [Tracepoint event] sunrpc:rpc_bind_status [Tracepoint event] sunrpc:rpc_connect_status [Tracepoint event] sunrpc:rpc_task_begin [Tracepoint event] sunrpc:rpc_task_run_action [Tracepoint event] sunrpc:rpc_task_complete [Tracepoint event] sunrpc:rpc_task_sleep [Tracepoint event] sunrpc:rpc_task_wakeup [Tracepoint event] ext4:ext4_free_inode [Tracepoint event] ext4:ext4_request_inode [Tracepoint event] ext4:ext4_allocate_inode [Tracepoint event] ext4:ext4_write_begin [Tracepoint event] ext4:ext4_da_write_begin [Tracepoint event] 省略............
当中通过perf list命令执行在不同的系统会列出不同的结果,在 2.6.35 版本号的内核中,该列表已经相当的长,但不管有多少,我们能够将它们划分为三类:
Hardware Event 是由 PMU 硬件产生的事件。比方 cache 命中,当您须要了解程序对硬件特性的使用情况时,便须要对这些事件进行採样;Software Event 是内核软件产生的事件。比方进程切换,tick 数等 ;
Tracepoint event 是内核中的静态 tracepoint 所触发的事件,这些 tracepoint 用来推断程序执行期间内核的行为细节。比方 slab 分配器的分配次数等。
在操作系统执行过程中。关于系统调用的调度优先级别,从高到低是这种:
硬中断->软中断->实时进程->内核进程->用户进程
Hard Interrupts – Devices tell the kernel that they are done processing. For example, a NIC delivers a packet or a hard drive provides an IO request(硬中断)
Soft Interrupts – These are kernel software interrupts that have to do with maintenance of the kernel. For example, the kernel clock tick thread is a soft interrupt. It checks to make sure a process has not passed its allotted time on a processor.(软中断)
Real Time Processes – Real time processes have more priority than the kernel itself. A real time process may come on the CPU and preempt (or “kick off”) the kernel. The Linux 2.4 kernel is NOT a fully preemptable kernel, making it not ideal for real time application programming.(实时进程,会抢占其它非实时进程时间片)
Kernel (System) Processes – All kernel processing is handled at this level of priority. (内核进程)
User Processes – This space is often referred to as “userland”. All software applications run in the user space. This space has the lowest priority in the kernel scheduling mechanism.(用户进程)
- perf top
Samples: 475 of event 'cycles', Event count (approx.): 165419249 10.30% [kernel] [k] kallsyms_expand_symbol 7.75% perf [.] 0x000000000005f190 5.68% libc-2.12.so [.] memcpy 5.45% [kernel] [k] format_decode 5.45% libc-2.12.so [.] __strcmp_sse42 5.45% perf [.] symbols__insert 5.24% [kernel] [k] memcpy 4.86% [kernel] [k] vsnprintf 4.62% [kernel] [k] string 4.45% [kernel] [k] number 3.35% libc-2.12.so [.] __strstr_sse42 3.15% perf [.] hex2u64 2.72% perf [.] rb_next 2.10% [kernel] [k] pointer 1.89% [kernel] [k] security_real_capable_noaudit 1.89% [kernel] [k] strnlen 1.88% perf [.] rb_insert_color 1.68% libc-2.12.so [.] _int_malloc 1.68% libc-2.12.so [.] _IO_getdelim 1.28% [kernel] [k] update_iter 1.05% [kernel] [k] s_show 1.05% libc-2.12.so [.] memchr 1.05% [kernel] [k] get_task_cred 0.88% [kernel] [k] seq_read 0.85% [kernel] [k] clear_page_c 0.84% perf [.] symbol__new 0.63% libc-2.12.so [.] __libc_calloc 0.63% [kernel] [k] copy_user_generic_string 0.63% [kernel] [k] seq_vprintf 0.63% perf [.] kallsyms__parse 0.63% libc-2.12.so [.] _IO_feof 0.63% [kernel] [k] strcmp 0.63% [kernel] [k] page_fault 0.63% perf [.] dso__load_sym 0.42% perf [.] strstr@plt 0.42% libc-2.12.so [.] __strchr_sse42 0.42% [kernel] [k] seq_printf 0.42% libc-2.12.so [.] __memset_sse2 0.42% libelf-0.152.so [.] gelf_getsym 0.25% [kernel] [k] __mutex_init 0.21% [kernel] [k] _spin_lock 0.21% [kernel] [k] s_next 0.21% [kernel] [k] fsnotify 0.21% [kernel] [k] sys_read 0.21% [kernel] [k] __link_path_walk 0.21% [kernel] [k] intel_pmu_disable_all 0.21% [kernel] [k] mem_cgroup_charge_common 0.21% [kernel] [k] update_shares 0.21% [kernel] [k] native_read_tsc 0.21% [kernel] [k] apic_timer_interrupt 0.21% [kernel] [k] auditsys 0.21% [kernel] [k] do_softirq 0.21% [kernel] [k] update_process_times 0.21% [kernel] [k] native_sched_clock 0.21% [kernel] [k] mutex_lock 0.21% [kernel] [k] module_get_kallsymperf top 类似top的作用。能够总体查看系统的所有事件而且按数量排序。
- perf stat
perf stat能够直接接上命令或者通过pid形式对进程进行性能统计,完毕后会打印出这个过程的性能数据统计
[root@localhost test]# perf stat -B dd if=/dev/zero of=/dev/null count=1000000 记录了1000000+0 的读入 记录了1000000+0 的写出 512000000字节(512 MB)已复制,0.790059 秒,648 MB/秒 Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000': 791.176286 task-clock # 1.000 CPUs utilized 1 context-switches # 0.001 K/sec 0 cpu-migrations # 0.000 K/sec 247 page-faults # 0.312 K/sec 1,248,519,452 cycles # 1.578 GHz [83.29%] 366,166,452 stalled-cycles-frontend # 29.33% frontend cycles idle [83.33%] 155,120,002 stalled-cycles-backend # 12.42% backend cycles idle [66.62%] 1,947,919,494 instructions # 1.56 insns per cycle # 0.19 stalled cycles per insn [83.31%] 355,465,524 branches # 449.287 M/sec [83.41%] 2,021,648 branch-misses # 0.57% of all branches [83.42%] 0.791116595 seconds time elapsed
通过pid指定某进程进行性能监測,按ctrl-C结束,然后计算这段时间内的性能统计
[root@localhost test]# perf stat -p 18669 ^C Performance counter stats for process id '18669': 1.520699 task-clock # 0.001 CPUs utilized 56 context-switches # 0.037 M/sec 0 cpu-migrations # 0.000 K/sec 0 page-faults # 0.000 K/sec 2,178,120 cycles # 1.432 GHz [63.18%] 1,410,393 stalled-cycles-frontend # 64.75% frontend cycles idle [90.94%] 942,665 stalled-cycles-backend # 43.28% backend cycles idle 1,067,824 instructions # 0.49 insns per cycle # 1.32 stalled cycles per insn 193,104 branches # 126.984 M/sec 14,544 branch-misses # 7.53% of all branches [61.93%] 2.061889979 seconds time elapsed
[root@localhost test]# perf stat usage: perf stat [<options>] [<command>] -e, --event <event> event selector. use 'perf list' to list available events --filter <filter> event filter -i, --no-inherit child tasks do not inherit counters -p, --pid <pid> stat events on existing process id -t, --tid <tid> stat events on existing thread id -a, --all-cpus system-wide collection from all CPUs -g, --group put the counters into a counter group -c, --scale scale/normalize counters -v, --verbose be more verbose (show counter open errors, etc) -r, --repeat <n> repeat command and print average + stddev (max: 100) -n, --null null run - dont start any counters -d, --detailed detailed run - start a lot of events -S, --sync call sync() before starting a run -B, --big-num print large numbers with thousands' separators -C, --cpu <cpu> list of cpus to monitor in system-wide -A, --no-aggr disable CPU count aggregation -x, --field-separator <separator> print counts with custom separator -G, --cgroup <name> monitor event in cgroup name only -o, --output <file> output file name --append append to the output file --log-fd <n> log output to fd, instead of stderr --pre <command> command to run prior to the measured command --post <command> command to run after to the measured command
perf stat 还有其它实用的选项如 -e 是匹配指定事件
perf stat -e cycles,instructions,cache-misses [...]也能够指定监控的级别
perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000详细含义是:
Modifiers | Description | Example |
---|---|---|
u | monitor at priv level 3, 2, 1 (user) | event:u |
k | monitor at priv level 0 (kernel) | event:k |
h | monitor hypervisor events on a virtualization environment | event:h |
H | monitor host machine on a virtualization environment | event:H |
G | monitor guest machine on a virtualization environment | event:G |
操作系统资源一般分为四大部分:CPU、内存、IO、网络
easy出现瓶颈的通常是CPU和IO。所以我这里写了两个小程序分别模拟这两种情况,然后通过使用perf一系列的工具找到症结所在。
程序1是计算素数,专门用来耗系统CPU的
#include <stdio.h> int IsPrime(int num) { int i=2; for(;i<=num/2;i++) if(0==num%i) return 0; return 1; } void main() { int num; for(num=2;num<=10000000;num++) IsPrime(num); }
程序2是开n个线程。每隔1ms写1K的数据到文件,读1k的数据。用来耗系统IO
#include <unistd.h> #include <stdio.h> #include <string> #include <sys/types.h> #include <sys/wait.h> #include <sys/stat.h> #include <fcntl.h> #include <stdlib.h> #include <vector> using namespace std; bool gquit = false; void* threadfun(void* param) { string filename = *(string*)param; unlink(filename.c_str()); int fd = open(filename.c_str(), O_CREAT | O_RDWR, 666); if (fd < 0) return 0; string a(1024*16*1024, 'a'); bool flag = true; while (!gquit) { if (flag) { write(fd, a.c_str(), a.length()); } else { read(fd, &a[0], a.length()); lseek(fd, 0, SEEK_SET); } flag = !flag; usleep(1); } close(fd); return 0; } int main(int argc, char* argv[]) { if (argc != 2) { printf("need thread num "); return 0; } int threadnum = atoi(argv[1]); vector<string> v(threadnum, "./testio.filex_"); vector<pthread_t> vpid(threadnum, 0); for(int i = 0 ; i < threadnum ; ++ i) { pthread_attr_t attr; pthread_attr_init(&attr); static char buf[32]; snprintf(buf, sizeof(buf), "%d", i); v[i] += buf; pthread_create(&vpid[i], & attr, threadfun, (void*)&v[i]); } printf("press q to quit "); while(getchar() != 'q'); gquit = true; for(int i = 0 ; i < threadnum ; ++ i) { void* ret = NULL; pthread_join(vpid[i], &ret); unlink(v[i].c_str()); } return 0; }
我先执行程序1后。通过perf top进行全局监控
Samples: 1K of event 'cycles', Event count (approx.): 690036067 74.14% cpu [.] IsPrime 2.45% [kernel] [k] kallsyms_expand_symbol 2.21% perf [.] 0x000000000005f225 1.65% perf [.] symbols__insert 1.50% perf [.] hex2u64 1.41% libc-2.12.so [.] __strcmp_sse42 1.26% [kernel] [k] string 1.25% [kernel] [k] format_decode 1.10% [kernel] [k] vsnprintf 0.95% [kernel] [k] strnlen 0.85% [kernel] [k] memcpy 0.80% libc-2.12.so [.] memcpy 0.70% [kernel] [k] number 0.60% perf [.] rb_next 0.60% libc-2.12.so [.] _int_malloc 0.55% libc-2.12.so [.] __strstr_sse42 0.55% [kernel] [k] pointer 0.55% perf [.] rb_insert_color
能够看出排名第一是cpu这个进程的IsPrime函数占用了最多的CPU性能
然后通过perf stat 指定pid进行监控
[root@localhost test]# perf stat -p 10704 ^C Performance counter stats for process id '10704': 4067.488066 task-clock # 1.001 CPUs utilized [100.00%] 5 context-switches # 0.001 K/sec [100.00%] 0 cpu-migrations # 0.000 K/sec [100.00%] 0 page-faults # 0.000 K/sec 10,743,902,971 cycles # 2.641 GHz [83.32%] 5,863,649,174 stalled-cycles-frontend # 54.58% frontend cycles idle [83.32%] 1,347,218,352 stalled-cycles-backend # 12.54% backend cycles idle [66.69%] 14,625,550,503 instructions # 1.36 insns per cycle # 0.40 stalled cycles per insn [83.34%] 1,950,981,334 branches # 479.653 M/sec [83.35%] 46,273 branch-misses # 0.00% of all branches [83.32%] 4.064800952 seconds time elapsed
能够看到各种系统事件的统计信息
然后使用perf record和report进行更加详细的分析
[root@localhost test]# perf record -g -- ./cpu ^C[ perf record: Woken up 4 times to write data ] [ perf record: Captured and wrote 0.875 MB perf.data (~38244 samples) ] ./cpu: Interrupt [root@localhost test]# perf report Samples: 11K of event 'cycles', Event count (approx.):6815943846 + 99.70% cpu cpu [.] IsPrime + 0.05% cpu [kernel.kallsyms] [k] hrtimer_interrupt + 0.03% cpu cpu [.] main + 0.02% cpu [kernel.kallsyms] [k] native_write_msr_safe + 0.02% cpu [kernel.kallsyms] [k] _spin_unlock_irqrestore + 0.02% cpu [kernel.kallsyms] [k] perf_adjust_freq_unthr_context + 0.01% cpu [kernel.kallsyms] [k] rb_insert_color + 0.01% cpu [kernel.kallsyms] [k] raise_softirq + 0.01% cpu [kernel.kallsyms] [k] rcu_process_gp_end + 0.01% cpu [kernel.kallsyms] [k] scheduler_tick + 0.01% cpu [kernel.kallsyms] [k] do_timer + 0.01% cpu [kernel.kallsyms] [k] _spin_lock_irqsave + 0.01% cpu [kernel.kallsyms] [k] apic_timer_interrupt + 0.01% cpu [kernel.kallsyms] [k] tick_sched_timer + 0.01% cpu [kernel.kallsyms] [k] _spin_lock + 0.01% cpu [kernel.kallsyms] [k] __remove_hrtimer + 0.01% cpu [kernel.kallsyms] [k] idle_cpu非常清晰能够看到各种函数的调用情况,因为这个程序是个非常easy的代码。所以easy看出问题所在,可是假设是大项目的某个功能出现故障,通过perf工具就能够非常方便地定位问题,后面再看看一个占用IO的样例。
相同也是先启动程序2,先通过perf top、perf stat -p pid这两个命令先定位是哪个进程导致性能问题
Samples: 394K of event 'cycles', Event count (approx.): 188517759438 75.39% [kernel] [k] _spin_lock 11.39% [kernel] [k] copy_user_generic_string 1.89% [jbd2] [k] jbd2_journal_stop 1.66% [kernel] [k] find_get_page 1.51% [jbd2] [k] start_this_handle 0.95% [ext4] [k] ext4_da_write_end 0.76% [kernel] [k] iov_iter_fault_in_readable 0.74% [ext4] [k] ext4_journal_start_sb 0.56% [jbd2] [k] __jbd2_log_space_left 0.44% [ext4] [k] __ext4_journal_stop 0.44% [kernel] [k] __block_prepare_write 0.41% [kernel] [k] __wake_up_bit 0.35% [kernel] [k] _cond_resched 0.32% [kernel] [k] kmem_cache_free
能够看到非常多 spin lock的事件和各种文件操作事件
直接通过perf record和report进行分析,得出结果
[root@localhost test]# perf record -g -- ./io 10 press q to quit q [ perf record: Woken up 83 times to write data ] [ perf record: Captured and wrote 23.361 MB perf.data (~1020643 samples) ] [root@localhost test]# perf report Samples: 138K of event 'cycles', Event count (approx.): 64858537028 + 65.87% io [kernel.kallsyms] [k] _spin_lock + 14.19% io [kernel.kallsyms] [k] copy_user_generic_string + 2.74% io [jbd2] [k] jbd2_journal_stop + 2.46% io [kernel.kallsyms] [k] find_get_page + 2.11% io [jbd2] [k] start_this_handle + 1.06% io [kernel.kallsyms] [k] iov_iter_fault_in_readable + 1.05% io [ext4] [k] ext4_journal_start_sb + 0.85% io [ext4] [k] ext4_da_write_end + 0.73% io [jbd2] [k] __jbd2_log_space_left + 0.66% io [kernel.kallsyms] [k] generic_write_end + 0.61% io [ext4] [k] __ext4_journal_stop + 0.57% io [kernel.kallsyms] [k] __block_prepare_write + 0.54% io [kernel.kallsyms] [k] __wake_up_bit
好的,以上就是perf经常使用的工具,当然perf不止这几个工具。但这几个是最基础最经常使用的工具
通过查看官方wiki https://perf.wiki.kernel.org/index.php/Tutorial 能够了解很多其它信息