zoukankan      html  css  js  c++  java
  • #定位系统性能瓶颈# perf

    perf是一个基于Linux 2.6+的调优工具,在liunx性能測量抽象出一套适应于各种不同CPU硬件的通用測量方法,其数据来源于比較新的linux内核提供的 perf_event 接口

    系统事件:
    perf tool 支持一系列可计算的事件类型。该工具和底层内核接口能够监測来自不同来源的事件。
    比如,一些事件是来源于纯粹的内核计数器,这些event在这样的情况下被称为软件事件。样例包含:context-switch、minor-fault等。


    事件的还有一个来源是处理器本身及其性能监控装置(PMU)。

    它提供了一系列事件来监測 micro-architectural 事件,如cycle的数量、指令退出、L1缓存不命中等等。

    这些事件被称为PMU硬件事件或硬件事件。每种处理器型号都有各种不同的硬件事件类型。

    perf_events接口还提供了一组通用的硬件事件名称。


    在每一个处理器,这些事件映射到实际的CPU提供的实际存在的事件类型,假设不存在则不能使用。
    并且这些也被称为硬件事件(hardware events)和硬件缓存事件(hardware cache events)。

    最后,还有各种 tracepoint事件是由内核ftrace框架实现的。并且这些事件都是2.6.3x或更加新的内核才支持。

    Tracepoints:
    Tracepoint 是散落在内核源码中的一些 hook。一旦使能。它们便能够在特定的代码被执行到时被触发,这一特性能够被各种 trace/debug 工具所使用。Perf 就是该特性的用户之中的一个。


    假如您想知道在应用程序执行期间,内核内存管理模块的行为。便能够利用潜伏在 slab 分配器中的 tracepoint。当内核执行到这些 tracepoint 时。便会通知 perf。


    Perf 将 tracepoint 产生的事件记录下来。生成报告,通过分析这些报告,调优人员便能够了解程序执行时期内核的种种细节,对性能症状作出更准确的诊断。

    经常使用perf工具:
    perf list:列出全部事件类型
    perf top:类似top的全局监控工具
    perf stat:对某个程序或者特定进程进行监控

    perf record & perf report:先记录再通过具体报告的形式打印事件信息


    • perf list

    [root@localhost test]# perf list

    List of pre-defined events (to be used in -e):
      cpu-cycles OR cycles                               [Hardware event]
      instructions                                       [Hardware event]
      cache-references                                   [Hardware event]
      cache-misses                                       [Hardware event]
      branch-instructions OR branches                    [Hardware event]
      branch-misses                                      [Hardware event]
      bus-cycles                                         [Hardware event]
      stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
      stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
      ref-cycles                                         [Hardware event]
    
      cpu-clock                                          [Software event]
      task-clock                                         [Software event]
      page-faults OR faults                              [Software event]
      context-switches OR cs                             [Software event]
      cpu-migrations OR migrations                       [Software event]
      minor-faults                                       [Software event]
      major-faults                                       [Software event]
      alignment-faults                                   [Software event]
      emulation-faults                                   [Software event]
    
      L1-dcache-loads                                    [Hardware cache event]
      L1-dcache-load-misses                              [Hardware cache event]
      L1-dcache-stores                                   [Hardware cache event]
      L1-dcache-store-misses                             [Hardware cache event]
      L1-dcache-prefetches                               [Hardware cache event]
      L1-dcache-prefetch-misses                          [Hardware cache event]
      L1-icache-loads                                    [Hardware cache event]
      L1-icache-load-misses                              [Hardware cache event]
      L1-icache-prefetches                               [Hardware cache event]
      L1-icache-prefetch-misses                          [Hardware cache event]
      LLC-loads                                          [Hardware cache event]
      LLC-load-misses                                    [Hardware cache event]
      LLC-stores                                         [Hardware cache event]
      LLC-store-misses                                   [Hardware cache event]
      LLC-prefetches                                     [Hardware cache event]
      LLC-prefetch-misses                                [Hardware cache event]
      dTLB-loads                                         [Hardware cache event]
      dTLB-load-misses                                   [Hardware cache event]
      dTLB-stores                                        [Hardware cache event]
      dTLB-store-misses                                  [Hardware cache event]
      dTLB-prefetches                                    [Hardware cache event]
      dTLB-prefetch-misses                               [Hardware cache event]
      iTLB-loads                                         [Hardware cache event]
      iTLB-load-misses                                   [Hardware cache event]
      branch-loads                                       [Hardware cache event]
      branch-load-misses                                 [Hardware cache event]
    
      rNNN                                               [Raw hardware event descriptor]
      cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
      
      sunrpc:rpc_call_status                             [Tracepoint event]
      sunrpc:rpc_bind_status                             [Tracepoint event]
      sunrpc:rpc_connect_status                          [Tracepoint event]
      sunrpc:rpc_task_begin                              [Tracepoint event]
      sunrpc:rpc_task_run_action                         [Tracepoint event]
      sunrpc:rpc_task_complete                           [Tracepoint event]
      sunrpc:rpc_task_sleep                              [Tracepoint event]
      sunrpc:rpc_task_wakeup                             [Tracepoint event]
      ext4:ext4_free_inode                               [Tracepoint event]
      ext4:ext4_request_inode                            [Tracepoint event]
      ext4:ext4_allocate_inode                           [Tracepoint event]
      ext4:ext4_write_begin                              [Tracepoint event]
      ext4:ext4_da_write_begin                           [Tracepoint event]
    	省略............

    当中通过perf list命令执行在不同的系统会列出不同的结果,在 2.6.35 版本号的内核中,该列表已经相当的长,但不管有多少,我们能够将它们划分为三类:

    Hardware Event 是由 PMU 硬件产生的事件。比方 cache 命中,当您须要了解程序对硬件特性的使用情况时,便须要对这些事件进行採样;
    Software Event 是内核软件产生的事件。比方进程切换,tick 数等 ;
    Tracepoint event 是内核中的静态 tracepoint 所触发的事件,这些 tracepoint 用来推断程序执行期间内核的行为细节。比方 slab 分配器的分配次数等。


    在操作系统执行过程中。关于系统调用的调度优先级别,从高到低是这种:
    硬中断->软中断->实时进程->内核进程->用户进程

    Hard Interrupts – Devices tell the kernel that they are done processing. For example, a NIC delivers a packet or a hard drive provides an IO request(硬中断)
    Soft Interrupts – These are kernel software interrupts that have to do with maintenance of the kernel. For example, the kernel clock tick thread is a soft interrupt. It checks to make sure a process has not passed its allotted time on a processor.(软中断)
    Real Time Processes – Real time processes have more priority than the kernel itself. A real time process may come on the CPU and preempt (or “kick off”) the kernel. The Linux 2.4 kernel is NOT a fully preemptable kernel, making it not ideal for real time application programming.(实时进程,会抢占其它非实时进程时间片)
    Kernel (System) Processes – All kernel processing is handled at this level of priority. (内核进程)
    User Processes – This space is often referred to as “userland”. All software applications run in the user space. This space has the lowest priority in the kernel scheduling mechanism.(用户进程)


    • perf top

    Samples: 475  of event 'cycles', Event count (approx.): 165419249
     10.30%  [kernel]          [k] kallsyms_expand_symbol
      7.75%  perf              [.] 0x000000000005f190
      5.68%  libc-2.12.so      [.] memcpy
      5.45%  [kernel]          [k] format_decode
      5.45%  libc-2.12.so      [.] __strcmp_sse42
      5.45%  perf              [.] symbols__insert
      5.24%  [kernel]          [k] memcpy
      4.86%  [kernel]          [k] vsnprintf
      4.62%  [kernel]          [k] string
      4.45%  [kernel]          [k] number
      3.35%  libc-2.12.so      [.] __strstr_sse42
      3.15%  perf              [.] hex2u64
      2.72%  perf              [.] rb_next
      2.10%  [kernel]          [k] pointer
      1.89%  [kernel]          [k] security_real_capable_noaudit
      1.89%  [kernel]          [k] strnlen
      1.88%  perf              [.] rb_insert_color
      1.68%  libc-2.12.so      [.] _int_malloc
      1.68%  libc-2.12.so      [.] _IO_getdelim
      1.28%  [kernel]          [k] update_iter
      1.05%  [kernel]          [k] s_show
      1.05%  libc-2.12.so      [.] memchr
      1.05%  [kernel]          [k] get_task_cred
      0.88%  [kernel]          [k] seq_read
      0.85%  [kernel]          [k] clear_page_c
      0.84%  perf              [.] symbol__new
      0.63%  libc-2.12.so      [.] __libc_calloc
      0.63%  [kernel]          [k] copy_user_generic_string
      0.63%  [kernel]          [k] seq_vprintf
      0.63%  perf              [.] kallsyms__parse
      0.63%  libc-2.12.so      [.] _IO_feof
      0.63%  [kernel]          [k] strcmp
      0.63%  [kernel]          [k] page_fault
      0.63%  perf              [.] dso__load_sym
      0.42%  perf              [.] strstr@plt
      0.42%  libc-2.12.so      [.] __strchr_sse42
      0.42%  [kernel]          [k] seq_printf
      0.42%  libc-2.12.so      [.] __memset_sse2
      0.42%  libelf-0.152.so   [.] gelf_getsym
      0.25%  [kernel]          [k] __mutex_init
      0.21%  [kernel]          [k] _spin_lock
      0.21%  [kernel]          [k] s_next
      0.21%  [kernel]          [k] fsnotify
      0.21%  [kernel]          [k] sys_read
      0.21%  [kernel]          [k] __link_path_walk
      0.21%  [kernel]          [k] intel_pmu_disable_all
      0.21%  [kernel]          [k] mem_cgroup_charge_common
      0.21%  [kernel]          [k] update_shares
      0.21%  [kernel]          [k] native_read_tsc
      0.21%  [kernel]          [k] apic_timer_interrupt
      0.21%  [kernel]          [k] auditsys
      0.21%  [kernel]          [k] do_softirq
      0.21%  [kernel]          [k] update_process_times
      0.21%  [kernel]          [k] native_sched_clock
      0.21%  [kernel]          [k] mutex_lock
      0.21%  [kernel]          [k] module_get_kallsym
    perf top 类似top的作用。能够总体查看系统的所有事件而且按数量排序。

    • perf stat

    perf stat能够直接接上命令或者通过pid形式对进程进行性能统计,完毕后会打印出这个过程的性能数据统计

    [root@localhost test]# perf stat -B dd if=/dev/zero of=/dev/null count=1000000
    记录了1000000+0 的读入
    记录了1000000+0 的写出
    512000000字节(512 MB)已复制,0.790059 秒,648 MB/秒
    
     Performance counter stats for 'dd if=/dev/zero of=/dev/null count=1000000':
    
            791.176286 task-clock                #    1.000 CPUs utilized          
                     1 context-switches          #    0.001 K/sec                  
                     0 cpu-migrations            #    0.000 K/sec                  
                   247 page-faults               #    0.312 K/sec                  
         1,248,519,452 cycles                    #    1.578 GHz                     [83.29%]
           366,166,452 stalled-cycles-frontend   #   29.33% frontend cycles idle    [83.33%]
           155,120,002 stalled-cycles-backend    #   12.42% backend  cycles idle    [66.62%]
         1,947,919,494 instructions              #    1.56  insns per cycle        
                                                 #    0.19  stalled cycles per insn [83.31%]
           355,465,524 branches                  #  449.287 M/sec                   [83.41%]
             2,021,648 branch-misses             #    0.57% of all branches         [83.42%]
    
           0.791116595 seconds time elapsed

    通过pid指定某进程进行性能监測,按ctrl-C结束,然后计算这段时间内的性能统计

    [root@localhost test]# perf stat -p 18669
    ^C
     Performance counter stats for process id '18669':
    
              1.520699 task-clock                #    0.001 CPUs utilized          
                    56 context-switches          #    0.037 M/sec                  
                     0 cpu-migrations            #    0.000 K/sec                  
                     0 page-faults               #    0.000 K/sec                  
             2,178,120 cycles                    #    1.432 GHz                     [63.18%]
             1,410,393 stalled-cycles-frontend   #   64.75% frontend cycles idle    [90.94%]
               942,665 stalled-cycles-backend    #   43.28% backend  cycles idle   
             1,067,824 instructions              #    0.49  insns per cycle        
                                                 #    1.32  stalled cycles per insn
               193,104 branches                  #  126.984 M/sec                  
                14,544 branch-misses             #    7.53% of all branches         [61.93%]
    
           2.061889979 seconds time elapsed

    [root@localhost test]# perf stat
    
     usage: perf stat [<options>] [<command>]
    
        -e, --event <event>   event selector. use 'perf list' to list available events
            --filter <filter>
                              event filter
        -i, --no-inherit      child tasks do not inherit counters
        -p, --pid <pid>       stat events on existing process id
        -t, --tid <tid>       stat events on existing thread id
        -a, --all-cpus        system-wide collection from all CPUs
        -g, --group           put the counters into a counter group
        -c, --scale           scale/normalize counters
        -v, --verbose         be more verbose (show counter open errors, etc)
        -r, --repeat <n>      repeat command and print average + stddev (max: 100)
        -n, --null            null run - dont start any counters
        -d, --detailed        detailed run - start a lot of events
        -S, --sync            call sync() before starting a run
        -B, --big-num         print large numbers with thousands' separators
        -C, --cpu <cpu>       list of cpus to monitor in system-wide
        -A, --no-aggr         disable CPU count aggregation
        -x, --field-separator <separator>
                              print counts with custom separator
        -G, --cgroup <name>   monitor event in cgroup name only
        -o, --output <file>   output file name
            --append          append to the output file
            --log-fd <n>      log output to fd, instead of stderr
            --pre <command>   command to run prior to the measured command
            --post <command>  command to run after to the measured command

    perf stat 还有其它实用的选项如 -e 是匹配指定事件

    perf stat -e cycles,instructions,cache-misses [...]
    
    也能够指定监控的级别

    perf stat -e cycles:u dd if=/dev/zero of=/dev/null count=100000
    详细含义是:

    Modifiers Description Example
    u monitor at priv level 3, 2, 1 (user) event:u
    k monitor at priv level 0 (kernel) event:k
    h monitor hypervisor events on a virtualization environment event:h
    H monitor host machine on a virtualization environment event:H
    G monitor guest machine on a virtualization environment event:G

    这些修饰符事实上相当于flag标记的布尔类型,u是用户态,k是内核态。后面三个是和虚拟化相关的。



    操作系统资源一般分为四大部分:CPU、内存、IO、网络

    easy出现瓶颈的通常是CPU和IO。所以我这里写了两个小程序分别模拟这两种情况,然后通过使用perf一系列的工具找到症结所在。

    程序1是计算素数,专门用来耗系统CPU的

    #include <stdio.h>
    int IsPrime(int num)
    {
     int i=2;
     for(;i<=num/2;i++)
      if(0==num%i)
       return 0;
     return 1;
    }
    void main()
    {
     int num;
     for(num=2;num<=10000000;num++)
        IsPrime(num);
        
    }

    程序2是开n个线程。每隔1ms写1K的数据到文件,读1k的数据。用来耗系统IO

    #include <unistd.h>
    #include <stdio.h>
    #include <string>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <vector>
    using namespace std;
    
    bool gquit = false;
    
    void* threadfun(void* param)
    {
        string filename = *(string*)param;
    
        unlink(filename.c_str());
        
        int fd = open(filename.c_str(), O_CREAT | O_RDWR, 666);
        if (fd < 0) return 0;
    
        string a(1024*16*1024, 'a');
        bool flag = true;
        while (!gquit)
        {
    
            if (flag)
            {
                write(fd, a.c_str(), a.length());
            }
            else
            {
                read(fd, &a[0], a.length());    
                lseek(fd, 0, SEEK_SET);
            }
            flag = !flag;
            usleep(1);
        }
        close(fd);
        return 0;
    }
    
    
    int main(int argc, char* argv[])
    {
        if (argc != 2)
        {
            printf("need thread num
    ");
            return 0;
        }
            
        int threadnum = atoi(argv[1]);
    
        vector<string> v(threadnum, "./testio.filex_");
        vector<pthread_t> vpid(threadnum, 0);
        for(int i = 0 ; i < threadnum ; ++ i)
        {
            pthread_attr_t attr;
            pthread_attr_init(&attr);
    
            static char buf[32];
            snprintf(buf, sizeof(buf), "%d", i);
            v[i] += buf;
            pthread_create(&vpid[i], & attr, threadfun, (void*)&v[i]); 
            
        }
        printf("press q to quit
    ");    
        
        while(getchar() != 'q');
    
        gquit = true;
        for(int i = 0 ; i < threadnum ; ++ i)
        {
            void* ret = NULL;
            pthread_join(vpid[i], &ret);    
            unlink(v[i].c_str());
        }   
        return 0;
    }
    

    我先执行程序1后。通过perf top进行全局监控

    Samples: 1K of event 'cycles', Event count (approx.): 690036067
     74.14%  cpu               [.] IsPrime
      2.45%  [kernel]          [k] kallsyms_expand_symbol
      2.21%  perf              [.] 0x000000000005f225
      1.65%  perf              [.] symbols__insert
      1.50%  perf              [.] hex2u64
      1.41%  libc-2.12.so      [.] __strcmp_sse42
      1.26%  [kernel]          [k] string
      1.25%  [kernel]          [k] format_decode
      1.10%  [kernel]          [k] vsnprintf
      0.95%  [kernel]          [k] strnlen
      0.85%  [kernel]          [k] memcpy
      0.80%  libc-2.12.so      [.] memcpy
      0.70%  [kernel]          [k] number
      0.60%  perf              [.] rb_next
      0.60%  libc-2.12.so      [.] _int_malloc
      0.55%  libc-2.12.so      [.] __strstr_sse42
      0.55%  [kernel]          [k] pointer
      0.55%  perf              [.] rb_insert_color

    能够看出排名第一是cpu这个进程的IsPrime函数占用了最多的CPU性能

    然后通过perf stat 指定pid进行监控

    [root@localhost test]# perf stat -p 10704
    ^C
     Performance counter stats for process id '10704':
    
           4067.488066 task-clock                #    1.001 CPUs utilized           [100.00%]
                     5 context-switches          #    0.001 K/sec                   [100.00%]
                     0 cpu-migrations            #    0.000 K/sec                   [100.00%]
                     0 page-faults               #    0.000 K/sec                  
        10,743,902,971 cycles                    #    2.641 GHz                     [83.32%]
         5,863,649,174 stalled-cycles-frontend   #   54.58% frontend cycles idle    [83.32%]
         1,347,218,352 stalled-cycles-backend    #   12.54% backend  cycles idle    [66.69%]
        14,625,550,503 instructions              #    1.36  insns per cycle        
                                                 #    0.40  stalled cycles per insn [83.34%]
         1,950,981,334 branches                  #  479.653 M/sec                   [83.35%]
                46,273 branch-misses             #    0.00% of all branches         [83.32%]
    
           4.064800952 seconds time elapsed
    

    能够看到各种系统事件的统计信息

    然后使用perf record和report进行更加详细的分析

    [root@localhost test]# perf record -g -- ./cpu    
    ^C[ perf record: Woken up 4 times to write data ]
    [ perf record: Captured and wrote 0.875 MB perf.data (~38244 samples) ]
    ./cpu: Interrupt
    
    
    [root@localhost test]# perf report
    Samples: 11K of event 'cycles', Event count (approx.):6815943846
    + 99.70% cpu cpu [.] IsPrime
    + 0.05% cpu [kernel.kallsyms] [k] hrtimer_interrupt
    + 0.03% cpu cpu [.] main
    + 0.02% cpu [kernel.kallsyms] [k] native_write_msr_safe
    + 0.02% cpu [kernel.kallsyms] [k] _spin_unlock_irqrestore
    + 0.02% cpu [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
    + 0.01% cpu [kernel.kallsyms] [k] rb_insert_color
    + 0.01% cpu [kernel.kallsyms] [k] raise_softirq
    + 0.01% cpu [kernel.kallsyms] [k] rcu_process_gp_end
    + 0.01% cpu [kernel.kallsyms] [k] scheduler_tick
    + 0.01% cpu [kernel.kallsyms] [k] do_timer
    + 0.01% cpu [kernel.kallsyms] [k] _spin_lock_irqsave
    + 0.01% cpu [kernel.kallsyms] [k] apic_timer_interrupt
    + 0.01% cpu [kernel.kallsyms] [k] tick_sched_timer
    + 0.01% cpu [kernel.kallsyms] [k] _spin_lock
    + 0.01% cpu [kernel.kallsyms] [k] __remove_hrtimer
    + 0.01% cpu [kernel.kallsyms] [k] idle_cpu
    

    
    非常清晰能够看到各种函数的调用情况,因为这个程序是个非常easy的代码。所以easy看出问题所在,可是假设是大项目的某个功能出现故障,通过perf工具就能够非常方便地定位问题,后面再看看一个占用IO的样例。
    

    相同也是先启动程序2,先通过perf top、perf stat -p pid这两个命令先定位是哪个进程导致性能问题

    Samples: 394K of event 'cycles', Event count (approx.): 188517759438
     75.39%  [kernel]             [k] _spin_lock
     11.39%  [kernel]             [k] copy_user_generic_string
      1.89%  [jbd2]               [k] jbd2_journal_stop
      1.66%  [kernel]             [k] find_get_page
      1.51%  [jbd2]               [k] start_this_handle
      0.95%  [ext4]               [k] ext4_da_write_end
      0.76%  [kernel]             [k] iov_iter_fault_in_readable
      0.74%  [ext4]               [k] ext4_journal_start_sb
      0.56%  [jbd2]               [k] __jbd2_log_space_left
      0.44%  [ext4]               [k] __ext4_journal_stop
      0.44%  [kernel]             [k] __block_prepare_write
      0.41%  [kernel]             [k] __wake_up_bit
      0.35%  [kernel]             [k] _cond_resched
      0.32%  [kernel]             [k] kmem_cache_free

    能够看到非常多 spin lock的事件和各种文件操作事件


    直接通过perf record和report进行分析,得出结果

    [root@localhost test]# perf record -g -- ./io 10
    press q to quit
    q
    [ perf record: Woken up 83 times to write data ]
    [ perf record: Captured and wrote 23.361 MB perf.data (~1020643 samples) ]
    
    [root@localhost test]# perf report
    Samples: 138K of event 'cycles', Event count (approx.): 64858537028
    +  65.87%  io  [kernel.kallsyms]    [k] _spin_lock
    +  14.19%  io  [kernel.kallsyms]    [k] copy_user_generic_string
    +   2.74%  io  [jbd2]               [k] jbd2_journal_stop
    +   2.46%  io  [kernel.kallsyms]    [k] find_get_page
    +   2.11%  io  [jbd2]               [k] start_this_handle
    +   1.06%  io  [kernel.kallsyms]    [k] iov_iter_fault_in_readable
    +   1.05%  io  [ext4]               [k] ext4_journal_start_sb
    +   0.85%  io  [ext4]               [k] ext4_da_write_end
    +   0.73%  io  [jbd2]               [k] __jbd2_log_space_left
    +   0.66%  io  [kernel.kallsyms]    [k] generic_write_end
    +   0.61%  io  [ext4]               [k] __ext4_journal_stop
    +   0.57%  io  [kernel.kallsyms]    [k] __block_prepare_write
    +   0.54%  io  [kernel.kallsyms]    [k] __wake_up_bit


    好的,以上就是perf经常使用的工具,当然perf不止这几个工具。但这几个是最基础最经常使用的工具

    通过查看官方wiki https://perf.wiki.kernel.org/index.php/Tutorial 能够了解很多其它信息





  • 相关阅读:
    怎样用HTML5 Canvas制作一个简单的游戏
    js面向对象
    javascript闭包
    javascript变量的作用域
    5-22
    5-23
    14-5-21 硬代码
    14-5-19 类和对象
    array
    生成干扰线
  • 原文地址:https://www.cnblogs.com/wzzkaifa/p/6993784.html
Copyright © 2011-2022 走看看