  • How do likely and unlikely optimize code?

            
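      When writing an if test, you can use the __builtin_expect built-in provided by GCC to help the compiler optimize the code and make it run faster; see "3.10 Options That Control Optimization" in the GCC manual.
      The principle is as follows. A CPU executes instructions in a pipeline, so each instruction passes roughly through "fetch --> decode --> execute". In this simplified model, if the CPU finds out during execution that it has to jump, it flushes the pipeline and restarts "fetch --> decode --> execute" from the new address, which lowers execution efficiency. Reducing the likelihood of a jump on the common path (that is, how often the pipeline is flushed) therefore improves execution efficiency.
      The following simple program is used as an example.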
    #include <stdio.h>

    /* !!(x) normalizes any non-zero value to 1 before it is compared
     * with the expected value. */
    #define likely(x)    __builtin_expect(!!(x), 1)
    #define unlikely(x)  __builtin_expect(!!(x), 0)

    /* Hint: a >= 0 is expected to be false most of the time. */
    void func1(int a)
    {
        int b;

        if (unlikely(a >= 0)) {
            b = a + 1;
            printf("b = %d\n", b);
        } else {
            b = a + 2;
            printf("b = %d\n", b);
        }
    }

    /* Hint: a >= 0 is expected to be true most of the time. */
    void func2(int a)
    {
        int b;

        if (likely(a >= 0)) {
            b = a + 1;
            printf("b = %d\n", b);
        } else {
            b = a + 2;
            printf("b = %d\n", b);
        }
    }

    int main(void)
    {
        int a = 0;

        /* The format string expects input of the form "a = 5". */
        scanf("a = %d", &a);

        func1(a);
        func2(a);

        return 0;
    }

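      likely(x) is meant for cases where x is more likely to be true, and unlikely(x) for cases where x is more likely to be false. The ultimate goal of both macros is to keep jumps off the common path, because every jump flushes the pipeline and lowers efficiency.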

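    For this optimization to take effect, an optimization level has to be specified, because the default is -O0, which performs no optimization. Below is the disassembly at -O0: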
    00000000004005bc <func1>:
      4005bc:    a9bd7bfd     stp    x29, x30, [sp, #-48]!
      4005c0:    910003fd     mov    x29, sp
      4005c4:    b9001fa0     str    w0, [x29, #28]
      4005c8:    b9401fa0     ldr    w0, [x29, #28]
      4005cc:    2a2003e0     mvn    w0, w0
      4005d0:    531f7c00     lsr    w0, w0, #31
      4005d4:    12001c00     and    w0, w0, #0xff
      4005d8:    92401c00     and    x0, x0, #0xff
      4005dc:    f100001f     cmp    x0, #0x0
      4005e0:    54000120     b.eq    400604 <func1+0x48>  // b.none
      4005e4:    b9401fa0     ldr    w0, [x29, #28]
      4005e8:    11000400     add    w0, w0, #0x1
      4005ec:    b9002fa0     str    w0, [x29, #44]
      4005f0:    90000000     adrp    x0, 400000 <_init-0x430>
      4005f4:    911e4000     add    x0, x0, #0x790
      4005f8:    b9402fa1     ldr    w1, [x29, #44]
      4005fc:    97ffffad     bl    4004b0 <printf@plt>
      400600:    14000008     b    400620 <func1+0x64>
      400604:    b9401fa0     ldr    w0, [x29, #28]
      400608:    11000800     add    w0, w0, #0x2
      40060c:    b9002fa0     str    w0, [x29, #44]
      400610:    90000000     adrp    x0, 400000 <_init-0x430>
      400614:    911e4000     add    x0, x0, #0x790
      400618:    b9402fa1     ldr    w1, [x29, #44]
      40061c:    97ffffa5     bl    4004b0 <printf@plt>
      400620:    d503201f     nop
      400624:    a8c37bfd     ldp    x29, x30, [sp], #48
      400628:    d65f03c0     ret
    
    000000000040062c <func2>:
      40062c:    a9bd7bfd     stp    x29, x30, [sp, #-48]!
      400630:    910003fd     mov    x29, sp
      400634:    b9001fa0     str    w0, [x29, #28]
      400638:    b9401fa0     ldr    w0, [x29, #28]
      40063c:    2a2003e0     mvn    w0, w0
      400640:    531f7c00     lsr    w0, w0, #31
      400644:    12001c00     and    w0, w0, #0xff
      400648:    92401c00     and    x0, x0, #0xff
      40064c:    f100001f     cmp    x0, #0x0
      400650:    54000120     b.eq    400674 <func2+0x48>  // b.none
      400654:    b9401fa0     ldr    w0, [x29, #28]
      400658:    11000400     add    w0, w0, #0x1
      40065c:    b9002fa0     str    w0, [x29, #44]
      400660:    90000000     adrp    x0, 400000 <_init-0x430>
      400664:    911e4000     add    x0, x0, #0x790
      400668:    b9402fa1     ldr    w1, [x29, #44]
      40066c:    97ffff91     bl    4004b0 <printf@plt>
      400670:    14000008     b    400690 <func2+0x64>
      400674:    b9401fa0     ldr    w0, [x29, #28]
      400678:    11000800     add    w0, w0, #0x2
      40067c:    b9002fa0     str    w0, [x29, #44]
      400680:    90000000     adrp    x0, 400000 <_init-0x430>
      400684:    911e4000     add    x0, x0, #0x790
      400688:    b9402fa1     ldr    w1, [x29, #44]
      40068c:    97ffff89     bl    4004b0 <printf@plt>
      400690:    d503201f     nop
      400694:    a8c37bfd     ldp    x29, x30, [sp], #48
      400698:    d65f03c0     ret

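    As you can see, the disassembly follows the C logic step by step, exactly as written. Apart from the addresses, the two functions are identical, so the hint macros have no effect at all here.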

    Let's first look at the effect of -O1. GCC describes -O and -O1 as follows: the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.

    aarch64-linux-gnu-gcc predict.c -o predict -O1
    aarch64-linux-gnu-objdump -D predict > predict.S

    Here is the disassembly of func1:

    00000000004005bc <func1>:
      4005bc:    a9bf7bfd     stp    x29, x30, [sp, #-16]!
      4005c0:    910003fd     mov    x29, sp
      4005c4:    36f800e0     tbz    w0, #31, 4005e0 <func1+0x24>
      4005c8:    11000801     add    w1, w0, #0x2
      4005cc:    90000000     adrp    x0, 400000 <_init-0x430>
      4005d0:    911c6000     add    x0, x0, #0x718
      4005d4:    97ffffb7     bl    4004b0 <printf@plt>
      4005d8:    a8c17bfd     ldp    x29, x30, [sp], #16
      4005dc:    d65f03c0     ret
      4005e0:    11000401     add    w1, w0, #0x1
      4005e4:    90000000     adrp    x0, 400000 <_init-0x430>
      4005e8:    911c6000     add    x0, x0, #0x718
      4005ec:    97ffffb1     bl    4004b0 <printf@plt>
      4005f0:    17fffffa     b    4005d8 <func1+0x1c>
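      In func1, unlikely(a >= 0) says the condition is rarely true, so to reduce jumps the code of the else branch should be placed first. The instructions on the expected path then execute one right after another without jumping (the PC simply advances by 4 each time), the pipeline does not need to be flushed, and the code runs faster. func2 is the opposite: likely(a >= 0) says the condition is usually true, so to reduce branching the code of the if branch has to be placed first. Here is the disassembly of func2: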
    
    
    00000000004005f4 <func2>:
      4005f4:    a9bf7bfd     stp    x29, x30, [sp, #-16]!
      4005f8:    910003fd     mov    x29, sp
      4005fc:    37f800e0     tbnz    w0, #31, 400618 <func2+0x24>
      400600:    11000401     add    w1, w0, #0x1
      400604:    90000000     adrp    x0, 400000 <_init-0x430>
      400608:    911c6000     add    x0, x0, #0x718
      40060c:    97ffffa9     bl    4004b0 <printf@plt>
      400610:    a8c17bfd     ldp    x29, x30, [sp], #16
      400614:    d65f03c0     ret
      400618:    11000801     add    w1, w0, #0x2
      40061c:    90000000     adrp    x0, 400000 <_init-0x430>
      400620:    911c6000     add    x0, x0, #0x718
      400624:    97ffffa3     bl    4004b0 <printf@plt>
      400628:    17fffffa     b    400610 <func2+0x1c>
    Of course, if likely and unlikely do not match what actually happens at run time, execution efficiency gets worse instead.
     
      Next, let's look at how the different optimization levels affect the generated machine code:
      -O2: Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to ‘-O’, this option increases both compilation time and the performance of the generated code.
    00000000004005f8 <func1>:
      4005f8:    90000002     adrp    x2, 400000 <_init-0x430>
      4005fc:    36f80080     tbz    w0, #31, 40060c <func1+0x14>
      400600:    11000801     add    w1, w0, #0x2
      400604:    911ba040     add    x0, x2, #0x6e8
      400608:    17ffffaa     b    4004b0 <printf@plt>
      40060c:    11000401     add    w1, w0, #0x1
      400610:    911ba040     add    x0, x2, #0x6e8
      400614:    17ffffa7     b    4004b0 <printf@plt>
    
    0000000000400618 <func2>:
      400618:    90000002     adrp    x2, 400000 <_init-0x430>
      40061c:    37f80080     tbnz    w0, #31, 40062c <func2+0x14>
      400620:    11000401     add    w1, w0, #0x1
      400624:    911ba040     add    x0, x2, #0x6e8
      400628:    17ffffa2     b    4004b0 <printf@plt>
      40062c:    11000801     add    w1, w0, #0x2
      400630:    911ba040     add    x0, x2, #0x6e8
      400634:    17ffff9f     b    4004b0 <printf@plt>
      -O3: Optimize yet more. ‘-O3’ turns on all optimizations specified by ‘-O2’ and also turns on more optimization flags.
    00000000004005f8 <func1>:
      4005f8:    90000002     adrp    x2, 400000 <_init-0x430>
      4005fc:    36f80080     tbz    w0, #31, 40060c <func1+0x14>
      400600:    11000801     add    w1, w0, #0x2
      400604:    911ba040     add    x0, x2, #0x6e8
      400608:    17ffffaa     b    4004b0 <printf@plt>
      40060c:    11000401     add    w1, w0, #0x1
      400610:    911ba040     add    x0, x2, #0x6e8
      400614:    17ffffa7     b    4004b0 <printf@plt>
    
    0000000000400618 <func2>:
      400618:    90000002     adrp    x2, 400000 <_init-0x430>
      40061c:    37f80080     tbnz    w0, #31, 40062c <func2+0x14>
      400620:    11000401     add    w1, w0, #0x1
      400624:    911ba040     add    x0, x2, #0x6e8
      400628:    17ffffa2     b    4004b0 <printf@plt>
      40062c:    11000801     add    w1, w0, #0x2
      400630:    911ba040     add    x0, x2, #0x6e8
      400634:    17ffff9f     b    4004b0 <printf@plt> 
     
      -Os: Optimize for size. ‘-Os’ enables all ‘-O2’ optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
    00000000004005f4 <func1>:
      4005f4:    90000002     adrp    x2, 400000 <_init-0x430>
      4005f8:    37f80080     tbnz    w0, #31, 400608 <func1+0x14>
      4005fc:    11000401     add    w1, w0, #0x1
      400600:    911b8040     add    x0, x2, #0x6e0
      400604:    17ffffab     b    4004b0 <printf@plt>
      400608:    11000801     add    w1, w0, #0x2
      40060c:    17fffffd     b    400600 <func1+0xc>
    
    0000000000400610 <func2>:
      400610:    90000002     adrp    x2, 400000 <_init-0x430>
      400614:    37f80080     tbnz    w0, #31, 400624 <func2+0x14>
      400618:    11000401     add    w1, w0, #0x1
      40061c:    911b8040     add    x0, x2, #0x6e0
      400620:    17ffffa4     b    4004b0 <printf@plt>
      400624:    11000801     add    w1, w0, #0x2
      400628:    17fffffd     b    40061c <func2+0xc>
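    -Os mainly optimizes for code size (as you can see, the two functions disassemble to the fewest instructions at this level), but it is somewhat worse in terms of execution efficiency: func1 and func2 get exactly the same layout, so likely and unlikely have no optimizing effect on the code here.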
     
    The end.
  • Original article: https://www.cnblogs.com/pengdonglin137/p/11021901.html