
    A bug in the NDK clang compiler

    Problem code

    #include <arm_neon.h>

    // Sums `count` floats: 16 per iteration with NEON, then a scalar tail.
    float32_t Sum_float(float32_t *data, const int count)
    {
        float32x4_t res = vdupq_n_f32(0.0f);
        for(int i = 0; i < (count & (~15)); i += 16)
        {
            #if 1
            // A single intrinsic loads 16 consecutive floats.
            float32x4x4_t v0 = vld1q_f32_x4(data + i);
            float32x4_t v00 = v0.val[0];
            float32x4_t v01 = v0.val[1];
            float32x4_t v02 = v0.val[2];
            float32x4_t v03 = v0.val[3];

            #else
            // Equivalent: four separate loads of 4 floats each.
            float32x4_t v00 = vld1q_f32(data + i);
            float32x4_t v01 = vld1q_f32(data + i + 4);
            float32x4_t v02 = vld1q_f32(data + i + 8);
            float32x4_t v03 = vld1q_f32(data + i + 12);

            #endif

            v00 = vaddq_f32(v00, v02);
            v01 = vaddq_f32(v01, v03);
            res = vaddq_f32(res, vaddq_f32(v00, v01));
        }
        // Horizontal reduction: 4 lanes -> 2 lanes -> scalar.
        float32x2_t res1 = vadd_f32(vget_low_f32(res), vget_high_f32(res));

        float32_t v0[2];
        vst1_f32(v0, res1);
        v0[0] += v0[1];
        // Scalar tail for the remaining (count % 16) elements.
        for(int i = count & (~15); i < count; ++i){
            v0[0] += data[i];
        }

        return v0[0];
    }
    
    

    Compilation tests

    First, I checked https://static.docs.arm.com/ihi0073/c/IHI0073C_arm_neon_intrinsics_ref.pdf. According to that reference, the vld1q_f32_x4 intrinsic is supported on v7, A32 and A64.

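    Results by compiler version: with every NDK version, the code compiles successfully when the #else branch is used. With the #if branch (the vld1q_f32_x4 path), the results are as follows: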

    NDK     armeabi-v7a with -O1    armeabi-v7a with -O0    arm64-v8a
    r20c    error (1)               ok                      ok
    r19c    ok                      ok                      ok
    r15c    error (2)               error (2)               ok

    (1) clang++: error: clang frontend command failed due to signal (use -v to see invocation)
    (2) error: use of undeclared identifier 'vld1q_f32_x4'

    The problem is not limited to vld1q_f32_x4: similar intrinsics such as vld1_u8_x2 and vst1q_f32_x4 behave the same way.
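    One way to sidestep both the undeclared-identifier error and the frontend crash is to avoid the _xN intrinsics altogether. The following is only a minimal sketch, not the code that was tested above: sum_float_portable is a hypothetical name, the NEON path uses just the plain vld1q_f32 loads from the #else branch, and the scalar fallback for compilers that do not define __ARM_NEON is an assumption added here so that the same file also builds on non-ARM hosts.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper: sums `count` floats without any *_x4 intrinsic.
float sum_float_portable(const float *data, std::size_t count)
{
    assert(data != nullptr || count == 0);
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // NEON path: 16 floats per iteration via four plain 4-float loads.
    const std::size_t blocked = count & ~static_cast<std::size_t>(15);
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i < blocked; i += 16) {
        float32x4_t v0 = vld1q_f32(data + i);
        float32x4_t v1 = vld1q_f32(data + i + 4);
        float32x4_t v2 = vld1q_f32(data + i + 8);
        float32x4_t v3 = vld1q_f32(data + i + 12);
        acc = vaddq_f32(acc, vaddq_f32(vaddq_f32(v0, v2), vaddq_f32(v1, v3)));
    }
    // Horizontal reduction: 4 lanes -> 2 lanes -> scalar.
    float32x2_t half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    total = vget_lane_f32(half, 0) + vget_lane_f32(half, 1);
#endif
    // Scalar tail; without NEON this loop handles every element.
    for (; i < count; ++i) {
        total += data[i];
    }
    return total;
}
```

    On a NEON target this performs the additions in the same order as Sum_float; elsewhere it degrades to the scalar loop only.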

    Performance comparison

    Test code:

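    #include <cstdio>
    #include <cstdlib>
    #include <ctime>

    // Sum_float is the function from the problem code above.
    int main()
    {
        const size_t len = 1024*1024 * 16;
        float32_t *data = new float32_t[len];
        for(size_t i = 0; i < len; ++i) {
            data[i] = std::rand() / 100.0f;
        }

        clock_t t0 = std::clock();
        float32_t sum = Sum_float(data, static_cast<int>(len));
        clock_t t1 = std::clock();
        // Report the elapsed CPU time in milliseconds.
        printf("sum=%f , time cost=%f \n", sum, 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC);
        delete[] data;
        return 0;
    }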
    

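    I built the test for arm64-v8a with all three NDK versions, and for armeabi-v7a with r19c, using the #if branch and the #else branch in turn. The run time was about 3.55 ms in every case, with no noticeable difference.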
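    The measurement above is a single run timed with std::clock, which reports CPU time. As a hedged alternative, the sketch below times a callable with std::chrono::steady_clock and keeps the best of several runs. The names elapsed_ms and best_of_ms are hypothetical, and repeating the run is an assumption of this sketch, not something the original test did.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `fn` once and returns the elapsed wall-clock time in ms.
template <typename Fn>
double elapsed_ms(Fn &&fn)
{
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical helper: runs `fn` `repeats` times and returns the fastest run in ms.
template <typename Fn>
double best_of_ms(Fn &&fn, int repeats)
{
    assert(repeats > 0);
    double best = elapsed_ms(fn);
    for (int r = 1; r < repeats; ++r) {
        const double t = elapsed_ms(fn);
        if (t < best) {
            best = t;
        }
    }
    return best;
}
```

    In the test program this could wrap the call as best_of_ms([&] { sum = Sum_float(data, len); }, 5). Taking the minimum reduces the influence of cold caches and scheduling noise on a result in the millisecond range.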

    A similar issue: https://github.com/mattgodbolt/compiler-explorer/issues/1906

    Address alignment

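    Although the armeabi-v7a build succeeds with r19c (and also with r20c when optimization is disabled), the resulting program crashed at run time. The cause was an unaligned address passed to a vldN(q)_type_xN intrinsic.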

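    For the arm64-v8a build, I printed every address passed to vldN(q)_type_xN and found addresses such as 0x7350800001 there as well, with the last hex digit taking every value from 0 to E, yet no error was reported. In other words, does this intrinsic require an aligned address only on armeabi-v7a and not on arm64-v8a?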

    The regular vldN(q)_type intrinsics, by contrast, have no alignment requirement, so it is best to avoid vldN(q)_type_xN.
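    When chasing this kind of crash it helps to test addresses in code, not by reading printed pointers. The sketch below is a hypothetical helper (is_aligned_addr and is_aligned are names invented here). The alignment actually required by the failing instruction is not stated in this post, so the caller has to supply the value to test against.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when `addr` is a multiple of `alignment`.
// `alignment` must be a nonzero power of two.
inline bool is_aligned_addr(std::uintptr_t addr, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

// Pointer overload of the same check.
inline bool is_aligned(const void *ptr, std::size_t alignment)
{
    return is_aligned_addr(reinterpret_cast<std::uintptr_t>(ptr), alignment);
}
```

    For example, the fault address 0xf0900001 from the crash log in this post fails the check even for 4-byte alignment, so a guard such as is_aligned(data + i, 4) could route such pointers to the plain vld1q_f32 path.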

    Crash log produced by the misaligned address:

    libc    : Fatal signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xf0900001 in tid 27659 (ClarityOpt), pid 27659 (ClarityOpt)
    crash_dump32: obtaining output fd from tombstoned, type: kDebuggerdTombstone
    crash_dump32: performing dump of process 27659 (target tid = 27659)
    DEBUG   : Process name is /data/local/tmp/ClarityOpt, not key_process
    DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
    DEBUG   : Build fingerprint: 'OPPO/PCCM00/OP4A7A:10/QKQ1.191222.002/1584699103:user/release-keys'
    DEBUG   : Revision: '0'
    DEBUG   : ABI: 'arm'
    DEBUG   : Timestamp: 2020-05-09 15:15:16+0800
    DEBUG   : pid: 27659, tid: 27659, name: ClarityOpt  >>> /data/local/tmp/ClarityOpt <<<
    DEBUG   : uid: 0
    crash_dump32: type=1400 audit(0.0:27044): avc: denied { read } for name="ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
    crash_dump32: type=1400 audit(0.0:27045): avc: denied { open } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
    crash_dump32: type=1400 audit(0.0:27046): avc: denied { getattr } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
    crash_dump32: type=1400 audit(0.0:27047): avc: denied { map } for path="/data/local/tmp/ClarityOpt" dev="sda11" ino=30524 scontext=u:r:crash_dump:s0 tcontext=u:object_r:shell_data_file:s0 tclass=file permissive=1
    DEBUG   : signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0xf0900001
    DEBUG   :     r0  00000043  r1  00000000  r2  a9a5ac6f  r3  00000003
    DEBUG   :     r4  f0900001  r5  ffcb0a00  r6  ffcb0a40  r7  ffcb0b60
    DEBUG   :     r8  f0900007  r9  00000001  r10 f0900000  r11 f0900000
    DEBUG   :     ip  ffcb0500  sp  ffcb09f0  lr  00000004  pc  021d265e
    DEBUG   : 
    DEBUG   : backtrace:
    DEBUG   :       #00 pc 0000365e  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
    DEBUG   :       #01 pc 00004271  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
    DEBUG   :       #02 pc 00004c9f  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
    DEBUG   :       #03 pc 00004dd3  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
    DEBUG   :       #04 pc 000513bb  /apex/com.android.runtime/lib/bionic/libc.so (__libc_init+66) (BuildId: 8e41d0dce7911ae25a51deb63aa9720c)
    DEBUG   :       #05 pc 00002a98  /data/local/tmp/ClarityOpt (BuildId: fb1d8b990741386becb60ff1c8b10583efb05f70)
    
  • Original article: https://www.cnblogs.com/willhua/p/12858725.html