zoukankan html css js c++ java

Using SSE and AVX to accelerate the program

Today I compare the program speeding for three cases: normal code, with SSE and with AVX.

Normal code

// speed_test.c
const int NUM = 1000000;
static float X[NUM], Y[NUM], Z[NUM];
void dig_hole() {
    for(int j = 0; j < 1000; j++) {
        for(int i = 0; i < NUM; i++) {
            Z[i] = X[i] + Y[i];
        }
    }
}
int main() {
    for(int i = 0; i < NUM; i++) {
        X[i] = i * 1.2;
        Y[i] = i * 1.8;
    }
    dig_hole();
}

Compiling with g++ speed_test.c -o speed_test(g++ version 4.8.5) and run with Intel Xeon Processor E5-2620 v4:

time ./speed_test.c

SSE code

// speed_test_simd.c
#include <immintrin.h>
const int NUM = 1000000;
static float X[NUM], Y[NUM], Z[NUM];
void dig_hole() {
    __m128 a,b,c;
    for(int j = 0; j < 1000; j++) {
        for(int i = 0; i < NUM; i += 4) {
            a = _mm_load_ps(&X[i]);
            b = _mm_load_ps(&Y[i]);
            c = _mm_add_ps(a, b);
            _mm_store_ps(&Z[i], c);
        }
    }
}
int main() {
    for(int i = 0; i < NUM; i++) {
        X[i] = i * 1.2;
        Y[i] = i * 1.8;
    }
    dig_hole();
}

AVX code

// speed_test_avx.c
#include <immintrin.h>
const int NUM = 1000000;
static float X[NUM], Y[NUM], Z[NUM];
void dig_hole() {
    __m256 a,b,c;
    for(int j = 0; j < 1000; j++) {
        for(int i = 0; i < NUM; i += 8) {
            a = _mm256_load_ps(&X[i]);
            b = _mm256_load_ps(&Y[i]);
            c = _mm256_add_ps(a, b);
            _mm256_store_ps(&Z[i], c);
        }
    }
}
int main() {
    for(int i = 0; i < NUM; i++) {
        X[i] = i * 1.2;
        Y[i] = i * 1.8;
    }
    dig_hole();
}

Comping with g++ --mavx2 speed_test_avx.c -o speed_test_avx.

Result

Debug mode:

case	real	user	sys
normal	0m3.907s	0m3.890s	0m0.015s
sse	0m2.083s	0m2.072s	0m0.011s
avx	0m1.397s	0m1.384s	0m0.012s

Release mode (`-O2`):

case	real	user	sys
normal	0m0.887s	0m0.875s	0m0.011s
sse	0m0.590s	0m0.578s	0m0.012s
avx	0m0.594s	0m0.578s	0m0.016s

Analysis

Under Debug mode, codes are translated literally. SSE can speed up the program since it processes
4 float numbers at each inner loop. AVX can process 8 float numbers (32Bit * 8 = 256, 1 Float is 32Bit=4Byte long).

Under Release mode, the speed is not so much since the compiler has many good optimizations for normal operations as well. Especially, SSE and AVX are of almost the same performance.

查看全文

相关阅读:
ScrollVIEW 2000个ITEM不会卡
 嵌套ScrollView 左右滑动不影响上下滑动
 初学数据结构——栈和队列
 初学数据结构——单向循环链表和双向循环链表。
初学数据结构——单链表
 bootstrap模态框垂直居中
 Javascript经典实例
 Javascript经典实例
 读书笔记-前言
 web中的中文字体的英文名称

原文地址：https://www.cnblogs.com/zhaofeng-shu33/p/12673044.html