a possible low-level optimization

http://www.1point3acres.com/bbs/thread-212960-1-1.html

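Second round, with a young white interviewer. He opened with a question I still don't understand: as I recall, you are given a vector<uint8_t> nums and a 256-element vector<int> counts, you iterate over nums and do counts[nums[i]]++, and he asked how to optimize that. The hint was that it involves things like the CPU cache (I had no idea). Seeing that I was stumped, he then gave me a 3sum problem, which I solved quickly.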
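    #include <stddef.h>
    #include <stdint.h>

    uint8_t input[102400];
    uint32_t count[256];

    void count_it() {
        /* size_t index avoids a signed/unsigned comparison with sizeof */
        for (size_t i = 0; i < sizeof(input) / sizeof(input[0]); i++) {
            ++count[input[i]];
        }
    }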
how to optimize? possible points to consider:

a) target "count" array size is 4B*256=1KB, which can fit into L1 cache, so no need to worry about that;

b) input array access is sequential, which is actually cache friendly;

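c) updates to "count" hit adjacent counters that share cache lines; that would cause false sharing only if several threads wrote to the array, and with a single thread it all stays in L1 cache, so that's fine;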

d) optimization 1: the loop could be unrolled to reduce the number of loop-condition checks;

e) optimization 2: input array could be pre-fetched (i.e. insert PREFETCH instructions beforehand);
    /* assumes the array length is a multiple of 64 (102400 is) */
    for (size_t i = 0; i < sizeof(input) / sizeof(input[0]);) {
        /* a typical cache line is 64 bytes, so request the next line early */
        __builtin_prefetch(&input[i + 64], 0, 3); /* prefetch for read, high locality */
        for (int j = 0; j < 8; j++) {
            size_t k = i + j * 8;
            ++count[input[k]];
            ++count[input[k + 1]];
            ++count[input[k + 2]];
            ++count[input[k + 3]];
            ++count[input[k + 4]];
            ++count[input[k + 5]];
            ++count[input[k + 6]];
            ++count[input[k + 7]];
        }
        i += 64;
    }
(see https://gcc.gnu.org/onlinedocs/gcc-5.4.0/gcc/Other-Builtins.html for __builtin_prefetch)

f) optimization 3: multi-threading, but increments to the shared "count" then need to be atomic (a lock-prefixed instruction on x86);

g) optimization 4: vector-extension CPU instructions: a "gather" instruction loads sparse locations (count[xxx]) into a zmm register (512 bits, 64 bytes, i.e. 16 32-bit integers), so 16 input uint8_t values can be processed in one go; then add a constant 512-bit vector that adds 1 to each integer, and the corresponding "scatter" instruction stores the updated counts back. This is only correct as-is when the 16 indices in a batch are all distinct; repeated values within one batch need extra handling.
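For point f), one way to sidestep locked increments altogether is to give each thread its own private table and sum the tables at the end. The following is a minimal sketch under that assumption, not necessarily what the interviewer had in mind; `Histogram` and `count_parallel` are hypothetical names, and the thread count is left to the caller.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

using Histogram = std::array<uint32_t, 256>;

// Hypothetical helper: split the input into num_threads contiguous chunks,
// count each chunk into a thread-private table, then sum the tables.
// No counter is ever written by two threads, so no atomics are needed.
Histogram count_parallel(const uint8_t* data, size_t n, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<Histogram> partial(num_threads, Histogram{});
    std::vector<std::thread> workers;
    const size_t chunk = (n + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const size_t begin = std::min(n, static_cast<size_t>(t) * chunk);
        const size_t end = std::min(n, begin + chunk);
        workers.emplace_back([=, &partial] {
            Histogram local{};  // lives on this thread's own stack
            for (size_t i = begin; i < end; ++i) {
                ++local[data[i]];
            }
            partial[t] = local;  // single write-back per thread
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }

    // Merge: 256 additions per thread, negligible next to the counting itself.
    Histogram total{};
    for (const Histogram& h : partial) {
        for (size_t v = 0; v < total.size(); ++v) {
            total[v] += h[v];
        }
    }
    return total;
}
```

With the arrays from the post this would be called as `count_parallel(input, sizeof(input), 4)`. Whether it pays off for only 100 KB of input is something to measure, since thread start-up cost may outweigh the counting.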
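For point g), it helps to see why a plain gather, add, scatter sequence needs care. The sketch below is a scalar emulation written for illustration only, not real AVX-512 intrinsics; `gather_add_scatter` and `kLanes` are hypothetical names. It processes 16 indices the way the three vector steps would, and shows that increments are lost when the same index appears more than once in a batch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr size_t kLanes = 16;  // 512 bits / 32-bit counters

// Scalar emulation of one vector step over 16 input bytes:
// gather the 16 counters, add 1 to every lane, scatter them back.
void gather_add_scatter(const uint8_t* idx, uint32_t* count) {
    std::array<uint32_t, kLanes> lanes;
    for (size_t l = 0; l < kLanes; ++l) {
        lanes[l] = count[idx[l]];  // gather: all loads see the old values
    }
    for (size_t l = 0; l < kLanes; ++l) {
        lanes[l] += 1;             // vector add of the constant 1
    }
    for (size_t l = 0; l < kLanes; ++l) {
        count[idx[l]] = lanes[l];  // scatter: duplicate indices overwrite each other
    }
}
```

If all 16 indices are distinct, every counter goes up by one as intended. If all 16 are the same value, that counter goes up by one instead of 16, because every lane gathered the same old value. A real implementation would have to detect such duplicates per batch (AVX-512CD provides a conflict-detection instruction for this purpose) or fall back to scalar code for them.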

Original post: https://www.cnblogs.com/qsort/p/6094767.html