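Overview
Recently I needed to optimize indexes on some very long MySQL columns. After discussion, several candidate solutions are still to be decided. One of them is to hash the existing string and build a composite index on the hash plus the original string. That raised the need to compare hash speeds, so this post uses PHP to hash one string 2 million times with each supported algorithm and prints the elapsed time.
System information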
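To make the "hash plus original string" idea concrete, here is a minimal sketch of how the lookup key could be derived on the PHP side. The table name `docs`, the columns `long_text` and `text_hash`, the helper `hash_key()`, and the choice of crc32 are all assumptions for illustration; none of them come from the actual schema.

```php
<?php
// Hypothetical helper: derive a compact integer key from a long column value.
// On 64-bit PHP builds crc32() returns a non-negative integer.
function hash_key(string $value): int
{
    return crc32($value);
}

// Hypothetical schema: table `docs` with columns `long_text` and `text_hash`,
// and a composite index that starts with `text_hash`.
// The hash narrows the search; comparing the original string rules out collisions.
$sql = 'SELECT id FROM docs WHERE text_hash = ? AND long_text = ?';

$value  = str_repeat('some very long field value ', 20);
$params = [hash_key($value), $value];

printf("%s\nhash=%d, length=%d\n", $sql, $params[0], strlen($value));
```

The query string would be passed to a prepared statement together with `$params`. Keeping the original string in the WHERE clause matters because two different strings can share the same hash.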
- Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
- Memory 12GB, Swap 12GB
- HDD 500GB
- Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
- PHP 7.0.33-0+deb9u1 (cli) (built: Dec 7 2018 11:36:49) ( NTS )
Ranking by execution speed (seconds per 2 million hashes, fastest first)
Array
(
[fnv132] => 0.26369595527649
[fnv1a32] => 0.2675929069519
[fnv164] => 0.27193093299866
[adler32] => 0.27417206764221
[fnv1a64] => 0.28172397613525
[joaat] => 0.29366397857666
[crc32b] => 0.34514021873474
[crc32] => 0.37110996246338
[md4] => 0.44389486312866
[md5] => 0.46207499504089
[tiger128,3] => 0.54009604454041
[tiger160,3] => 0.55391597747803
[tiger192,3] => 0.57025694847107
[sha1] => 0.57897710800171
[tiger128,4] => 0.61153793334961
[tiger160,4] => 0.62242317199707
[tiger192,4] => 0.6432900428772
[ripemd128] => 0.80352711677551
[ripemd256] => 0.84451103210449
[ripemd160] => 1.0310969352722
[sha224] => 1.0542829036713
[sha256] => 1.0582711696625
[ripemd320] => 1.0992820262909
[sha384] => 1.3508479595184
[sha512] => 1.396675825119
[haval128,3] => 1.4093809127808
[haval192,3] => 1.4192271232605
[haval160,3] => 1.4261200428009
[haval224,3] => 1.4328649044037
[haval256,3] => 1.443500995636
[haval128,4] => 1.7986199855804
[haval160,4] => 1.8255050182343
[haval192,4] => 1.8294408321381
[haval256,4] => 1.8410999774933
[haval224,4] => 1.841756105423
[haval128,5] => 2.1614220142365
[haval160,5] => 2.1736621856689
[haval192,5] => 2.1849989891052
[haval224,5] => 2.1921010017395
[haval256,5] => 2.1987628936768
[whirlpool] => 2.3075139522552
[gost] => 4.3380508422852
[gost-crypto] => 4.3576400279999
[snefru256] => 6.5909118652344
[snefru] => 6.6243891716003
[md2] => 15.983593940735
)
Test code and results
<?php
$arr_supported_algos = hash_algos();
$arr_proc_time = array();
$time0 = microtime(true);
// Get all supported hash algorithms (hash_algos() always returns an array).
if (!empty($arr_supported_algos)) {
    echo "------------------------- Supported hash algos: -------------------------\n";
    print_r($arr_supported_algos);
    echo "========================= Task begin: =========================\n";
    foreach ($arr_supported_algos as $algo) {
        // Sample input. Note: true is coerced to the prefix "1" here;
        // use uniqid('', true) if extra entropy is wanted instead.
        $str_tmp = uniqid(true) . microtime();
        $time_inner = microtime(true);
        for ($i = 0; $i < 2000000; $i++) {
            hash($algo, $str_tmp);
        }
        $used_seconds = microtime(true) - $time_inner;
        echo ">-- {$algo} processed in {$used_seconds} seconds.\n";
        $arr_proc_time[$algo] = $used_seconds;
    }
    echo "++++++++++++++++++++++ Summary (sorted by elapsed time): ++++++++++++++++++++++\n";
    // Sort by value in ascending order; see https://secure.php.net/manual/zh/array.sorting.php
    asort($arr_proc_time);
    print_r($arr_proc_time);
}
$time1 = microtime(true) - $time0;
echo "Finish. Total time: {$time1} seconds.\n";
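Conclusion
crc32 is noticeably faster than md5 (about 0.37 s versus 0.46 s for 2 million hashes in the run above).
In a separate test of mine, hashing 400,000 strings produced:
- 13 duplicates in the first run;
- 53 duplicates in the second run;
- 4 duplicates in the third run;
- 12 duplicates in the fourth run;
- 15 duplicates in the fifth run.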
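The code for that duplicate test is not shown, so the following is only a minimal sketch of how such a count could be done in PHP. It assumes crc32 as the hash and distinct input strings built from uniqid plus a counter. The function name `count_crc32_duplicates()` is made up for illustration.

```php
<?php
// Count how many of $n generated strings yield a crc32 value that was already seen.
function count_crc32_duplicates(int $n): int
{
    $seen = [];
    $duplicates = 0;
    for ($i = 0; $i < $n; $i++) {
        // Assumption: the appended counter keeps every input string distinct.
        $h = crc32(uniqid('', true) . $i);
        if (isset($seen[$h])) {
            $duplicates++;
        } else {
            $seen[$h] = true;
        }
    }
    return $duplicates;
}

echo count_crc32_duplicates(400000), " duplicates\n";
```

The result varies from run to run because the inputs are regenerated each time, which is consistent with the spread seen across the five runs above.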
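With 40 million records...
Sorry, that run never finished because PHP was too slow [facepalm]. The same test in Java gave the following results:
- 186031 duplicates in 168 seconds
- 185386 duplicates in 110 seconds (with the HashSet pre-sized to a capacity of 400000000)
- 185514 duplicates in 110 seconds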
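In addition, CRC32 can be output as a number. It has at most 10 decimal digits, since the maximum value is 4294967295, so in MySQL it needs an INT UNSIGNED or BIGINT column because a signed INT would overflow. With a duplicate rate of roughly four to five per thousand (about 186,000 out of 40 million), it should work well in the database. I will try it later.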
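As a sanity check on those duplicate counts, the birthday-problem estimate can be computed directly. This sketch assumes CRC32 behaves like a uniformly random 32-bit function over distinct inputs, which is only an approximation. The function name `expected_duplicates()` is made up for illustration.

```php
<?php
// Expected number of values that collide with an earlier one when $n uniformly
// random values are drawn from $m possible outputs:
//   n - m * (1 - e^(-n/m)), which is close to n^2 / (2m) when n is much smaller than m.
function expected_duplicates(float $n, float $m): float
{
    // expm1(x) = e^x - 1, which stays accurate for small x.
    return $n + $m * expm1(-$n / $m);
}

$m = 4294967296.0; // 2^32 possible CRC32 values

printf("400,000 inputs:    %.1f expected duplicates\n", expected_duplicates(4e5, $m));
printf("40,000,000 inputs: %.0f expected duplicates\n", expected_duplicates(4e7, $m));
```

Under that assumption the estimate comes out at roughly 18.6 duplicates for 400,000 inputs and roughly 185,700 for 40 million, in line with the measured counts above.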