  • elasticsearch min_hash application analysis

    Requirement: similar-text search.

    Crawlers commonly use simhash for page deduplication, so the first idea was the simhash algorithm.

    However, using simhash on the existing data set (an elasticsearch cluster) is expensive. The simhash value itself is easy to compute, whether through an external API or a custom es token filter; the real difficulty is the query side and the similarity computation, which would require rewriting elasticsearch's scoring logic around simhash distance.

    If keyword weights are ignored, minhash and simhash give similar results.

    The current version of elasticsearch (5.5) supports minhash natively, but the official documentation is very brief:

    https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-minhash-tokenfilter.html

    So, working from the minhash algorithm and the es source code, I investigated and tested minhash; for simplicity, the ik analyzer is used for Chinese tokenization.

    There are four settings:

    hash_count: The number of hashes to hash the token stream with. Defaults to 1.

    bucket_count: The number of buckets to divide the minhashes into. Defaults to 512.

    hash_set_size: The number of minhashes to keep per bucket. Defaults to 1.

    with_rotation: Whether or not to fill empty buckets with the value of the first non-empty bucket to its circular right. Only takes effect if hash_set_size is equal to one. Defaults to true if bucket_count is greater than one, else false.

    The code is a three-level structure (too lazy to draw a diagram):

    [ hashcount_0:[
       bucket_count_0:[hash_set_size_0,hash_set_size_1...],
       bucket_count_1:[hash_set_size_0,hash_set_size_1...],
       ...
       bucket_count_511:[hash_set_size_0,hash_set_size_1...]
     ],
     hashcount_1:[
       bucket_count_0:[hash_set_size_0,hash_set_size_1...],
       bucket_count_1:[hash_set_size_0,hash_set_size_1...],
       ...
       bucket_count_511:[hash_set_size_0,hash_set_size_1...]
     ]
    ]
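
    To make the shape concrete, here is a minimal Java sketch of the same three-level layout (the class and variable names are my own, for illustration; the actual Lucene filter keeps LongPair hashes in tree-set-like structures per bucket):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    // Hypothetical illustration: hash_count lists, each holding bucket_count
    // buckets, each bucket keeping up to hash_set_size smallest hash values.
    class MinHashLayout {
      static List<List<TreeSet<Long>>> build(int hashCount, int bucketCount) {
        List<List<TreeSet<Long>>> minHashSets = new ArrayList<>(hashCount);
        for (int i = 0; i < hashCount; i++) {
          List<TreeSet<Long>> buckets = new ArrayList<>(bucketCount);
          for (int j = 0; j < bucketCount; j++) {
            buckets.add(new TreeSet<>());   // one ordered set per bucket
          }
          minHashSets.add(buckets);
        }
        return minHashSets;
      }
    }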

    Source: https://my.oschina.net/pathenon/blog/65210

           Method 1: use multiple hash functions

            To estimate the similarity of sets A and B from minimum hash values, choose a fixed number of hash functions, say K. Hash the elements of A and of B with each of the K functions and keep, per function, the minimum value, which gives K minimums per set, e.g. Min(A)_K = {a1, a2, ..., aK} and Min(B)_K = {b1, b2, ..., bK}.
            The similarity of A and B is then |Min(A)_K ∩ Min(B)_K| / |Min(A)_K ∪ Min(B)_K|, i.e. the proportion of elements that Min(A)_K and Min(B)_K have in common.

           Method 2: use a single hash function

           The first method has an obvious drawback: it is computationally expensive. How does a single hash function avoid that? As follows:
           Earlier, h_min(S) was defined as the element of set S with the smallest hash value. Analogously, define h_min^K(S) as the K elements of S with the smallest hash values. Then each set only needs to be hashed once, taking the K smallest values. The similarity of two sets A and B is the ratio of the size of the intersection to the size of the union of the K smallest elements of A and the K smallest elements of B.
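
    A small self-contained sketch of the second method (single hash function, keep the K smallest values per set); the class name and the use of String.hashCode as a stand-in hash function are my own assumptions, not the es implementation:

    import java.util.HashSet;
    import java.util.Set;
    import java.util.TreeSet;

    class SingleHashMinHash {
      // Keep the K smallest hash values of a token set (one hash function).
      static TreeSet<Integer> minK(Set<String> tokens, int k) {
        TreeSet<Integer> smallest = new TreeSet<>();
        for (String t : tokens) {
          smallest.add(t.hashCode());      // stand-in hash function
          if (smallest.size() > k) {
            smallest.pollLast();           // drop the largest, keep the K smallest
          }
        }
        return smallest;
      }

      // Estimated Jaccard similarity = |minK(A) ∩ minK(B)| / |minK(A) ∪ minK(B)|.
      static double similarity(Set<String> a, Set<String> b, int k) {
        Set<Integer> ma = minK(a, k);
        Set<Integer> mb = minK(b, k);
        Set<Integer> union = new HashSet<>(ma);
        union.addAll(mb);
        Set<Integer> inter = new HashSet<>(ma);
        inter.retainAll(mb);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
      }
    }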
    for (int i = 0; i < hashCount; i++) {
      byte[] bytes = current.getBytes("UTF-16LE");
      LongPair hash = new LongPair();
      murmurhash3_x64_128(bytes, 0, bytes.length, 0, hash);
      LongPair rehashed = combineOrdered(hash, getIntHash(i));
      minHashSets.get(i).get((int) ((rehashed.val2 >>> 32) / bucketSize)).add(rehashed);
    }

    es combines both methods.

    hash_count is the number of hash functions. getIntHash derives a different offset from i, and combining the base hash with these offsets gives the effect of hashing with different hash functions without actually recomputing the hash several times, which cuts the amount of computation. Java developers will recognize this kind of optimization of reusing bits of an already computed hash: the two-level segment/bucket hashing of ConcurrentHashMap before JDK 1.7 works similarly. (As an aside, for hashing document fields es ships an official MurmurHash plugin, which performs much better than a traditional hash.)
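
    The same trick in a generic form (my own illustration, not Lucene's combineOrdered/getIntHash): compute one base hash per token, then derive the i-th "hash function" by folding a per-function offset into it and re-mixing, so the minimum element can differ for each derived function.

    // Illustration only: simulate k hash functions from one computed base hash.
    class DerivedHashes {
      static long[] derive(long baseHash, int k) {
        long[] out = new long[k];
        for (int i = 0; i < k; i++) {
          long h = baseHash ^ (i * 0x9E3779B97F4A7C15L); // per-function offset
          // splitmix64 finalizer: re-mix so each derived value behaves like an
          // independent hash rather than a shifted copy of the base hash
          h ^= (h >>> 30); h *= 0xBF58476D1CE4E5B9L;
          h ^= (h >>> 27); h *= 0x94D049BB133111EBL;
          h ^= (h >>> 31);
          out[i] = h;
        }
        return out;
      }
    }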

    bucket_count is the number of minimum hash values each hash function keeps over all elements of the set; in other words, bucket_count is the K in h_min^K(S).

    The total number of minimum hash values is hash_count * bucket_count (with the default hash_set_size of 1); with the defaults that is 1 * 512 = 512 values.

    When the filter is initialized, it constructs this three-level structure.

    while (input.incrementToken()) {
      found = true;
      String current = new String(termAttribute.buffer(), 0, termAttribute.length());

      for (int i = 0; i < hashCount; i++) {
        byte[] bytes = current.getBytes("UTF-16LE");
        LongPair hash = new LongPair();
        murmurhash3_x64_128(bytes, 0, bytes.length, 0, hash);
        // mix in a per-function offset to simulate the i-th hash function
        LongPair rehashed = combineOrdered(hash, getIntHash(i));
        // route the rehashed value into one of bucket_count buckets
        minHashSets.get(i).get((int) ((rehashed.val2 >>> 32) / bucketSize)).add(rehashed);
      }
      endOffset = offsetAttribute.endOffset();
    }
    exhausted = true;
    input.end();
    // We need the end state so an underlying shingle filter can have its state restored correctly.
    endState = captureState();
    if (!found) {
      return false;
    }

    It iterates over all the tokens coming from the upstream filter and hashes each one into the three-level structure.

    while (hashPosition < hashCount) {
      if (hashPosition == -1) {
        hashPosition++;
      } else {
        while (bucketPosition < bucketCount) {
          if (bucketPosition == -1) {
            bucketPosition++;
          } else {
            // take the smallest remaining hash out of the current bucket
            LongPair hash = minHashSets.get(hashPosition).get(bucketPosition).pollFirst();
            if (hash != null) {
              termAttribute.setEmpty();
              if (hashCount > 1) {
                // prefix the term with the hash-function index
                termAttribute.append(int0(hashPosition));
                termAttribute.append(int1(hashPosition));
              }
              long high = hash.val2;
              termAttribute.append(long0(high));
              termAttribute.append(long1(high));
              termAttribute.append(long2(high));
              termAttribute.append(long3(high));
              long low = hash.val1;
              termAttribute.append(long0(low));
              termAttribute.append(long1(low));
              if (hashCount == 1) {
                termAttribute.append(long2(low));
                termAttribute.append(long3(low));
              }
              posIncAttribute.setPositionIncrement(positionIncrement);
              offsetAttribute.setOffset(0, endOffset);
              typeAttribute.setType(MIN_HASH_TYPE);
              posLenAttribute.setPositionLength(1);
              // emit one minhash term per call
              return true;
            } else {
              bucketPosition++;
            }
          }
        }
        bucketPosition = -1;
        hashPosition++;
      }
    }

    It then loops over the three-level structure, pulling out the minhash values and emitting them as terms to the next filter.

    with_rotation controls how empty buckets are filled.

    For example, with bucket_count = 512 and with_rotation = false, 65 terms produce the following min_hashes:

    "0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64"

    With with_rotation set to true, a possible result is:

    "65,1,1,1,1,1,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,8,8,8,9,9,10,11,12,13,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,15,15,16,16,17,17,17,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,18,19,19,19,20,20,20,20,21,21,22,22,22,22,22,22,22,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,25,25,25,26,26,26,26,26,26,27,27,27,27,27,27,28,28,28,28,28,28,28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,32,32,33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,35,36,36,36,36,37,37,37,37,38,38,39,39,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,43,43,44,44,44,44,45,45,46,47,47,48,48,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,53,53,53,53,53,53,54,54,54,54,55,55,55,55,55,55,55,55,55,55,55,55,55,56,56,56,56,56,57,57,57,57,57,58,58,59,60,60,60,60,60,60,61,61,61,61,62,62,62,62,62,62,62,62,63,63,63,63,63,63,63,64,64,64,64,64,64,64,65,65,65,65,65,65"

    That is all for the code analysis of the elasticsearch (strictly speaking, Lucene) min_hash filter; next up, application testing.
