zoukankan      html  css  js  c++  java
  • simhash

    1,SimHash

    https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html

    64位Hash为什么海明距离选3?

    http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf

    上链接右上precision-recall 曲线图,3是最平衡点,即不会错判太多重复,也不会漏掉很多。

    SimHash第一步需抽关键词(feature),并有权重(weight,这个一般是基于统计的?没有词库的默认就用1了)

    Feature算法,选 slide windows方法,golang代码如下:

    func (t *OverlappingStringTokeniser) Tokenise(input string) []string {
      var chunks []string
      inputLen := len(input)
      for position := 0; position < inputLen-int(t.chunkSize); position += int(t.chunkSize - t.overlapSize) {
        chunks = append(chunks, input[position:position+int(t.chunkSize)])
      }
      return chunks
    }
    

    1)简单

    2)不比其实算法效果差

    计算海明距离:

    // Compare calculates the Hamming distance between two 64-bit integers
    //
    // Currently, this is calculated using the Kernighan method [1]. Other methods
    // exist which may be more efficient and are worth exploring at some point
    //
    // [1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
    func Compare(a uint64, b uint64) uint8 {
    	v := a ^ b
    	var c uint8
    	for c = 0; v != 0; c++ {
    		v &= v - 1
    	}
    	return c
    }
    

      

  • 相关阅读:
    day_03、初识函数
    一、Python简介
    四、列表和元祖
    MinGW安装和使用
    python编码总结
    开博第一篇,学习markdown
    python 2.7.11安装pywin32过程中 停止工作问题
    python读写excel
    关于\r和\n的区别
    1.1 Python for macOS 安装与配置
  • 原文地址:https://www.cnblogs.com/gm-201705/p/8458859.html
Copyright © 2011-2022 走看看