1,SimHash
https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html
64位Hash为什么海明距离选3?
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf
上链接右上precision-recall 曲线图,3是最平衡点,即不会错判太多重复,也不会漏掉很多。
SimHash第一步需抽关键词(feature),并有权重(weight,这个一般是基于统计的?没有词库的默认就用1了)
Feature算法,选 slide windows方法,golang代码如下:
func (t *OverlappingStringTokeniser) Tokenise(input string) []string { var chunks []string inputLen := len(input) for position := 0; position < inputLen-int(t.chunkSize); position += int(t.chunkSize - t.overlapSize) { chunks = append(chunks, input[position:position+int(t.chunkSize)]) } return chunks }
1)简单
2)不比其实算法效果差
计算海明距离:
// Compare calculates the Hamming distance between two 64-bit integers // // Currently, this is calculated using the Kernighan method [1]. Other methods // exist which may be more efficient and are worth exploring at some point // // [1] http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan func Compare(a uint64, b uint64) uint8 { v := a ^ b var c uint8 for c = 0; v != 0; c++ { v &= v - 1 } return c }