The simhash algorithm: deduplicating tens of millions of records

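References on the simhash algorithm and how it works:

An easy-to-follow explanation of the simhash algorithm (hashing): https://blog.csdn.net/le_le_name/article/details/51615931

A brief introduction to the simhash algorithm and its principles: https://blog.csdn.net/lengye7/article/details/79789206

Deduplicating massive text collections with SimHash: https://www.cnblogs.com/maybe2030/p/5203186.html#_label3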

Python implementations:

Comparing text similarity with simhash in Python (full code): https://blog.csdn.net/weixin_43750200/article/details/84789361

A Python implementation of simhash: https://blog.csdn.net/gzt940726/article/details/80460419

Using the Python simhash library

For details, see: https://leons.im/posts/a-python-implementation-of-simhash-algorithm/


(1) Inspecting a simhash value

>>> from simhash import Simhash
>>> print('%x' % Simhash(u'I am very happy'.split()).value)
9f8fd7efdb1ded7f

Simhash() takes a sequence of tokens, also called a feature sequence.
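
To make the idea of a feature sequence concrete, here is a minimal sketch of how a simhash-style fingerprint can be built from tokens. It is not the library's exact implementation: the function name simhash_sketch, the use of MD5, and the equal weight given to every token are assumptions made for illustration, so its output should not be expected to match the value printed above.

```python
import hashlib


def simhash_sketch(tokens, bits=64):
    """Compute a simhash-style fingerprint from an iterable of string tokens.

    Illustrative only: MD5 and equal token weights are assumptions.
    """
    votes = [0] * bits
    for token in tokens:
        # Hash each token to a large integer.
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            # Every token votes +1 or -1 on each bit position.
            votes[i] += 1 if (h >> i) & 1 else -1

    # A bit of the fingerprint is 1 where the votes are positive.
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


fp_happy = simhash_sketch("I am very happy".split())
fp_sad = simhash_sketch("I am very sad".split())
print("%x" % fp_happy)
print("%x" % fp_sad)
```

Because most tokens are shared between the two sentences, most bit positions receive the same votes, which is why similar texts tend to end up with fingerprints that differ in only a few bits.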

 

(2) Computing the distance between two simhash values

>>> hash1 = Simhash(u'I am very happy'.split())
>>> hash2 = Simhash(u'I am very sad'.split())
>>> print(hash1.distance(hash2))
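
The distance between two simhash values is the Hamming distance, that is, the number of bit positions in which the two fingerprints differ. A minimal sketch on plain integers, where the helper name hamming_distance is hypothetical and not part of the library:

```python
def hamming_distance(a, b):
    """Return the number of bit positions at which integers a and b differ."""
    # XOR leaves a 1 exactly where the two values disagree.
    return bin(a ^ b).count("1")


# 0b1011 and 0b0010 disagree in bit 0 and bit 3.
print(hamming_distance(0b1011, 0b0010))  # prints 2
```

A small distance means the two texts share most of their features; a distance of 0 means the fingerprints are identical.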


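(3) Building an index

simhash is used for deduplication. Comparing simhash values pair by pair does not scale once the dataset gets large. There is a dedicated data structure for this; see: http://www.cnblogs.com/maybe2030/p/5203186.html#_label4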
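
One common way to avoid pairwise comparison relies on the pigeonhole principle: if a 64-bit fingerprint is split into k + 1 blocks, two fingerprints that differ in at most k bits must agree exactly on at least one block. The sketch below illustrates that idea only; the class name BlockIndex and its methods are hypothetical, and it is not claimed to be how SimhashIndex is implemented internally.

```python
from collections import defaultdict


class BlockIndex:
    """Toy near-duplicate index over f-bit fingerprints (illustrative only)."""

    def __init__(self, k=3, f=64):
        self.k = k
        self.f = f
        n = k + 1
        # Bit offsets that split the f bits into k + 1 contiguous blocks.
        self.offsets = [f * i // n for i in range(n + 1)]
        self.buckets = defaultdict(set)

    def _keys(self, fp):
        # One bucket key per block: (block number, block value).
        for i in range(len(self.offsets) - 1):
            lo, hi = self.offsets[i], self.offsets[i + 1]
            yield (i, (fp >> lo) & ((1 << (hi - lo)) - 1))

    def add(self, obj_id, fp):
        for key in self._keys(fp):
            self.buckets[key].add((obj_id, fp))

    def near_dups(self, fp):
        found = set()
        for key in self._keys(fp):
            # Only fingerprints sharing a block are candidates.
            for obj_id, other in self.buckets[key]:
                if bin(fp ^ other).count("1") <= self.k:
                    found.add(obj_id)
        return sorted(found)


index = BlockIndex(k=3)
index.add("a", 0xFFFF0000FFFF0000)
index.add("b", 0xFFFF0000FFFF0007)  # 3 bits away from "a"
index.add("c", 0x0000FFFF0000FFFF)  # every bit differs from "a"
print(index.near_dups(0xFFFF0000FFFF0001))  # prints ['a', 'b']
```

A query only checks the candidates that share a block with it, so "c" is never compared at all. The trade-off is that a larger k means more, shorter blocks, and therefore more candidates per bucket.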

from simhash import Simhash, SimhashIndex

# Build the index
data = {
    u'1': u'How are you I Am fine . blar blar blar blar blar Thanks .'.lower().split(),
    u'2': u'How are you i am fine .'.lower().split(),
    u'3': u'This is simhash test .'.lower().split(),
}
objs = [(doc_id, Simhash(tokens)) for doc_id, tokens in data.items()]
# k is the tolerance; the larger k is, the more similar texts are returned
index = SimhashIndex(objs, k=10)

# Query for near duplicates
s1 = Simhash(u'How are you . blar blar blar blar blar Thanks'.lower().split())
print(index.get_near_dups(s1))

# Add a new entry to the index
index.add(u'4', s1)

Original article: https://www.cnblogs.com/nyist-xsk/p/13453652.html