  • [SimHash] the Hash-based Similarity Detection Algorithm

    The current information explosion has produced a growing number of applications that must handle large volumes of data. Much of this data is redundant, especially in mass media and in web crawling and analytics, and processing it wastes valuable resources (power, bandwidth, CPU, storage, etc.). This has increased interest in algorithms that process the input data in restricted ways.

    But traditional hash algorithms have two problems. First, they assume that the data fits in main memory, which is unreasonable for massive data such as multimedia data and web crawler/analytics repositories. Second, a traditional hash can only identify identical data: two inputs that differ slightly usually get unrelated hash values. This is what makes simhash important.

     

    Simhash takes 5 steps: tokenize, hash, weigh values, merge, and dimensionality reduction.

    • tokenize

      • tokenize your data and assign a weight to each token; the tokenize function and the weights depend on your business needs

    • hash (md5, SHA1)

      • calculate each token's hash value and convert it to binary (for example 101011)

    • weigh values

      • for each hash value, apply the token's weight w bit by bit: a 1 bit becomes +w and a 0 bit becomes -w, for example (101011) -> (w,-w,w,-w,w,w)

    • merge

      • add up the tokens' weighted values position by position to merge them into one vector V; for example, merging (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) gives (4+5 -4+-5 -4+5 4+-5 -4+5 4+5), which is (9 -9 1 -1 1 9)

    • Dimensionality Reduction

      • Finally, the sign of each element of V gives the corresponding bit of the final fingerprint (positive -> 1, otherwise 0); for example, (9 -9 1 -1 1 9) -> (1 0 1 0 1 1), so we get 101011 as the fingerprint.
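
    The five steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the whitespace tokenizer, the term-frequency weights, the choice of MD5 truncated to 64 bits, and the names `tokenize`, `token_hash`, and `simhash` are all assumptions made for the example. A sum of exactly 0 is mapped to bit 0 here by convention.

```python
import hashlib
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase words, weight = term frequency."""
    return list(Counter(text.lower().split()).items())


def token_hash(token, bits=64):
    """Hash a token with MD5 and keep the low `bits` bits as an integer."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(weighted_tokens, bits=64, hash_fn=token_hash):
    """Compute a fingerprint from an iterable of (token, weight) pairs."""
    v = [0] * bits  # merged vector V, one signed sum per bit position
    for token, weight in weighted_tokens:
        h = hash_fn(token, bits)
        for i in range(bits):
            # Weigh: a 1 bit contributes +weight, a 0 bit contributes -weight.
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # Dimensionality reduction: keep only the sign of each sum.
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


# Reproduce the 6-bit worked example with a toy hash table instead of MD5.
toy_hashes = {"a": 0b100101, "b": 0b101011}
fp = simhash([("a", 4), ("b", 5)], bits=6,
             hash_fn=lambda token, bits: toy_hashes[token])
print(format(fp, "06b"))  # 101011

# A 64-bit fingerprint of real text, using the hypothetical tokenizer.
doc_fp = simhash(tokenize("the quick brown fox"))
```

    Because the merge step is a plain sum, the fingerprint does not depend on the order of the tokens, only on which tokens occur and with what weights.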

    How to use SimHash fingerprints?

    Hamming distance can be used to measure the similarity between two pieces of data: calculate the Hamming distance between their 2 fingerprints. The smaller the distance, the more similar the data.

    Based on my experience, for 64-bit SimHash values with carefully chosen weights, the distances between similar data often differ appreciably in magnitude from those between dissimilar data.

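    How to calculate Hamming distance:

      XOR: the result bit is 1 only when the two input bits differ, otherwise it is 0. XOR the two binary values and count the 1 bits in the result; that count is the Hamming distance.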
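
    The XOR-and-count procedure can be sketched in Python like this. The function names and the `max_distance=3` threshold are assumptions for illustration only; a suitable threshold depends on your data and weights and has to be tuned.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    # XOR leaves a 1 exactly where the bits differ; count those 1s.
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance=3):
    """Hypothetical helper: treat fingerprints within max_distance bits as similar."""
    return hamming_distance(a, b) <= max_distance


# 101011 XOR 100101 = 001110, which contains three 1 bits.
print(hamming_distance(0b101011, 0b100101))  # 3
```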

     

      

    The SimHash algorithm was introduced by Charikar and is patented by Google.

    simhash 0.1.0 : Python Package Index

  • Original article: https://www.cnblogs.com/scottgu/p/5542184.html