zoukankan      html  css  js  c++  java
  • 关于敏感哈希

    Locality Sensitive Hashing

    nearest neighbor

    given a set P of n points in (cal R^d)

    Want to build a data structure to answer nearest neighbor queries

    Voronoi Diagram

    Build a Voronoi diagram & a point location data structure

    Voronoi图,又叫泰森多边形或Dirichlet图,它是由一组由连接两邻点直线的垂直平分线组成的连续多边形组成。

    img

    Curse of dimensionality

    In R2 the Voronoi diagram is of size O(n)

    Query takes O(logn) time

    In Rd the complexity is (O(n^{d/2}))

    Other techniques also scale bad with the dimension

    Locality Sensitive Hashing

    We will use a family of hash functions such that close points tend to hash to the same bucket.

    Put all points of P in their buckets, ideally we want the query q to find its nearest neighbor in its bucket

    Def (Charikar):

    A family H of functions is locality sensitive with respect to a similarity function 0 ≤ sim(p,q) ≤ 1 if

    [Pr[h(p) = h(q)] = sim(p,q) ]

    Example – Hamming Similarity

    Think of the points as strings of m bits and consider the similarity (sim(p,q) = 1-ham(p,q)/m)

    $H={h_i(p) = the i-th bit of p} $is locality sensitive wrt

    (sim(p,q) = 1-ham(p,q)/m)

    $Pr[h(p) = h(q)] = 1 – ham(p,q)/m $

    (1-sim(p,q) = ham(p,q)/m)

    Example - Jaacard

    Think of p and q as sets

    (sim(p,q) = jaccard(p,q) = |pcap q|/|pcup q|)

    (H={h_{pi}(p) = min in pi of the items in p})

    (Pr[h_{pi}(p) = h_{pi}(q)] = jaccard(p,q))

    Need to pick $pi $ from a min-wise ind. family of permutations

    Map to {0,1}

    Draw a function b to 0/1 from a pairwise ind. family B

    So: (h(p) eq h(q) Rightarrow b(h(p)) = b(h(q)) = 1/2)

    (H’={b(h()) | hin H, bin B})

    Another example (“simhash”)

    [H ={h_r(p) = 1 ,if r·p > 0, 0 otherwise | r is a random unit vector} ]

    How do we really use it?

    Reduce the number of false positives by concatenating hash function to get new hash functions (“signature”)

    [sig(p) = h1(p)h2(p) h3(p)h4(p)…… = 00101010 ]

    Very close documents are hashed to the same bucket or to ‘’close” buckets ((ham(sig(p),sig(q))) is small)

    See papers on removing almost duplicates…

    A theoretical result on NN

    Locality Sensitive Hashing

    Thm: If there exists a family H of hash functions such that

    (Pr[h(p) = h(q)] = sim(p,q))

    then (d(p,q) = 1-sim(p,q)) satisfies the triangle inequality

    Alternative Def (Indyk-Motwani):

    A family H of functions is ((r1 < r2,p1 > p2))-sensitive if

    • $d(p,q) ≤ r1 ightarrow Pr[h(p) = h(q)] ≥ p1 $

    • $d(p,q) ≥ r2 ightarrow Pr[h(p) = h(q)] ≤ p2 $

    If $d(p,q) = 1-sim(p,q) then this holds with p_1 = 1-r1 and p2=1-r_2 for any r_1, r_2 $

    (r,ε)-neighbor problem

    • If there is a neighbor p, such that d(p,q)(leq)r, return p’, s.t. d(p’,q) (leq) (1+ε)r.

    • If there is no p s.t. d(p,q)(leq)(1+ε)r return nothing.

    is the real req. since if we satisfy (1) only, we can satisfy (2) by filtering answers that are too far)

    • Lets construct a data structure that succeeds with constant probability
    • Focus on the hamming distance first

    NN using locality sensitive hashing

    Take a (r1 < r2, p1 > p2) = (r < (1+(epsilon))r, 1-r/m > 1-(1+(epsilon))r/m) - sensitive family

    If there is a neighbor at distance r we catch it with probability (p_1)so to guarantee catching it we need $1/p_1 $functions..

    But we also get false positives in our 1/p1 buckets, how many ? np2/p1

    注1:我以后描述点集的方式可以先整个Voronoi图

    注2:min-wise没看懂

    注3:concatenating连接

    注4:Alternative Def 另一种定义

  • 相关阅读:
    zabbix短信网关调用问题总结
    zabbix短信接口调用
    Windows Open with Sublime Text
    [转载]windows下安装Python虚拟环境virtualenvwrapper-win
    Resilio-sync auto restart
    django-orm-standalone
    RabbitMQ笔记
    RabbitMQ启动出错:- unable to connect to epmd on xxxx: timeout (timed out)
    [Python笔记]第十六篇:web框架之Tornado
    [前端笔记]第三篇:JavaScript
  • 原文地址:https://www.cnblogs.com/xiaofeisnote/p/13443669.html
Copyright © 2011-2022 走看看