  • How the Bloom filter works

    Use case: a Bloom filter is a deduplication algorithm for very large data sets.

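    Advantages: high space efficiency; it stores only hash bits rather than the data itself, so the original data is not exposed.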

    Disadvantages: the more elements are inserted, the higher the false-positive rate; elements cannot be deleted.

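    Typical application: protecting against cache penetration, that is, repeated requests for keys that do not exist (with data volumes in the billions, holding every key in a cache such as Redis is no longer a good fit).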
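
    As a rough illustration of that application, the sketch below puts a small Bloom filter in front of a lookup so that keys the filter has never seen are rejected without touching the backing store. Everything here is an assumption made for illustration: the class name CacheGuardSketch, the HashMap standing in for a real database, the filter size, and the three seeds.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Bloom filter guarding a slow backing store.
class CacheGuardSketch {
    private static final int BITS = 1 << 16;       // illustrative size, a power of two
    private static final int[] SEEDS = {3, 5, 7};  // illustrative seeds

    private final BitSet bits = new BitSet(BITS);
    private final Map<String, String> database = new HashMap<>(); // stand-in for the real store
    int databaseLookups = 0;                       // counts how often the store is queried

    // Seeded polynomial hash, masked into [0, BITS).
    private static int index(String key, int seed) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = h * seed + key.charAt(i);
        }
        return h & (BITS - 1);
    }

    // Writes go to the store and also mark the key in the filter.
    public void put(String key, String value) {
        database.put(key, value);
        for (int seed : SEEDS) {
            bits.set(index(key, seed));
        }
    }

    // False means "definitely absent"; true means "possibly present".
    private boolean mightContain(String key) {
        for (int seed : SEEDS) {
            if (!bits.get(index(key, seed))) {
                return false;
            }
        }
        return true;
    }

    public String get(String key) {
        if (!mightContain(key)) {
            return null;          // rejected by the filter, the store is never queried
        }
        databaseLookups++;        // only possibly-present keys reach the store
        return database.get(key);
    }

    public static void main(String[] args) {
        CacheGuardSketch store = new CacheGuardSketch();
        store.put("user:1", "Alice");
        System.out.println(store.get("user:1"));    // prints Alice
        System.out.println(store.get("user:999"));  // prints null
        System.out.println("database lookups: " + store.databaseLookups);
    }
}
```

    A false positive in this arrangement only costs one unnecessary lookup in the store; the answer returned to the caller is still correct.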

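    Overall idea, using URL filtering and deduplication as an example:

    1. Create an empty bitmap (all bits set to 0).

    2. Hash the URL with several different hash functions, typically 8.

    3. Set the bit at each resulting index in the bitmap to 1.

    A second URL is processed in exactly the same way.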

    4. Lookup rule: hash the URL with HashA, HashB and HashC to get the indexes r; if BitMap[r] == 1 for every one of them, the URL is considered a duplicate.
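
    Steps 1 to 4 can be sketched with a deliberately tiny bitmap. This is only an illustration under assumptions: the class name UrlDedupSketch, the 16-bit size, and the three seeds standing in for HashA, HashB and HashC are not from the original demo, and the check and the insert are merged into a single pass for brevity.

```java
import java.util.BitSet;

// Hypothetical sketch of steps 1-4 with a 16-bit bitmap and three hash functions.
class UrlDedupSketch {
    static final int SIZE = 16;              // tiny on purpose, must be a power of two
    static final int[] SEEDS = {3, 5, 7};    // stand-ins for HashA, HashB, HashC

    final BitSet bitmap = new BitSet(SIZE);  // step 1: empty bitmap, all bits 0

    // Step 2: one seeded hash, masked into [0, SIZE).
    static int hash(String url, int seed) {
        int h = 0;
        for (int i = 0; i < url.length(); i++) {
            h = h * seed + url.charAt(i);
        }
        return h & (SIZE - 1);
    }

    // Step 4 then step 3: report whether every bit was already 1, and set the bits.
    boolean seenBefore(String url) {
        boolean allSet = true;
        for (int seed : SEEDS) {
            int r = hash(url, seed);
            if (!bitmap.get(r)) {
                allSet = false;   // at least one bit is 0: definitely a new URL
            }
            bitmap.set(r);        // record the URL for later lookups
        }
        return allSet;
    }

    public static void main(String[] args) {
        UrlDedupSketch filter = new UrlDedupSketch();
        String url = "http://example.com/a";
        System.out.println(filter.seenBefore(url));  // prints false: the bitmap was empty
        System.out.println(filter.seenBefore(url));  // prints true: all of its bits are now set
    }
}
```

    With only 16 bits the bitmap fills up after a handful of URLs, which makes it easy to observe unrelated URLs being reported as duplicates.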

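    False positives: a new URL may hash to indexes that other URLs have already set, for example 5, 9 and 12, and is then wrongly treated as a duplicate. One workaround is to keep a whitelist of known false positives.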
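
    How often this happens can be estimated with the standard approximation p ≈ (1 - e^(-k*n/m))^k, where m is the number of bits, k the number of hash functions and n the number of inserted elements. The sketch below evaluates it for the sizes used in the demo that follows (m = 2 << 24, k = 8). The formula assumes independent, uniformly distributed hash functions, so simple seeded hashes may well do worse; the class and method names are hypothetical.

```java
// Hypothetical helper: textbook estimate of a Bloom filter's false-positive rate.
class FalsePositiveEstimate {

    // m = number of bits, k = number of hash functions, n = number of inserted elements.
    static double estimate(long m, int k, long n) {
        // Expected fraction of bits that are 1 after n insertions.
        double fractionSet = 1.0 - Math.exp(-(double) k * n / m);
        // A false positive needs all k probed bits to be 1.
        return Math.pow(fractionSet, k);
    }

    public static void main(String[] args) {
        long m = 2L << 24;  // same bit count as DEFAULT_SIZE in the demo
        int k = 8;          // same number of hash functions as the demo
        for (long n : new long[]{100_000L, 1_000_000L, 10_000_000L}) {
            System.out.printf("n = %,d -> estimated false-positive rate %.6g%n",
                    n, estimate(m, k, n));
        }
    }
}
```

    Under these assumptions the estimate is a few in a million at one million elements and climbs to roughly 46% at ten million, which is the "more data, more false positives" drawback in numbers.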

    demo:

    package com.example.demo.bloomFilter;
    
    import java.util.BitSet;
    
    public class BloomFilter {
        /**
         * Default size in bits: 2 * 2^24 = 2^25 (33,554,432 bits, about 4 MB).
         */
        private static final int DEFAULT_SIZE = 2 << 24;
    
        /**
         * Seeds are prime numbers to reduce collisions; in binary, for example:
         * 3: 0011
         * 5: 0101
         */
        private static final int[] seeds = new int[]{3, 5, 7, 11, 13, 17, 19, 23};
        private static final Hash[] hashAr = new Hash[seeds.length];
    
        static {
            for (int i = 0; i < seeds.length; i++) {
                hashAr[i] = new Hash(seeds[i]);
            }
        }
    
        /**
         * Bit set that records the results of the hash functions.
         */
        private final BitSet bitSet = new BitSet(DEFAULT_SIZE);
    
        /**
         * Hashes the string with every hash function and sets the resulting bits in bitSet.
         *
         * @param content the string to add
         */
        public void add(String content) {
            for (Hash h : hashAr) {
                bitSet.set(h.getHash(content));
            }
        }
    
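        /**
         * Checks whether the string may already have been added.
         *
         * @param content the string to look up
         * @return false if the string was definitely never added; true if it probably was
         */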
        public boolean contains(String content) {
            boolean have = true;
            for (Hash hash : hashAr) {
                have &= bitSet.get(hash.getHash(content));
            }
            return have;
        }
    
        public static void main(String[] args) {
            String email = "xiaozhuanfeng@126.com";
            BloomFilter bloomDemo = new BloomFilter();
            System.out.println(email + " is in the list: " + bloomDemo.contains(email));
            bloomDemo.add(email);
            System.out.println(email + " is in the list: " + bloomDemo.contains(email));
            email = "xiaozhuanfeng@163.com";
            System.out.println(email + " is in the list: " + bloomDemo.contains(email));
        }
    
        private static class Hash {
            private int seed = 0;
    
            public Hash(int seed) {
                this.seed = seed;
            }
    
            public int getHash(String string) {
                int val = 0;
                int len = string.length();
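                for (int i = 0; i < len; i++) {
                    // Multiply by the prime seed, then add the character code.
                    val = val * seed + string.charAt(i);
                }
    
                // DEFAULT_SIZE is a power of two, so masking with (DEFAULT_SIZE - 1) maps
                // the hash, even a negative one after int overflow, into [0, DEFAULT_SIZE).
                // Note on & versus &&: && short-circuits (the second operand is skipped when
                // the first is false), while & and &= always evaluate both sides; keep this
                // in mind for patterns such as val &= function().
                return val & (DEFAULT_SIZE - 1);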
            }
        }
    }
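
    The remark in getHash about & versus && also applies to contains, which combines its results with &=. A minimal sketch of the difference follows; the class ShortCircuitSketch and its probe method are hypothetical and exist only to count evaluations.

```java
// Hypothetical sketch: &= always evaluates its right-hand side, && may skip it.
class ShortCircuitSketch {
    static int calls = 0;

    // Stand-in for a check such as bitSet.get(...); counts how often it runs.
    static boolean probe(boolean result) {
        calls++;
        return result;
    }

    static int callsWithBitwiseAnd() {
        calls = 0;
        boolean have = false;
        have &= probe(true);         // probe runs even though have is already false
        return calls;
    }

    static int callsWithLogicalAnd() {
        calls = 0;
        boolean have = false;
        have = have && probe(true);  // probe is skipped because have is false
        return calls;
    }

    public static void main(String[] args) {
        System.out.println("&= calls: " + callsWithBitwiseAnd());  // prints &= calls: 1
        System.out.println("&& calls: " + callsWithLogicalAnd());  // prints && calls: 0
    }
}
```

    For contains this means all hash functions are evaluated even after the first unset bit is found; the result is the same, only the work differs.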

    Reference:

    https://mp.weixin.qq.com/s?__biz=MzIxMjE5MTE1Nw==&mid=2653191316&idx=1&sn=6b407704c99bda58440e97a2d6dd6ee9&chksm=8c990e4ebbee8758bf207b7fed8267bc1bda957f5864c00b467e2de6f0ae93563740b5527f25&mpshare=1&scene=1&srcid=0927TOixl26f0xogheOaXM1x&key=c38ae561692275b4c85347d76b993d2eeb8bdeaea465676770fb28835462fbc7d92f66816cbf4adb29af15b479e88b00109901f88a846c4c5c921bd228fd1dfa37cdee015d81561d5052c7f31230447c&ascene=0&uin=MjE4MTczNDcwMA%3D%3D&devicetype=iMac+MacBookAir6%2C1+OSX+OSX+10.12.5+build(16F73)&version=12020

  • Original article: https://www.cnblogs.com/xiaozhuanfeng/p/10858426.html