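Scenario: Bloom filter, a deduplication algorithm for large-scale data
Advantages: high space efficiency; it stores bit flags rather than the data itself, which also helps keep the data confidential.
Disadvantages: the more elements are inserted, the higher the false-positive rate; elements cannot be deleted, because clearing a bit could also affect other elements that share it.
Use case: guarding against cache penetration, i.e. lookups for keys that do not exist (with data on the order of billions of entries, keeping every key in a cache such as Redis is not really practical).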
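To make the link between insert count and false-positive rate concrete, here is a minimal sketch that evaluates the standard approximation p ≈ (1 - e^(-k·n/m))^k for m bits, k hash functions and n inserted elements. The class name `FalsePositiveEstimate` is made up for illustration, and the formula assumes independent, uniformly distributed hash functions, which the simple hashes in the demo only approximate.

```java
class FalsePositiveEstimate {

    /** Approximate false-positive rate for m bits, k hash functions and n inserted elements. */
    static double rate(long m, int k, long n) {
        double fractionOfBitsSet = 1.0 - Math.exp(-(double) k * n / m);
        return Math.pow(fractionOfBitsSet, k);
    }

    public static void main(String[] args) {
        long m = 2L << 24; // 33,554,432 bits, the same size as DEFAULT_SIZE in the demo
        int k = 8;         // 8 hash functions, as in the demo
        long[] counts = {1_000_000L, 5_000_000L, 20_000_000L, 1_000_000_000L};
        for (long n : counts) {
            System.out.printf("n=%d -> estimated false-positive rate %.6f%n", n, rate(m, k, n));
        }
    }
}
```

The estimate climbs toward 1 as n grows, so for billions of keys the bitmap has to be sized far larger than the demo's default.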
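Overall idea: as an example, deduplicate URLs.
1. Create an empty bitmap (all bits 0).
2. Hash the URL several times with different hash functions, typically 8 times.
3. Set the bit at each hash result in the bitmap.
A second URL is handled in the same way.
4. Membership test: hash the URL with HashA, HashB and HashC to get the positions r; if BitMap[r] == 1 for every position, the URL is treated as a duplicate.
A false positive occurs when the hashes of a new URL happen to land only on positions that are already set, for example 5, 9 and 12, so the new URL is wrongly treated as a duplicate. A whitelist of known false positives can be kept to handle such cases.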
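The whitelist idea can be sketched as follows. This is only one possible reading of it: the class `WhitelistedBloomFilter` and the methods `markFalsePositive` and `isDuplicate` are hypothetical names, and the bitmap is shrunk to 16 bits purely so that a collision is easy to trigger. With a 16-bit mask, "a.com" and "q.com" hash to the same positions because 'a' and 'q' differ by exactly 16.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

class WhitelistedBloomFilter {
    private static final int SIZE = 16;           // tiny on purpose, so collisions are easy to show
    private static final int[] SEEDS = {3, 5, 7}; // multiplicative hash seeds

    private final BitSet bits = new BitSet(SIZE);
    private final Set<String> whitelist = new HashSet<>(); // URLs known to be false positives

    private static int hash(String s, int seed) {
        int val = 0;
        for (int i = 0; i < s.length(); i++) {
            val = val * seed + s.charAt(i);
        }
        return val & (SIZE - 1);
    }

    /** Records the URL; a URL that is really added no longer belongs on the whitelist. */
    public void add(String url) {
        whitelist.remove(url);
        for (int seed : SEEDS) {
            bits.set(hash(url, seed));
        }
    }

    /** Plain Bloom filter test: true means "possibly seen", false means "definitely not seen". */
    public boolean mightContain(String url) {
        for (int seed : SEEDS) {
            if (!bits.get(hash(url, seed))) {
                return false;
            }
        }
        return true;
    }

    /** Registers a URL that was verified by other means to be a false positive. */
    public void markFalsePositive(String url) {
        whitelist.add(url);
    }

    /** Duplicate check that consults the whitelist before trusting the filter. */
    public boolean isDuplicate(String url) {
        return !whitelist.contains(url) && mightContain(url);
    }

    public static void main(String[] args) {
        WhitelistedBloomFilter filter = new WhitelistedBloomFilter();
        filter.add("a.com");
        System.out.println("q.com before whitelisting: " + filter.isDuplicate("q.com"));
        filter.markFalsePositive("q.com");
        System.out.println("q.com after whitelisting: " + filter.isDuplicate("q.com"));
    }
}
```

How a false positive gets detected in the first place (for instance by checking the authoritative store) is outside this sketch; the whitelist only remembers the verdict.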
Demo:
package com.example.demo.bloomFilter;

import java.util.BitSet;

public class BloomFilter {

    /** Default bitmap size: 2 << 24, i.e. 2 * Math.pow(2, 24) bits. */
    private static final int DEFAULT_SIZE = 2 << 24;

    /**
     * Hash seeds. All are prime, to reduce collisions, e.g.
     * 3: 0011
     * 5: 0101
     * (The original list contained 9, which is not prime; 23 replaces it.)
     */
    private static final int[] seeds = {3, 5, 7, 11, 13, 17, 19, 23};

    private static final Hash[] hashAr = new Hash[seeds.length];

    static {
        for (int i = 0; i < seeds.length; i++) {
            hashAr[i] = new Hash(seeds[i]);
        }
    }

    /** Bits set by the hash functions. */
    private final BitSet bitSet = new BitSet(DEFAULT_SIZE);

    /**
     * Hashes the string with every hash function and sets the resulting bits.
     *
     * @param content the string to record
     */
    public void add(String content) {
        for (Hash h : hashAr) {
            bitSet.set(h.getHash(content));
        }
    }

    /**
     * Membership test.
     *
     * @param content the string to look up
     * @return true if the string may have been added, false if it definitely was not
     */
    public boolean contains(String content) {
        boolean have = true;
        for (Hash hash : hashAr) {
            // Note: unlike &&, the & in &= does not short-circuit,
            // so every hash is evaluated even after a miss.
            have &= bitSet.get(hash.getHash(content));
        }
        return have;
    }

    public static void main(String[] args) {
        String email = "xiaozhuanfeng@126.com";
        BloomFilter bloomDemo = new BloomFilter();
        System.out.println(email + " in the list: " + bloomDemo.contains(email));
        bloomDemo.add(email);
        System.out.println(email + " in the list: " + bloomDemo.contains(email));
        email = "xiaozhuanfeng@163.com";
        System.out.println(email + " in the list: " + bloomDemo.contains(email));
    }

    private static class Hash {

        private final int seed;

        public Hash(int seed) {
            this.seed = seed;
        }

        public int getHash(String string) {
            int val = 0;
            int len = string.length();
            for (int i = 0; i < len; i++) {
                // Multiply by the prime seed and add the character code.
                val = val * seed + string.charAt(i);
            }
            // DEFAULT_SIZE is a power of two, so masking with (DEFAULT_SIZE - 1)
            // keeps the index non-negative and within the bitmap.
            return val & (DEFAULT_SIZE - 1);
        }
    }
}
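To tie the demo back to the cache-penetration use case, here is a rough sketch of how a filter might sit in front of a cache. Everything in it is an assumption made for illustration: the class `GuardedCache`, the in-memory `Map` standing in for a slow database, and the smaller bitmap with three seeds. The point is only that a key the filter rejects never reaches the cache or the store.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class GuardedCache {
    private static final int SIZE = 1 << 20;      // bitmap size, a power of two
    private static final int[] SEEDS = {3, 5, 7}; // same style of hash as the demo

    private final BitSet knownKeys = new BitSet(SIZE);
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> store;      // stand-in for a slow database
    private int storeLookups = 0;

    GuardedCache(Map<String, String> store) {
        this.store = store;
        // Register every existing key in the filter up front.
        for (String key : store.keySet()) {
            for (int seed : SEEDS) {
                knownKeys.set(hash(key, seed));
            }
        }
    }

    private static int hash(String s, int seed) {
        int val = 0;
        for (int i = 0; i < s.length(); i++) {
            val = val * seed + s.charAt(i);
        }
        return val & (SIZE - 1);
    }

    private boolean mightExist(String key) {
        for (int seed : SEEDS) {
            if (!knownKeys.get(hash(key, seed))) {
                return false;
            }
        }
        return true;
    }

    /** Returns the value for the key, or null if it does not exist. */
    String get(String key) {
        if (!mightExist(key)) {
            return null; // definitely absent: skip both the cache and the store
        }
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        storeLookups++; // only reached for keys the filter could not rule out
        String value = store.get(key);
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    int storeLookups() {
        return storeLookups;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("user:1", "alice");
        GuardedCache guarded = new GuardedCache(store);
        System.out.println("user:999 -> " + guarded.get("user:999"));
        System.out.println("user:1 -> " + guarded.get("user:1"));
        System.out.println("store lookups: " + guarded.storeLookups());
    }
}
```

A false positive in the filter still falls through to the store, so the guard reduces wasted lookups rather than eliminating them.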
Reference:
https://mp.weixin.qq.com/s?__biz=MzIxMjE5MTE1Nw==&mid=2653191316&idx=1&sn=6b407704c99bda58440e97a2d6dd6ee9&chksm=8c990e4ebbee8758bf207b7fed8267bc1bda957f5864c00b467e2de6f0ae93563740b5527f25&mpshare=1&scene=1&srcid=0927TOixl26f0xogheOaXM1x&key=c38ae561692275b4c85347d76b993d2eeb8bdeaea465676770fb28835462fbc7d92f66816cbf4adb29af15b479e88b00109901f88a846c4c5c921bd228fd1dfa37cdee015d81561d5052c7f31230447c&ascene=0&uin=MjE4MTczNDcwMA%3D%3D&devicetype=iMac+MacBookAir6%2C1+OSX+OSX+10.12.5+build(16F73)&version=12020