Elasticsearch Deduplication


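I recently hit a pitfall with Elasticsearch. I needed a count over all documents in which a value that appears many times in a given field is counted only once. There are two ways to do this: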

1. cardinality (on the duplicated field)

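If a small error is acceptable, you can use cardinality. This applies to 2.x, and other versions are similar. Results are basically accurate up to 40,000, and however many documents are queried, even millions, the error rate stays below 5%. The official description reads: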

This example will ensure that fields with 100 or fewer distinct values will be extremely accurate. Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric.

For a given threshold, the HLL data-structure will use about precision_threshold * 8 bytes of memory. So you must balance how much memory you are willing to sacrifice for additional accuracy.

Practically speaking, a threshold of 100 maintains an error under 5% even when counting millions of unique values.
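To make the memory trade-off concrete, here is a minimal sketch in plain Java. It applies the precision_threshold * 8 rule quoted above and builds a search body containing one cardinality aggregation. The class name, the method names, the aggregation name distinct_count and the field user_id are all hypothetical, chosen only for illustration. The sketch does not talk to a cluster, and it assumes the field name needs no JSON escaping.

```java
final class CardinalitySketch {

    /** Approximate HLL memory from the quoted rule: precision_threshold * 8 bytes. */
    static long hllMemoryBytes(int precisionThreshold) {
        return (long) precisionThreshold * 8L;
    }

    /**
     * Builds a search body with a single cardinality aggregation.
     * "distinct_count" is a hypothetical aggregation name; the field name is
     * inserted as-is, so it is assumed to contain no characters that need escaping.
     */
    static String cardinalityRequestBody(String field, int precisionThreshold) {
        return "{\"size\":0,\"aggs\":{\"distinct_count\":{\"cardinality\":{"
                + "\"field\":\"" + field + "\","
                + "\"precision_threshold\":" + precisionThreshold + "}}}}";
    }

    public static void main(String[] args) {
        System.out.println(hllMemoryBytes(100));    // 800 bytes
        System.out.println(hllMemoryBytes(40000));  // 320000 bytes
        System.out.println(cardinalityRequestBody("user_id", 100));
    }
}
```

By this rule, even a threshold of 40,000 costs only about 320 KB per aggregation, so raising the threshold is usually cheap compared with the exact terms-based approach.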

2. terms (on the duplicated field)

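With this approach, remember that terms().field() returns only 10 buckets by default, so you need to append .size(0) to it (in 2.x, a size of 0 returns all buckets). I forgot this, which is why many of my counts came out as 10 or fewer. Getting it wrong once is forgivable, but take it as a lesson afterwards. The count itself then becomes getBuckets().size(), and you also need to check whether getBuckets().get(0).getDocCount() is 0 and leave the bucket out of the count if it is. The result is exact, but the approach is slow and needs several checks.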
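The pitfall described above can be reproduced without a cluster. The following is a minimal in-memory sketch, not the Elasticsearch client API: Bucket is a hypothetical stand-in for the client's bucket type, termsBuckets imitates a terms aggregation with the size semantics described in the post (0 meaning all buckets), and countDistinct skips every bucket with a zero document count, which generalizes the check on the first bucket.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TermsDistinctCountSketch {

    /** Hypothetical stand-in for a terms-aggregation bucket: a key and its document count. */
    static final class Bucket {
        final String key;
        final long docCount;

        Bucket(String key, long docCount) {
            this.key = key;
            this.docCount = docCount;
        }
    }

    /**
     * Imitates a terms aggregation in memory: groups values into buckets,
     * orders them by document count (descending), and returns at most
     * `size` buckets. A size of 0 means "return all buckets".
     */
    static List<Bucket> termsBuckets(List<String> values, int size) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String v : values) {
            counts.merge(v, 1L, Long::sum);
        }
        List<Bucket> buckets = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            buckets.add(new Bucket(e.getKey(), e.getValue()));
        }
        buckets.sort((a, b) -> Long.compare(b.docCount, a.docCount));
        if (size > 0 && buckets.size() > size) {
            return new ArrayList<>(buckets.subList(0, size));
        }
        return buckets;
    }

    /** Counts distinct values as the number of buckets with a non-zero document count. */
    static int countDistinct(List<Bucket> buckets) {
        int distinct = 0;
        for (Bucket b : buckets) {
            if (b.docCount > 0) {
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            values.add("user-" + i);
            values.add("user-" + i); // every value appears twice
        }
        System.out.println(countDistinct(termsBuckets(values, 10))); // 10: truncated by the default size
        System.out.println(countDistinct(termsBuckets(values, 0)));  // 25: all buckets returned
    }
}
```

With 25 distinct values, the default size of 10 caps the result at 10, which matches the symptom of counts that never exceed 10.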

If you know a better method, please leave a comment so everyone can try it.


Original article: https://www.cnblogs.com/antime/p/7814614.html