  • Approximation Error in ElasticSearch Cardinality Aggregation

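    I have not been using ES for long. Today I found abnormal data in the production environment, which runs ES 2.1.2 (other versions behave similarly). Querying through the ES HTTP API returned results that differed from those returned through the Java client API, which made me suspect both the code logic and the ES query tool. Checking the official documentation, I found the following description: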

    Precision control

    This aggregation also supports the precision_threshold option:

    Warning

    The precision_threshold option is specific to the current internal implementation of the cardinality agg, which may change in the future

    {
        "aggs" : {
            "author_count" : {
                "cardinality" : {
                    "field" : "author_hash",
                    "precision_threshold": 100 
                }
            }
        }
    }

    The precision_threshold option allows you to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number have the same effect as a threshold of 40000. The default value depends on the number of parent aggregations that create multiple buckets (such as terms or histograms).

    Counts are approximate

    Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
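To make the cost of exact counting concrete, here is a minimal sketch of the hash-set approach described above. The "shards" are plain in-memory lists and the class and method names are hypothetical; this is not how Elasticsearch is implemented, only an illustration of why every distinct value has to be kept and shipped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: exact distinct counting with hash sets.
// The "shards" here are in-memory lists, not Elasticsearch shards.
final class ExactDistinctCount {

    // Each shard must keep every distinct value it has seen.
    static Set<String> distinctOnShard(List<String> shardValues) {
        return new HashSet<>(shardValues);
    }

    // The coordinating side must receive all per-shard sets and union them,
    // because the same value may occur on more than one shard.
    static long mergeAndCount(List<Set<String>> perShardSets) {
        Set<String> merged = new HashSet<>();
        for (Set<String> shardSet : perShardSets) {
            merged.addAll(shardSet);
        }
        return merged.size();
    }

    public static void main(String[] args) {
        Set<String> shard0 = distinctOnShard(Arrays.asList("alice", "bob", "alice"));
        Set<String> shard1 = distinctOnShard(Arrays.asList("bob", "carol"));
        System.out.println(mergeAndCount(Arrays.asList(shard0, shard1))); // prints 3
    }
}
```

Memory and network traffic both grow with the number of distinct values, which is the scaling problem the quoted passage refers to.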

    This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:

    • configurable precision, which decides on how to trade memory for accuracy,
    • excellent accuracy on low-cardinality sets,
    • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
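The properties listed above can be seen in a small sketch of plain HyperLogLog. This is an assumption-laden illustration, not the HyperLogLog++ variant and not Elasticsearch's code: the class name, the SplitMix64-style hash, and the restriction to `long` inputs are all choices made here for brevity.

```java
// A minimal sketch of plain HyperLogLog (NOT HyperLogLog++ and NOT the
// Elasticsearch implementation). All names here are hypothetical.
final class SimpleHyperLogLog {
    private final int p;            // precision: number of index bits
    private final int m;            // number of registers = 2^p
    private final byte[] registers; // fixed-size state, independent of input size

    SimpleHyperLogLog(int p) {
        if (p < 7 || p > 18) throw new IllegalArgumentException("p must be in [7, 18]");
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    int registerCount() { return m; }

    // 64-bit mixer (SplitMix64 finalizer), used here as the hash function.
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    void add(long value) {
        long h = hash(value);
        int index = (int) (h >>> (64 - p)); // top p bits select a register
        long rest = h << p;                 // remaining 64 - p bits, left-aligned
        // Position of the first 1-bit in the remaining bits, capped if they are all zero.
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - p + 1);
        if (rank > registers[index]) registers[index] = (byte) rank;
    }

    // Merging two sketches keeps the per-register maximum, so per-shard
    // sketches can be combined without shipping the raw values.
    void merge(SimpleHyperLogLog other) {
        if (other.p != p) throw new IllegalArgumentException("precision mismatch");
        for (int i = 0; i < m; i++) {
            if (other.registers[i] > registers[i]) registers[i] = other.registers[i];
        }
    }

    long estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant, valid for m >= 128
        double raw = alpha * m * (double) m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            return Math.round(m * Math.log((double) m / zeros));
        }
        return Math.round(raw);
    }

    public static void main(String[] args) {
        SimpleHyperLogLog hll = new SimpleHyperLogLog(14);
        for (long i = 0; i < 100_000; i++) hll.add(i);
        System.out.println("registers: " + hll.registerCount());
        System.out.println("estimate for 100000 distinct values: " + hll.estimate());
    }
}
```

The register array never grows, which is the "fixed memory usage" property; raising `p` spends more memory to reduce the estimation error.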

    For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
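As a rough worked example of the "c * 8 bytes" rule, combined with the 40000 cap mentioned earlier, the following sketch computes the approximate memory for a few thresholds. The class and method names are hypothetical, and the figures are only the estimate quoted in the documentation, not a measurement.

```java
// Back-of-the-envelope estimate based on the "c * 8 bytes" rule quoted above.
// Names are hypothetical; the numbers are approximations, not measurements.
final class PrecisionThresholdMemory {
    // Thresholds above this value behave like this value, per the quoted documentation.
    static final long MAX_THRESHOLD = 40_000L;

    static long effectiveThreshold(long requested) {
        if (requested < 0) throw new IllegalArgumentException("threshold must be >= 0");
        return Math.min(requested, MAX_THRESHOLD);
    }

    static long approxBytes(long requested) {
        return effectiveThreshold(requested) * 8L;
    }

    public static void main(String[] args) {
        for (long c : new long[] {100L, 3_000L, 40_000L, 100_000L}) {
            System.out.println(c + " -> about " + approxBytes(c) + " bytes");
        }
        // 100 -> about 800 bytes
        // 3000 -> about 24000 bytes
        // 40000 -> about 320000 bytes
        // 100000 -> about 320000 bytes (capped at 40000)
    }
}
```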

    The following chart shows how the error varies before and after the threshold:

    [Figure: images/cardinality_error.png]

    For all 3 thresholds, counts have been accurate up to the configured threshold (although not guaranteed, this is likely to be the case). Please also note that even with a threshold as low as 100, the error remains under 5%, even when counting millions of items.

     

     

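    In short: the cardinality aggregation returns approximate counts. According to the documentation the error stays under 5%, and the accuracy can be tuned with the "precision_threshold" parameter.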

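    So I went through the query code, and adding the precisionThreshold setting shown in the following snippet solved the problem. When this parameter is not set in the query, the default value is 3000 (this is the default given in later versions of the documentation; the 2.x text quoted above says it depends on the number of parent bucket aggregations).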

      

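    // Imports assume Elasticsearch 2.x and Spring Data Elasticsearch.
    import org.elasticsearch.index.query.BoolQueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.aggregations.bucket.filter.FilterAggregationBuilder;
    import org.elasticsearch.search.aggregations.bucket.terms.TermsBuilder;
    import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;

    private void buildSearchQueryForAgg(NativeSearchQueryBuilder nativeSearchQueryBuilder) {
        // Set up the top-level terms aggregation
        TermsBuilder agg = AggregationBuilders.terms(aggreName).field(XXX.XXX).size(Integer.MAX_VALUE);

        // Build the query conditions for the filter aggregation
        BoolQueryBuilder packBoolQuery = QueryBuilders.boolQuery();
        FilterAggregationBuilder packAgg = AggregationBuilders.filter(xxx).filter(packBoolQuery);

        // Set the precision threshold explicitly on the cardinality sub-aggregation
        // (CARDINALITY_PRECISION_THRESHOLD is a constant defined elsewhere in the class)
        packAgg.subAggregation(
                AggregationBuilders.cardinality(xxx).field(ZZZZ.XXX)
                        .precisionThreshold(CARDINALITY_PRECISION_THRESHOLD));
        agg.subAggregation(packAgg);

        nativeSearchQueryBuilder.addAggregation(agg);
    }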
  • Original article: https://www.cnblogs.com/sunlightlee/p/10567874.html