  • Elasticsearch deduplication

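    I recently got into Elasticsearch and needed a count in which a value of a given field is counted once no matter how many times it appears (a distinct count). There are two ways to do it: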

    1. cardinality (on the duplicated field)

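    If a small error is acceptable, use cardinality (this is on 2.x; other versions are similar). Counts up to about 40000, the largest precision_threshold, are essentially exact, and however many documents the query covers, even millions, the error stays below 5%. From the official documentation: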

    This example will ensure that fields with 100 or fewer distinct values will be extremely accurate. Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric.

    For a given threshold, the HLL data-structure will use about precision_threshold * 8 bytes of memory. So you must balance how much memory you are willing to sacrifice for additional accuracy.

    Practically speaking, a threshold of 100 maintains an error under 5% even when counting millions of unique values.
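
    To put numbers on the memory figure quoted above, here is a minimal Java sketch. It relies only on the "precision_threshold * 8 bytes" rule from the quote; the names HllMemoryEstimate and estimateBytes are made up for illustration and are not part of Elasticsearch.

```java
// Illustration only: HllMemoryEstimate and estimateBytes are hypothetical names,
// not part of the Elasticsearch API.
final class HllMemoryEstimate {

    // Approximate memory use, following the "precision_threshold * 8 bytes" rule quoted above.
    static long estimateBytes(int precisionThreshold) {
        if (precisionThreshold < 0) {
            throw new IllegalArgumentException("precisionThreshold must be non-negative");
        }
        return (long) precisionThreshold * 8L;
    }

    public static void main(String[] args) {
        int[] thresholds = {100, 1000, 40000};
        for (int t : thresholds) {
            System.out.println("precision_threshold=" + t + " -> about " + estimateBytes(t) + " bytes");
        }
    }
}
```

    By this rule a threshold of 100 costs about 800 bytes, and 40000 costs about 320000 bytes.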

    2. terms (on the duplicated field)

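    With this approach, remember that terms().field() returns only 10 buckets by default, so append .size(0) to get them all (on 2.x, a size of 0 means no limit). I forgot, which is why so many of my counts came out as 10 or fewer. A mistake like that is forgivable once, but take it as a lesson afterwards. The count itself now becomes getBuckets().size(); also check whether getBuckets().get(0).getDocCount() is 0, and do not count that bucket when it is. The result is exact, but it is slower and needs several checks.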
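
    The counting step described above can be sketched in plain Java. This is only an illustration under stated assumptions: Bucket is a hypothetical stand-in for a terms-aggregation bucket so that the code runs without an Elasticsearch dependency, and the sketch checks the doc count of every bucket rather than only the first one, which is my own choice.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: Bucket is a hypothetical stand-in for one terms-aggregation bucket,
// not the Elasticsearch class of the same name.
final class DistinctFromBuckets {

    static final class Bucket {
        private final String key;
        private final long docCount;

        Bucket(String key, long docCount) {
            this.key = key;
            this.docCount = docCount;
        }

        String getKey() { return key; }
        long getDocCount() { return docCount; }
    }

    // Counts distinct values: one per bucket, skipping buckets whose doc count is 0.
    static int countDistinct(List<Bucket> buckets) {
        int distinct = 0;
        for (Bucket b : buckets) {
            if (b.getDocCount() > 0) {
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<Bucket> buckets = Arrays.asList(
                new Bucket("alice", 3),
                new Bucket("bob", 1),
                new Bucket("ghost", 0)); // empty bucket, not counted
        System.out.println(countDistinct(buckets)); // prints 2
    }
}
```

    With real results the list would come from getBuckets() on the terms aggregation instead of being built by hand.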

    If you know a better way, please leave a comment so everyone can try it.

  • Original article: https://www.cnblogs.com/antime/p/7814614.html