zoukankan      html  css  js  c++  java
  • Elasticsearch 第六篇:聚合统计查询

          前面一直没有记录 Elasticsearch 的聚合查询或者其它复杂的查询。本篇做一下笔记,为了方便测试,索引数据依然是第五篇生成的测试索引库 db_student_test ,别名是 student_test 

    第一部分 基本聚合

    1、最大值 max、最小值 min、平均值 avg 、总和 sum

    场景:查询语文、数学、英语 这三科的最大值、最小值、平均值

    POST  http://localhost:9200/student_test1/_search?size=0
    {
        "aggs" : {
            "max_chinese" : { "max" : { "field" : "chinese" } },
            "min_chinese" : { "min" : { "field" : "chinese" } },
            "avg_chinese" : { "avg" : { "field" : "chinese" } },
            "max_math": { "max" : { "field" : "math" } },
            "min_math": { "min" : { "field" : "math" } },
            "avg_math": { "avg" : { "field" : "math" } },
            "max_english": { "max" : { "field" : "english" } },
            "min_english": { "min" : { "field" : "english" } },
            "avg_english": { "avg" : { "field" : "english" } }
        }
    }

    查询结果是:

    {
        "took": 0,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "avg_english": {
                "value": 57.78366490546798
            },
            "max_chinese": {
                "value": 98
            },
            "min_chinese": {
                "value": 25
            },
            "min_math": {
                "value": 15
            },
            "max_english": {
                "value": 98
            },
            "avg_chinese": {
                "value": 59.353859695794505
            },
            "avg_math": {
                "value": 56.92907568735187
            },
            "min_english": {
                "value": 21
            },
            "max_math": {
                "value": 99
            }
        }
    }

    也可以来查询语文科目分数总和,相当于 sql 的 sum 逻辑,虽然在这里并没有什么意义:

    POST  http://localhost:9200/student_test1/_search?size=0
    {
        "aggs" : {
            "sum_chinese" : { "sum" : { "field" : "chinese" } }
        }
    }

    2、求个数,相当于 sql 的 count 逻辑

    场景:查询所有学生总数,这里随便 count 一个 字段就可以,例如数学这个字段

    POST  http://localhost:9200/student_test1/_search?size=0
    {
      "aggs": {
        "age_count": {
          "value_count": {
            "field": "math"
          }
        }
      }
    }

    返回结果是:

    {
        "took": 0,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "age_count": {
                "value": 50084828
            }
        }
    }

    课间总数是:50084828 跟第五篇我们生成的数据总量一致

    3、distinct 聚合,相当于 sql  的  count ( distinct )

    场景:统计语文成绩有多少种值

    POST  http://localhost:9200/student_test1/_search?size=0
    {
        "aggs" : {
            "type_count" : {
                "cardinality" : {
                    "field" : "chinese"
                }
            }
        }
    }

    返回结果是:

    {
        "took": 0,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "type_count": {
                "value": 74
            }
        }
    }

    从结果上看,只有74个不同的分数,与第五篇随机生成数据的规则匹配

    4、统计聚合

    场景:查询语文成绩 总个数、最大值、最小值、平均值、总和等

    POST  http://localhost:9200/student_test1/_search?size=0
    {
      "aggs": {
        "chinese_stats": {
          "stats": {
            "field": "chinese"
          }
        }
      }
    }

    返回结果是:

    {
        "took": 0,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "chinese_stats": {
                "count": 50084828,
                "min": 25,
                "max": 98,
                "avg": 59.353859695794505,
                "sum": 2972727854
            }
        }
    }

    5、加强版统计聚合,查询结果在上面的基础上,加上方差等统计学上的数据

    POST  http://localhost:9200/student_test1/_search?size=0 
    {
      "aggs": {
        "chinese_stats": {
          "extended_stats": {
            "field": "chinese"
          }
        }
      }
    }

    6、分位聚合统计

    默认的分位是 1%  5%  25%  50%  75%  95%  99%  《= 的概念

    分位数的概念:25% 的分位数是 54,意思是小于等于 54 的样本占据了总样本的 25% ,即是 54 这个数将最底层的1/4 的数据分割出来。

    POST  http://localhost:9200/student_test1/_search?size=0 
    {
      "aggs": {
        "chinese_percents": {
          "percentiles": {
            "field": "chinese"
          }
        }
      }
    }

    也可以自定义分位:

    POST  http://localhost:9200/student_test1/_search?size=0 
    {
      "aggs": {
        "chinese_percents": {
          "percentiles": {
            "field": "chinese",
            "percents" : [10,20,30,40,50,60,70,80,90] 
          }
        }
      }
    }

    7、范围聚合统计

    场景:分别查询语文成绩小于40分、小于50分、小于60分的比例

    POST  http://localhost:9200/student_test1/_search?size=0 
    {
      "aggs": {
        "gge_perc_rank": {
          "percentile_ranks": {
            "field": "chinese",
            "values": [40,50,60]
          }
        }
      }
    }

    以上是查询成绩小于40,小于50,小于60的占比,得到的数据是: 21.29%   36.09%   51.12%  可以看到这是一个接近等差的数列,可见测试数据的随机性还是很好的。

    第二部分 其它聚合方式

    1、Term 聚合

    场景:想知道学生的语文成绩,在所有分数值上的个数

    POST  http://localhost:9200/student_test1/_search?size=0
    {
        "aggs" : {
            "genres" : {
                "terms" : { 
                    "field" : "chinese"
                }
            }
        }
    }

    这个查询会将字段Chinese进行聚合,例如87分聚合成一个组,88分聚合成一个组,等等;

    但是这里默认是按组的大小排序,而且不会将所有的组都显示出来,数量太小的组可能被忽略,查询结果如下:

    {
        "took": 1,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "genres": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 42560269,
                "buckets": [
                    {
                        "key": 61,
                        "doc_count": 752863
                    },
                    {
                        "key": 68,
                        "doc_count": 752835
                    },
                    {
                        "key": 55,
                        "doc_count": 752749
                    },
                    {
                        "key": 59,
                        "doc_count": 752444
                    },
                    {
                        "key": 76,
                        "doc_count": 752405
                    },
                    {
                        "key": 74,
                        "doc_count": 752309
                    },
                    {
                        "key": 56,
                        "doc_count": 752283
                    },
                    {
                        "key": 49,
                        "doc_count": 752273
                    },
                    {
                        "key": 52,
                        "doc_count": 752201
                    },
                    {
                        "key": 50,
                        "doc_count": 752197
                    }
                ]
            }
        }
    }

    如果想要自定义筛选条件,Term聚合还可以按照以下设定来查询:

    post  http://localhost:9200/student_test1/_search?size=0
    {
        "aggs" : {
            "genres" : {
                "terms" : { 
                    "field" : "chinese",
                     "size" : 100,                     // 可能有100个不用的分数,我们将全部都展示出来
                     "order" : { "_count" : "asc" },   // 按照组数由小到大排序
                      "min_doc_count": 752200          //过滤条件:组数最小值是752200
                }
            }
        }
    }

    查询结果是:

    {
        "took": 0,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "genres": {
                "doc_count_error_upper_bound": 0,
                "sum_other_doc_count": 0,
                "buckets": [
                    {
                        "key": 52,
                        "doc_count": 752201
                    },
                    {
                        "key": 49,
                        "doc_count": 752273
                    },
                    {
                        "key": 56,
                        "doc_count": 752283
                    },
                    {
                        "key": 74,
                        "doc_count": 752309
                    },
                    {
                        "key": 76,
                        "doc_count": 752405
                    },
                    {
                        "key": 59,
                        "doc_count": 752444
                    },
                    {
                        "key": 55,
                        "doc_count": 752749
                    },
                    {
                        "key": 68,
                        "doc_count": 752835
                    },
                    {
                        "key": 61,
                        "doc_count": 752863
                    }
                ]
            }
        }
    }

     2、Filter 聚合

    Filter 聚合会先进行条件过滤,在进行聚合

    场景:查询华南理工大学的学生的数学科目平均分(先筛选学校,再进行分数统计聚合)

    {
        "aggs" : {
            "scut_math_avg" : {
                "filter" : { "term": { "school": "华南理工大学" } },
                "aggs" : {
                    "avg_price" : { "avg" : { "field" : "math" } }
                }
            }
        }
    }

    查询结果是:

    {
        "took": 0,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "scut_math_avg": {
                "doc_count": 1854993,
                "avg_price": {
                    "value": 56.93080027795253
                }
            }
        }
    }

     3、Filters 多重聚合

    场景:查询各个学校,语文、数学、英语的平均分都是多少,可以采用多重聚合,速度可能有点慢,如下

    POST  http://localhost:9200/student_test1/_search?size=0
    {
      "aggs" : {
        "messages" : {
          "filters" : {
            "filters" : {
              "school_1" :   { "term" : { "school" : "华南理工大学" }},
              "school_2" : { "term" : { "school" : "中山大学" }},
              "school_3" : { "match" : { "school" : "暨南大学" }}
            }
          },
          "aggs" : {
               "avg_chinese" : { "avg" : { "field" : "chinese" } },
               "avg_math" : { "avg" : { "field" : "math" } }
          }
        }
      }
    }

    于是得到结果:

    {
        "took": 0,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "messages": {
                "buckets": {
                    "school_1": {
                        "doc_count": 1854993,
                        "avg_chinese": {
                            "value": 59.353236912484306
                        },
                        "avg_math": {
                            "value": 56.93080027795253
                        }
                    },
                    "school_2": {
                        "doc_count": 1855016,
                        "avg_chinese": {
                            "value": 59.349129064115886
                        },
                        "avg_math": {
                            "value": 56.93540918245449
                        }
                    },
                    "school_3": {
                        "doc_count": 44519876,
                        "avg_chinese": {
                            "value": 59.35397212247402
                        },
                        "avg_math": {
                            "value": 56.92948502372289
                        }
                    }
                }
            }
        }
    }

     4、Range 范围聚合

    场景:想要查询语文成绩各个分数段的人数,可以这样查询

    POST  http://localhost:9200/student_test1/_search?size=0
    {
    "aggs" : { "chinese_ranges" : { "range" : { "field" : "chinese", "ranges" : [ { "to" : 60 }, { "from" : 60, "to" : 75 }, { "from" : 75, "to" : 85 }, { "from" : 85 } ] } } } }

    查询结果是:

    {
        "took": 0,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "chinese_ranges": {
                "buckets": [
                    {
                        "key": "*-60.0",
                        "to": 60,
                        "doc_count": 25096839
                    },
                    {
                        "key": "60.0-75.0",
                        "from": 60,
                        "to": 75,
                        "doc_count": 11278543
                    },
                    {
                        "key": "75.0-85.0",
                        "from": 75,
                        "to": 85,
                        "doc_count": 7424634
                    },
                    {
                        "key": "85.0-*",
                        "from": 85,
                        "doc_count": 6284812
                    }
                ]
            }
        }
    }

    这个返回结果的组名分别是 *-60.0 60.0-75.0 75.0-85.0 85.0-*
    如果我们不想要这样的组名,可以自定义组名,例如:

    POST  http://localhost:9200/student_test1/_search?size=0
    {
        "aggs" : {
            "chinese_ranges" : {
                "range" : {
                    "field" : "chinese",
                    "keyed" : true,
                    "ranges" : [
                        { "key" : "不及格", "to" : 60 },
                        { "key" : "及格", "from" : 60, "to" : 75 },
                        { "key" : "良好", "from" : 75, "to" : 85 },
                        { "key" : "优秀", "from" : 85 }
                    ]
                }
            }
        }
    }

    查询结果将会是:

    {
        "took": 1675,
        "timed_out": false,
        "_shards": {
            "total": 1,
            "successful": 1,
            "skipped": 0,
            "failed": 0
        },
        "hits": {
            "total": {
                "value": 10000,
                "relation": "gte"
            },
            "max_score": null,
            "hits": []
        },
        "aggregations": {
            "chinese_ranges": {
                "buckets": {
                    "不及格": {
                        "to": 60,
                        "doc_count": 25096839
                    },
                    "及格": {
                        "from": 60,
                        "to": 75,
                        "doc_count": 11278543
                    },
                    "良好": {
                        "from": 75,
                        "to": 85,
                        "doc_count": 7424634
                    },
                    "优秀": {
                        "from": 85,
                        "doc_count": 6284812
                    }
                }
            }
        }
    }

     还有其它各种各样的、复杂的聚合查询,都是可以网上查资料,甚至还支持推荐系统的一些计算方法,例如矩阵的概念等等。

     还可以参考 https://blog.csdn.net/alex_xfboy/article/details/86100037

  • 相关阅读:
    进程,线程(thread)
    PHP权限分配思路
    xml simpleXML_load_file(), simpleXML_load_string()
    XML基础
    php QQ登录
    ci验证码
    jquery中的 .html(),.val().text()
    ci 用本身 email 类发 email
    Objective-C 装饰模式--简单介绍和使用
    Objective-C 外观模式--简单介绍和使用
  • 原文地址:https://www.cnblogs.com/vincentfhr/p/13937040.html
Copyright © 2011-2022 走看看