  • Elasticsearch Aggregation Analysis

    Aggregation syntax

    Metric: single-value and multi-value output

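    Aggregations are part of the Search API. In most cases it is best to set size to 0, so the response contains only the aggregation results and no search hits. The examples below use salary statistics.

    Salary statistics

    First, index the salary data: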

    DELETE /employees
    PUT /employees/
    {
      "mappings" : {
          "properties" : {
            "age" : {
              "type" : "integer"
            },
            "gender" : {
              "type" : "keyword"
            },
            "job" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 50
                }
              }
            },
            "name" : {
              "type" : "keyword"
            },
            "salary" : {
              "type" : "integer"
            }
          }
        }
    }
    
    PUT /employees/_bulk
    { "index" : {  "_id" : "1" } }
    { "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
    { "index" : {  "_id" : "2" } }
    { "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
    { "index" : {  "_id" : "3" } }
    { "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
    { "index" : {  "_id" : "4" } }
    { "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
    { "index" : {  "_id" : "5" } }
    { "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
    { "index" : {  "_id" : "6" } }
    { "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
    { "index" : {  "_id" : "7" } }
    { "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
    { "index" : {  "_id" : "8" } }
    { "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
    { "index" : {  "_id" : "9" } }
    { "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
    { "index" : {  "_id" : "10" } }
    { "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
    { "index" : {  "_id" : "11" } }
    { "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
    { "index" : {  "_id" : "12" } }
    { "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
    { "index" : {  "_id" : "13" } }
    { "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
    { "index" : {  "_id" : "14" } }
    { "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
    { "index" : {  "_id" : "15" } }
    { "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
    { "index" : {  "_id" : "16" } }
    { "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
    { "index" : {  "_id" : "17" } }
    { "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
    { "index" : {  "_id" : "18" } }
    { "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
    { "index" : {  "_id" : "19" } }
    { "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
    { "index" : {  "_id" : "20" } }
    { "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}
    
    # Multiple metric aggregations: find the minimum, maximum, and average salary
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "max_salary": {
          "max": {
            "field": "salary"
          }
        },
        "min_salary": {
          "min": {
            "field": "salary"
          }
        },
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    }
    
    # Metric aggregation: find the lowest salary
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "min_salary": {
          "min": {
            "field":"salary"
          }
        }
      }
    }
    
    # Metric aggregation: find the highest salary
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "max_salary": {
          "max": {
            "field":"salary"
          }
        }
      }
    }
    
    # A single aggregation that outputs multiple values
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "stats_salary": {
          "stats": {
            "field":"salary"
          }
        }
      }
    }
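
    To make the expected numbers concrete, here is a minimal local sketch in plain Python (not the Elasticsearch client) that mimics what the min, max, avg, and stats aggregations compute over the sample salaries. The helper name `stats` is hypothetical and only mirrors the field names of a stats response.

```python
# Salaries copied from the bulk request above (document ids 1 to 20).
SALARIES = [35000, 50000, 18000, 22000, 18000, 25000, 20000, 20000, 22000, 9000,
            38000, 32000, 30000, 25000, 28000, 16000, 16000, 20000, 30000, 20000]


def stats(values):
    """Return the five numbers a `stats` aggregation reports (hypothetical helper)."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


result = stats(SALARIES)
print(result)
```

    The single-value aggregations (min, max, avg) each correspond to one entry of this dictionary, while stats returns all of them in a single response.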
    

    Bucket: Terms and numeric ranges

    Bucket

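    A bucket aggregation assigns documents to different buckets according to some rule, which classifies them. Some common bucket aggregations provided by ES:

    • terms
    • Numeric types: Range / Date Range, Histogram / Date Histogram
    • Nesting is supported: a bucket can be split into further buckets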

    Terms aggregation

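    A terms aggregation needs the field's values available through doc_values or fielddata. keyword fields have doc_values enabled by default, so they work out of the box. Text fields need fielddata enabled in the mapping first, and the buckets are then built from the analyzed tokens rather than from the original string.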

    # Terms aggregation on a keyword field
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field":"job.keyword"
          }
        }
      }
    }
    
    
    # Terms aggregation on a text field: fails, because fielddata is disabled by default
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field":"job"
          }
        }
      }
    }
    
    # Enable fielddata on the text field so that it supports terms aggregation
    PUT employees/_mapping
    {
      "properties" : {
        "job":{
           "type":     "text",
           "fielddata": true
        }
      }
    }
    
    # Terms aggregation on the text field: the buckets are the analyzed tokens
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field":"job"
          }
        }
      }
    }
    
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field":"job.keyword"
          }
        }
      }
    }
    
    
    

    Cardinality, similar to DISTINCT in SQL

    # Cardinality of job (analyzed tokens); the same request on job.keyword returns a different count
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "cardinate": {
          "cardinality": {
            "field": "job"
          }
        }
      }
    }
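
    The difference between job and job.keyword can be sketched locally in plain Python. This is an illustration only: `analyze` is a hypothetical stand-in that assumes the default analyzer lowercases the text and splits it on whitespace, which is enough for these job titles. Note also that cardinality in Elasticsearch is an approximate count, while this sketch counts exactly.

```python
from collections import Counter

# Job values from the bulk request above, grouped by value (20 documents in total).
JOBS = (["Product Manager"] + ["Dev Manager"] + ["Web Designer"] * 2 + ["QA"] * 3
        + ["Java Programmer"] * 7 + ["Javascript Programmer"] * 4 + ["DBA"] * 2)


def analyze(text):
    # Assumption: lowercase + whitespace split approximates the default analyzer here.
    return text.lower().split()


# Terms buckets on job.keyword: one bucket per exact, unanalyzed value.
keyword_buckets = Counter(JOBS)

# Terms buckets on job (fielddata enabled): one bucket per token,
# counting each document at most once per token.
text_buckets = Counter(tok for job in JOBS for tok in set(analyze(job)))

keyword_cardinality = len(keyword_buckets)
text_cardinality = len(text_buckets)
print(keyword_cardinality, text_cardinality)
print(text_buckets.most_common(3))
```

    On this data the keyword field yields 7 distinct values, while the analyzed field yields 10 distinct tokens, with "programmer" shared by the Java and Javascript job titles.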
    

    Bucket Size & Top Hits Demo

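    • Use case: after bucketing, get the list of top matching documents inside each bucket
    • Size: bucket by age and return only the specified number of buckets
    • Top Hits: show the 3 oldest employees in each job type

    # Terms aggregation on the gender keyword field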
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "gender": {
          "terms": {
            "field":"gender"
          }
        }
      }
    }
    
    
    # Limit the number of buckets returned with size
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "ages_5": {
          "terms": {
            "field":"age",
            "size":3
          }
        }
      }
    }
    
    
    
    # With size set, return the details of the 3 oldest employees in each job type
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field":"job.keyword"
          },
          "aggs":{
            "old_employee":{
              "top_hits":{
                "size":3,
                "sort":[
                  {
                    "age":{
                      "order":"desc"
                    }
                  }
                ]
              }
            }
          }
        }
      }
    }
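
    As a rough local illustration of the two requests above, the following plain-Python sketch reproduces the bucket size limit and the top_hits sub-aggregation on the sample data. It assumes that terms buckets are ordered by document count descending and that ties are broken by ascending key; the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# (name, age, job) copied from the bulk request above.
EMPLOYEES = [
    ("Emma", 32, "Product Manager"), ("Underwood", 41, "Dev Manager"),
    ("Tran", 25, "Web Designer"), ("Rivera", 26, "Web Designer"),
    ("Rose", 25, "QA"), ("Lucy", 31, "QA"), ("Byrd", 27, "QA"),
    ("Foster", 27, "Java Programmer"), ("Gregory", 32, "Java Programmer"),
    ("Bryant", 20, "Java Programmer"), ("Jenny", 36, "Java Programmer"),
    ("Mcdonald", 31, "Java Programmer"), ("Jonthna", 30, "Java Programmer"),
    ("Marshall", 32, "Javascript Programmer"), ("King", 33, "Java Programmer"),
    ("Mccarthy", 21, "Javascript Programmer"), ("Goodwin", 25, "Javascript Programmer"),
    ("Catherine", 29, "Javascript Programmer"), ("Boone", 30, "DBA"), ("Kathy", 29, "DBA"),
]

# terms on age with size 3: keep only the 3 largest buckets.
# Assumption: order by count descending, then by key ascending.
age_counts = Counter(age for _, age, _ in EMPLOYEES)
top_ages = sorted(age_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3]

# terms on job.keyword with a top_hits sub-aggregation sorted by age descending.
by_job = defaultdict(list)
for name, age, job in EMPLOYEES:
    by_job[job].append((name, age))
oldest = {job: sorted(members, key=lambda m: -m[1])[:3]
          for job, members in by_job.items()}

print(top_ages)
print(oldest["Java Programmer"])
```

    Buckets with fewer than 3 documents, such as Dev Manager, simply return all of their documents.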
    

    Range & Histogram aggregations

    # Salary range buckets; you can define your own key for a bucket
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "salary_range": {
          "range": {
            "field":"salary",
            "ranges":[
              {
                "to":10000
              },
              {
                "from":10000,
                "to":20000
              },
              {
                "key":">20000",
                "from":20000
              }
            ]
          }
        }
      }
    }
    
    
    # Salary histogram: salaries from 0 to 100,000, bucketed in intervals of 5000
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "salary_histogram": {
          "histogram": {
            "field":"salary",
            "interval":5000,
            "extended_bounds":{
              "min":0,
              "max":100000
            }
          }
        }
      }
    }
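
    The bucketing rules of the two requests above can be sketched locally in plain Python. In a range aggregation, from is inclusive and to is exclusive; a histogram rounds each value down to a multiple of the interval, and extended_bounds makes the empty buckets between min and max appear as well. The bucket labels used here are illustrative, not the keys Elasticsearch generates.

```python
from collections import Counter

# Salaries copied from the bulk request above.
SALARIES = [35000, 50000, 18000, 22000, 18000, 25000, 20000, 20000, 22000, 9000,
            38000, 32000, 30000, 25000, 28000, 16000, 16000, 20000, 30000, 20000]


def range_bucket(salary):
    # `from` is inclusive, `to` is exclusive.
    if salary < 10000:
        return "<10000"          # illustrative label
    if salary < 20000:
        return "10000-20000"     # illustrative label
    return ">20000"              # custom key from the request


range_counts = Counter(range_bucket(s) for s in SALARIES)

INTERVAL = 5000
# extended_bounds 0..100000: pre-create every bucket so empty ones are reported too.
histogram = {key: 0 for key in range(0, 100000 + 1, INTERVAL)}
for s in SALARIES:
    histogram[(s // INTERVAL) * INTERVAL] += 1

print(dict(range_counts))
print({k: v for k, v in histogram.items() if v})
```

    A salary of exactly 20000 lands in the ">20000" bucket, because that bucket's from bound is inclusive.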
    

    Nested aggregations

    # Nested aggregation 1: bucket by job type, then compute salary statistics
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "Job_salary_stats": {
          "terms": {
            "field": "job.keyword"
          },
          "aggs": {
            "salary": {
              "stats": {
                "field": "salary"
              }
            }
          }
        }
      }
    }
    
    # Multiple levels of nesting: bucket by job type, then by gender, then compute salary statistics
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "Job_gender_stats": {
          "terms": {
            "field": "job.keyword"
          },
          "aggs": {
            "gender_stats": {
              "terms": {
                "field": "gender"
              },
              "aggs": {
                "salary_stats": {
                  "stats": {
                    "field": "salary"
                  }
                }
              }
            }
          }
        }
      }
    }
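
    The two-level nesting above behaves like a group-by on job, then on gender, with stats computed in the innermost group. A minimal local sketch in plain Python (hypothetical helper names, not the Elasticsearch client):

```python
from collections import defaultdict

# (job, gender, salary) copied from the bulk request above.
ROWS = [
    ("Product Manager", "female", 35000), ("Dev Manager", "male", 50000),
    ("Web Designer", "male", 18000), ("Web Designer", "female", 22000),
    ("QA", "female", 18000), ("QA", "female", 25000), ("QA", "male", 20000),
    ("Java Programmer", "male", 20000), ("Java Programmer", "male", 22000),
    ("Java Programmer", "male", 9000), ("Java Programmer", "female", 38000),
    ("Java Programmer", "male", 32000), ("Java Programmer", "female", 30000),
    ("Javascript Programmer", "male", 25000), ("Java Programmer", "male", 28000),
    ("Javascript Programmer", "male", 16000), ("Javascript Programmer", "male", 16000),
    ("Javascript Programmer", "female", 20000), ("DBA", "male", 30000),
    ("DBA", "female", 20000),
]


def stats(values):
    """Same five numbers a `stats` aggregation reports (hypothetical helper)."""
    return {"count": len(values), "min": min(values), "max": max(values),
            "avg": sum(values) / len(values), "sum": sum(values)}


# Outer bucket: job. Inner bucket: gender. Leaf: salary values.
groups = defaultdict(lambda: defaultdict(list))
for job, gender, salary in ROWS:
    groups[job][gender].append(salary)

nested = {job: {gender: stats(values) for gender, values in genders.items()}
          for job, genders in groups.items()}

print(nested["QA"])
```

    An inner bucket exists only where at least one document falls into it, so Dev Manager has a male bucket and no female bucket.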
    

    Pipeline aggregations

    Sibling Pipeline: min_bucket

    Parent Pipeline: Derivative

    # Job type with the lowest average salary
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword",
            "size": 10
          },
          "aggs": {
            "avg_salary": {
              "avg": {
                "field": "salary"
              }
            }
          }
        },
        "min_salary_by_job":{
          "min_bucket": {
            "buckets_path": "jobs>avg_salary"
          }
        }
      }
    }
    
    
    # Job type with the highest average salary
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword",
            "size": 10
          },
          "aggs": {
            "avg_salary": {
              "avg": {
                "field": "salary"
              }
            }
          }
        },
        "max_salary_by_job":{
          "max_bucket": {
            "buckets_path": "jobs>avg_salary"
          }
        }
      }
    }
    
    
    # Average of the per-job average salaries
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword",
            "size": 10
          },
          "aggs": {
            "avg_salary": {
              "avg": {
                "field": "salary"
              }
            }
          }
        },
        "avg_salary_by_job":{
          "avg_bucket": {
            "buckets_path": "jobs>avg_salary"
          }
        }
      }
    }
    
    
    # Statistics over the per-job average salaries
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword",
            "size": 10
          },
          "aggs": {
            "avg_salary": {
              "avg": {
                "field": "salary"
              }
            }
          }
        },
        "stats_salary_by_job":{
          "stats_bucket": {
            "buckets_path": "jobs>avg_salary"
          }
        }
      }
    }
    
    
    # Percentiles of the per-job average salaries
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword",
            "size": 10
          },
          "aggs": {
            "avg_salary": {
              "avg": {
                "field": "salary"
              }
            }
          }
        },
        "percentiles_salary_by_job":{
          "percentiles_bucket": {
            "buckets_path": "jobs>avg_salary"
          }
        }
      }
    }
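
    The sibling pipeline requests above all follow the same pattern: first compute avg_salary per job bucket, then run a second calculation over those bucket values (the path jobs>avg_salary). A minimal local sketch in plain Python, with hypothetical variable names, shows min_bucket, max_bucket, and avg_bucket on the sample data:

```python
from collections import defaultdict

# (job, salary) copied from the bulk request above.
ROWS = [
    ("Product Manager", 35000), ("Dev Manager", 50000),
    ("Web Designer", 18000), ("Web Designer", 22000),
    ("QA", 18000), ("QA", 25000), ("QA", 20000),
    ("Java Programmer", 20000), ("Java Programmer", 22000), ("Java Programmer", 9000),
    ("Java Programmer", 38000), ("Java Programmer", 32000), ("Java Programmer", 30000),
    ("Javascript Programmer", 25000), ("Java Programmer", 28000),
    ("Javascript Programmer", 16000), ("Javascript Programmer", 16000),
    ("Javascript Programmer", 20000), ("DBA", 30000), ("DBA", 20000),
]

# Step 1: the "jobs" terms aggregation with its "avg_salary" sub-aggregation.
salaries_by_job = defaultdict(list)
for job, salary in ROWS:
    salaries_by_job[job].append(salary)
avg_by_job = {job: sum(v) / len(v) for job, v in salaries_by_job.items()}

# Step 2: sibling pipelines read the bucket values produced in step 1.
min_value = min(avg_by_job.values())
max_value = max(avg_by_job.values())
min_bucket = {"value": min_value,
              "keys": [job for job, avg in avg_by_job.items() if avg == min_value]}
max_bucket = {"value": max_value,
              "keys": [job for job, avg in avg_by_job.items() if avg == max_value]}
avg_bucket = sum(avg_by_job.values()) / len(avg_by_job)

print(min_bucket)
print(max_bucket)
print(round(avg_bucket, 2))
```

    The result sits next to the jobs aggregation in the response rather than inside each bucket, which is what makes these pipelines siblings.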
    
    
    
    # Derivative of the average salary with respect to age
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "age": {
          "histogram": {
            "field": "age",
            "min_doc_count": 1,
            "interval": 1
          },
          "aggs": {
            "avg_salary": {
              "avg": {
                "field": "salary"
              }
            },
            "derivative_avg_salary":{
              "derivative": {
                "buckets_path": "avg_salary"
              }
            }
          }
        }
      }
    }
    
    
    # Cumulative sum of the average salary across the age buckets
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "age": {
          "histogram": {
            "field": "age",
            "min_doc_count": 1,
            "interval": 1
          },
          "aggs": {
            "avg_salary": {
              "avg": {
                "field": "salary"
              }
            },
            "cumulative_salary":{
              "cumulative_sum": {
                "buckets_path": "avg_salary"
              }
            }
          }
        }
      }
    }
    
    # Moving function: minimum of avg_salary over a sliding window of 10 buckets
    POST employees/_search
    {
      "size": 0,
      "aggs": {
        "age": {
          "histogram": {
            "field": "age",
            "min_doc_count": 1,
            "interval": 1
          },
          "aggs": {
            "avg_salary": {
              "avg": {
                "field": "salary"
              }
            },
            "moving_avg_salary":{
              "moving_fn": {
                "buckets_path": "avg_salary",
                "window":10,
                "script": "MovingFunctions.min(values)"
              }
            }
          }
        }
      }
    }
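
    Parent pipelines run inside the histogram and add a value to each bucket, computed from that bucket and its predecessors. The following plain-Python sketch mimics the three requests above under two assumptions: only non-empty age buckets are present (as min_doc_count: 1 requests), and the moving window covers the previous 10 buckets and excludes the current one. A bucket with no predecessor is represented here as None.

```python
from collections import defaultdict
from itertools import accumulate

# (age, salary) copied from the bulk request above.
ROWS = [
    (32, 35000), (41, 50000), (25, 18000), (26, 22000), (25, 18000),
    (31, 25000), (27, 20000), (27, 20000), (32, 22000), (20, 9000),
    (36, 38000), (31, 32000), (30, 30000), (32, 25000), (33, 28000),
    (21, 16000), (25, 16000), (29, 20000), (30, 30000), (29, 20000),
]

# Histogram on age with interval 1, keeping only non-empty buckets.
salaries_by_age = defaultdict(list)
for age, salary in ROWS:
    salaries_by_age[age].append(salary)
ages = sorted(salaries_by_age)
avg_salary = [sum(salaries_by_age[a]) / len(salaries_by_age[a]) for a in ages]

# derivative: difference to the previous bucket; the first bucket has none.
derivative = [None] + [cur - prev for prev, cur in zip(avg_salary, avg_salary[1:])]

# cumulative_sum: running total of avg_salary.
cumulative = list(accumulate(avg_salary))

# moving_fn with MovingFunctions.min over the previous WINDOW buckets.
WINDOW = 10
moving_min = [min(avg_salary[max(0, i - WINDOW):i]) if i > 0 else None
              for i in range(len(avg_salary))]

for row in zip(ages, avg_salary, derivative, cumulative, moving_min):
    print(row)
```

    Because the script in the last request calls MovingFunctions.min, the aggregation named moving_avg_salary actually reports a moving minimum, not a moving average.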
    

    Related articles

    Metric Aggregations
    Bucket Aggregations

  • Original article: https://www.cnblogs.com/shuiyj/p/13185078.html