
    Elasticsearch 7.6 Study Notes 1: Getting Started with Elasticsearch

    Preface

    The Chinese translation of the Definitive Guide only covers 2.x, but Elasticsearch is already at 7.6, so let's install the latest version and learn from that.

    Installation

    This is a setup for learning; a production deployment follows a different set of rules.

    Windows

    Elasticsearch download:

    https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-windows-x86_64.zip
    

    Kibana download:

    https://artifacts.elastic.co/downloads/kibana/kibana-7.6.0-windows-x86_64.zip
    

    The latest official release is 7.6.0, but the download speed from the official site is painfully slow. With a download manager such as Xunlei it can reach several MB/s.

    bin\elasticsearch.bat
    bin\kibana.bat
    

    Double-click each .bat file to start.

    Docker install

    For testing and learning, the official Docker image is quicker and more convenient.

    See: https://www.cnblogs.com/woshimrf/p/docker-es7.html

    The following is based on:

    https://www.elastic.co/guide/en/elasticsearch/reference/7.6/getting-started.html

    Index some documents

    This walkthrough uses Kibana directly, but you can also hit localhost:9200 with curl or Postman.

    Open localhost:5601 and click Dev Tools.

    Create a customer index

    PUT /{index-name}/_doc/{id}

    PUT /customer/_doc/1
    {
      "name": "John Doe"
    }
    

    PUT is the HTTP method. If the index customer does not exist in Elasticsearch, it is created and a document with id=1 and name=John Doe is inserted.
    If the document already exists, it is updated. Note that this update is a full overwrite: whatever the body JSON contains becomes the final state of the document.

    The response:

    {
      "_index" : "customer",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 7,
      "result" : "updated",
      "_shards" : {
        "total" : 2,
        "successful" : 2,
        "failed" : 0
      },
      "_seq_no" : 6,
      "_primary_term" : 1
    }
    
    
    • _index is the index name
    • _type is always _doc
    • _id is the document's primary key, i.e. the pk of a record
    • _version is the number of times this _id has been written; here it has already been updated 7 times
    • _shards is the shard-level result. We deployed two nodes here, and both writes succeeded.

    In Kibana, Settings → Index Management shows the status of each index. This record, for instance, has a primary and a replica shard.

    Once saved, the record can be read back immediately:

    GET /customer/_doc/1
    

    Response:

    {
      "_index" : "customer",
      "_type" : "_doc",
      "_id" : "1",
      "_version" : 15,
      "_seq_no" : 14,
      "_primary_term" : 1,
      "found" : true,
      "_source" : {
        "name" : "John Doe"
      }
    }
    
    
    • _source is the content of our record
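    The overwrite-vs-merge distinction above can be sketched with a toy in-memory store. The helper names are made up for illustration; this is not the Elasticsearch client API:

```python
# Toy simulation: PUT /{index}/_doc/{id} fully replaces the document,
# while a partial update (POST /{index}/_update/{id}) merges fields in.

def index_doc(store, doc_id, body):
    """Full-replace semantics: the new body becomes the whole document."""
    store[doc_id] = dict(body)

def update_doc(store, doc_id, partial):
    """Merge semantics: only the given fields are changed."""
    store.setdefault(doc_id, {}).update(partial)

store = {}
index_doc(store, "1", {"name": "John Doe", "age": 30})
index_doc(store, "1", {"name": "Jane Doe"})   # "age" is gone: full replace
update_doc(store, "1", {"age": 25})           # merge keeps "name"
print(store["1"])  # {'name': 'Jane Doe', 'age': 25}
```

    This is why repeating the PUT with a smaller body silently drops fields: the body is the document.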

    Bulk insert

    With many documents to insert, we can index them in bulk: download a prepared data file and import it into Elasticsearch over HTTP.

    Create an index named bank. The number of shards cannot be changed after an index is created (the replica count can be adjusted later), so configure shards up front. Here we configure 3 shards and 2 replicas.

    PUT /bank
    {
      "settings": {
        "index": {
          "number_of_shards": "3",
          "number_of_replicas": "2"
        }
      }
    }
    

    Data file: https://gitee.com/mirrors/elasticsearch/raw/master/docs/src/test/resources/accounts.json

    After downloading it, send the file with curl (or Postman):

    curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
    curl "localhost:9200/_cat/indices?v"
    
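    For reference, the _bulk body is NDJSON: one action/metadata line followed by one source line per document, and the whole body must end with a newline. A minimal sketch of building such a payload (to_bulk_ndjson is a hypothetical helper, not part of any client):

```python
import json

def to_bulk_ndjson(index, docs):
    """Build an NDJSON _bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # action/metadata line, then the document source line
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"   # _bulk requires a trailing newline

payload = to_bulk_ndjson("bank", [("1", {"account_number": 1, "firstname": "Amber"})])
print(payload)
```

    The payload could then be POSTed to localhost:9200/bank/_bulk with Content-Type: application/json, as the curl command above does with --data-binary.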

    Each record looks like this:

    {
      "_index": "bank",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "_score": 0,
      "_source": {
        "account_number": 1,
        "balance": 39225,
        "firstname": "Amber",
        "lastname": "Duke",
        "age": 32,
        "gender": "M",
        "address": "880 Holmes Lane",
        "employer": "Pyrami",
        "email": "amberduke@pyrami.com",
        "city": "Brogan",
        "state": "IL"
      }
    }
    

    In Kibana Monitoring, enable self monitoring, then find the bank index under Indices to see how the imported data is distributed.

    You can see 3 shards spread across different nodes, each with 2 replicas.

    Start querying

    With some data bulk-inserted, we can start learning queries. As noted above, the data is a set of bank accounts; let's fetch all of them, sorted by account number.

    The equivalent SQL:

    select * from bank order by account_number asc limit 3
    

    Query DSL

    
    GET /bank/_search
    {
      "query": { "match_all": {} },
      "sort": [
        { "account_number": "asc" }
      ],
      "size": 3,
      "from": 2
    }
    
    • _search means search
    • query is the query clause; here it matches everything
    • size is the number of hits to return per request, i.e. the page size. It defaults to 10 if omitted; the matching documents appear in the response's hits array.
    • from is the zero-based offset of the first hit to return
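    The from/size paging can be wrapped in a tiny helper (illustrative only, not a function of any Elasticsearch client):

```python
def page_params(page, per_page=10):
    """Map a 1-based page number to _search's from/size parameters."""
    # "from" is the zero-based offset of the first hit; "size" the page length
    return {"from": (page - 1) * per_page, "size": per_page}

print(page_params(1, 3))  # {'from': 0, 'size': 3}
print(page_params(3, 3))  # {'from': 6, 'size': 3}
```

    So the request above, with from=2 and size=3, skips the first two accounts and returns the next three.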

    The response:

    
    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1000,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : null,
            "_source" : {
              "account_number" : 2,
              "balance" : 28838,
              "firstname" : "Roberta",
              "lastname" : "Bender",
              "age" : 22,
              "gender" : "F",
              "address" : "560 Kingsway Place",
              "employer" : "Chillium",
              "email" : "robertabender@chillium.com",
              "city" : "Bennett",
              "state" : "LA"
            },
            "sort" : [
              2
            ]
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : null,
            "_source" : {
              "account_number" : 3,
              "balance" : 44947,
              "firstname" : "Levine",
              "lastname" : "Burks",
              "age" : 26,
              "gender" : "F",
              "address" : "328 Wilson Avenue",
              "employer" : "Amtap",
              "email" : "levineburks@amtap.com",
              "city" : "Cochranville",
              "state" : "HI"
            },
            "sort" : [
              3
            ]
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : null,
            "_source" : {
              "account_number" : 4,
              "balance" : 27658,
              "firstname" : "Rodriquez",
              "lastname" : "Flores",
              "age" : 31,
              "gender" : "F",
              "address" : "986 Wyckoff Avenue",
              "employer" : "Tourmania",
              "email" : "rodriquezflores@tourmania.com",
              "city" : "Eastvale",
              "state" : "HI"
            },
            "sort" : [
              4
            ]
          }
        ]
      }
    }
    
    
    
    

    The response provides the following information:

    • took: how long the query took, in milliseconds
    • timed_out: whether the search timed out
    • _shards: how many shards were searched and how many succeeded, failed, or were skipped. A shard is simply a slice of an index's data, roughly like partitioning a table by id.
    • max_score: the score of the most relevant document

    Next, let's try queries with conditions.

    Analyzed match query

    Find addresses containing mill or lane:

    GET /bank/_search
    {
      "query": { "match": { "address": "mill lane" } },
      "size": 2
    }
    

    Response:

    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 19,
          "relation" : "eq"
        },
        "max_score" : 9.507477,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "136",
            "_score" : 9.507477,
            "_source" : {
              "account_number" : 136,
              "balance" : 45801,
              "firstname" : "Winnie",
              "lastname" : "Holland",
              "age" : 38,
              "gender" : "M",
              "address" : "198 Mill Lane",
              "employer" : "Neteria",
              "email" : "winnieholland@neteria.com",
              "city" : "Urie",
              "state" : "IL"
            }
          },
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "970",
            "_score" : 5.4032025,
            "_source" : {
              "account_number" : 970,
              "balance" : 19648,
              "firstname" : "Forbes",
              "lastname" : "Wallace",
              "age" : 28,
              "gender" : "M",
              "address" : "990 Mill Road",
              "employer" : "Pheast",
              "email" : "forbeswallace@pheast.com",
              "city" : "Lopezo",
              "state" : "AK"
            }
          }
        ]
      }
    }
    
    
    • We asked for 2 results, but 19 documents actually matched

    Exact phrase match

    GET /bank/_search
    {
      "query": { "match_phrase": { "address": "mill lane" } }
    }
    

    Now only one document matches the full phrase:

    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 9.507477,
        "hits" : [
          {
            "_index" : "bank",
            "_type" : "_doc",
            "_id" : "136",
            "_score" : 9.507477,
            "_source" : {
              "account_number" : 136,
              "balance" : 45801,
              "firstname" : "Winnie",
              "lastname" : "Holland",
              "age" : 38,
              "gender" : "M",
              "address" : "198 Mill Lane",
              "employer" : "Neteria",
              "email" : "winnieholland@neteria.com",
              "city" : "Urie",
              "state" : "IL"
            }
          }
        ]
      }
    }
    
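    The contrast between match and match_phrase can be mimicked with a toy whitespace tokenizer. Real analyzers also handle lowercasing rules, punctuation, and term positions, but the OR-over-terms versus adjacent-in-order distinction is the same:

```python
def match(text, query):
    """Toy "match": true if ANY query term appears in the text (OR)."""
    tokens = text.lower().split()
    return any(term in tokens for term in query.lower().split())

def match_phrase(text, query):
    """Toy "match_phrase": terms must appear adjacent and in order."""
    tokens, terms = text.lower().split(), query.lower().split()
    return any(tokens[i:i + len(terms)] == terms
               for i in range(len(tokens) - len(terms) + 1))

print(match("990 Mill Road", "mill lane"))         # True: "mill" alone matches
print(match_phrase("990 Mill Road", "mill lane"))  # False: no adjacent "mill lane"
print(match_phrase("198 Mill Lane", "mill lane"))  # True
```

    This is why "990 Mill Road" scored lower in the match query above but disappears entirely under match_phrase.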

    Combining conditions

    Real queries usually combine several conditions:

    GET /bank/_search
    {
      "query": {
        "bool": {
          "must": [
            { "match": { "age": "40" } }
          ],
          "must_not": [
            { "match": { "state": "ID" } }
          ]
        }
      }
    }
    
    • bool combines multiple query clauses
    • must, should, and must_not are the sub-clauses of a boolean query; must and should contribute to the relevance score, and results are sorted by score by default
    • must_not acts as a filter: it affects which documents are returned but not their score; it only excludes documents from the results
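    When building queries from code, the same bool query can be assembled as a plain dict before serializing it to JSON. bool_query below is a hypothetical convenience helper, not part of the Elasticsearch API:

```python
def bool_query(must=(), must_not=(), filter_=()):
    """Assemble a bool query body, omitting empty clause lists."""
    clauses = {}
    if must:
        clauses["must"] = list(must)
    if must_not:
        clauses["must_not"] = list(must_not)
    if filter_:
        clauses["filter"] = list(filter_)
    return {"query": {"bool": clauses}}

q = bool_query(must=[{"match": {"age": "40"}}],
               must_not=[{"match": {"state": "ID"}}])
print(q)
```

    Serialized, q is exactly the request body shown above and could be sent to /bank/_search with any HTTP client.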

    You can also explicitly use filters to include or exclude documents based on structured data.

    For example, accounts with a balance between 20000 and 30000:

    GET /bank/_search
    {
      "query": {
        "bool": {
          "must": { "match_all": {} },
          "filter": {
            "range": {
              "balance": {
                "gte": 20000,
                "lte": 30000
              }
            }
          }
        }
      }
    }
    
    

    Aggregations: group by

    Count the number of accounts per state.

    In SQL this might be:

    select state, count(*) from tbl_bank group by state order by count(*) desc limit 3;
    

    The corresponding Elasticsearch request:

    
    GET /bank/_search
    {
      "size": 0,
      "aggs": {
        "group_by_state": {
          "terms": {
            "field": "state.keyword",
            "size": 3
          }
        }
      }
    }
    
    • size=0 suppresses the hits in the response; Elasticsearch would otherwise return the matching documents, and we only want the aggregation values
    • aggs is the aggregation keyword
    • group_by_state names this aggregation result; the name is user-defined
    • terms buckets documents by the exact value of a field; this is the group-by field
    • state.keyword: state is a text field, and to group or aggregate on a string field you must use its keyword sub-field
    • size=3 inside terms limits the number of buckets returned, here the top 3 (default top 10, capped at 10000, adjustable via search.max_buckets). Note that multiple shards introduce accuracy issues, explored below.

    The response:

    {
      "took" : 5,
      "timed_out" : false,
      "_shards" : {
        "total" : 3,
        "successful" : 3,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1000,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "group_by_state" : {
          "doc_count_error_upper_bound" : 26,
          "sum_other_doc_count" : 928,
          "buckets" : [
            {
              "key" : "MD",
              "doc_count" : 28
            },
            {
              "key" : "ID",
              "doc_count" : 23
            },
            {
              "key" : "TX",
              "doc_count" : 21
            }
          ]
        }
      }
    }
    
    
    
    • hits lists the documents matching the query; since size=0 it is [], and total shows the query matched 1000 documents
    • aggregations holds the aggregation results
    • group_by_state is the name we chose in the query
    • doc_count_error_upper_bound is an upper bound on counts that may have been missed by this aggregation. As the name ("upper bound") suggests, it is a worst-case estimate of how much could have been left out of the final buckets. The larger it is, the more likely the result is inaccurate; 0 means the counts are exact, but a non-zero value does not necessarily mean they are wrong.
    • sum_other_doc_count is the number of documents not counted in the returned buckets

    So is this top 3 accurate? doc_count_error_upper_bound is non-zero, meaning the counts may well be off, and the top 3 buckets hold 28, 23, and 21 documents. Let's add another parameter and compare:

    GET /bank/_search
    {
      "size": 0,
      "aggs": {
        "group_by_state": {
          "terms": {
            "field": "state.keyword",
            "size": 3,
            "shard_size":  60
          }
        }
      }
    }
    -----------------------------------------
      "aggregations" : {
        "group_by_state" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 915,
          "buckets" : [
            {
              "key" : "TX",
              "doc_count" : 30
            },
            {
              "key" : "MD",
              "doc_count" : 28
            },
            {
              "key" : "ID",
              "doc_count" : 27
            }
          ]
        }
      }
    
    • shard_size is the number of buckets each shard computes. An aggregation runs on each shard separately and the per-shard results are then merged; because the data is not evenly distributed, each shard's top N differs, so the merged result can undercount some buckets, which is why doc_count_error_upper_bound was non-zero. The default shard_size is size * 1.5 + 10, i.e. 14.5 for size=3 (rounded down to 14), and indeed passing shard_size=14 returns the same result as omitting it. With shard_size=60 the error finally drops to 0, guaranteeing these 3 really are the top 3. In short, for terms aggregations set shard_size as large as you can afford, e.g. 20x the size.
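    The default shard_size formula above can be written out as a quick sanity check (treat the exact rounding Elasticsearch applies internally as an implementation detail):

```python
def default_shard_size(size):
    """Default per-shard bucket count for a terms aggregation: size * 1.5 + 10."""
    return size * 1.5 + 10

print(default_shard_size(3))   # 14.5 -> effectively 14 buckets per shard
print(default_shard_size(10))  # 25.0 for the default size=10
```

    Raising shard_size trades extra work per shard for a more accurate global top-N after the merge.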

    Group by state with average balance

    To see each state's average balance, the SQL might be:

    select
      state, avg(balance) AS average_balance, count(*)
    from tbl_bank
    group by state
    limit 3
    

    In Elasticsearch:

    GET /bank/_search
    {
      "size": 0,
      "aggs": {
        "group_by_state": {
          "terms": {
            "field": "state.keyword",
            "size": 3,
            "shard_size":  60
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            },
            "sum_balance": {
              "sum": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
    
    • the second, nested aggs computes per-bucket metrics for each state
    • average_balance is a user-defined name whose value is the avg of balance within the bucket
    • sum_balance is a user-defined name whose value is the sum of balance within the bucket
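    What the nested avg/sum sub-aggregations compute can be replayed over plain Python rows (the sample values below are made up, not taken from accounts.json):

```python
from collections import defaultdict

def group_stats(rows, key, field):
    """Per-group doc count, sum, and average of `field`, keyed by `key`."""
    acc = defaultdict(lambda: [0.0, 0])   # group -> [running sum, count]
    for row in rows:
        bucket = acc[row[key]]
        bucket[0] += row[field]
        bucket[1] += 1
    return {k: {"doc_count": n,
                "sum_balance": total,
                "average_balance": total / n}
            for k, (total, n) in acc.items()}

rows = [{"state": "TX", "balance": 100}, {"state": "TX", "balance": 300},
        {"state": "MD", "balance": 200}]
stats = group_stats(rows, "state", "balance")
print(stats["TX"])  # {'doc_count': 2, 'sum_balance': 400.0, 'average_balance': 200.0}
```

    Elasticsearch does the same computation per shard and merges the partial sums and counts, which is why the metric values (unlike the terms bucket membership) merge exactly.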

    The result:

    {
      "took" : 12,
      "timed_out" : false,
      "_shards" : {
        "total" : 3,
        "successful" : 3,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1000,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "group_by_state" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 915,
          "buckets" : [
            {
              "key" : "TX",
              "doc_count" : 30,
              "sum_balance" : {
                "value" : 782199.0
              },
              "average_balance" : {
                "value" : 26073.3
              }
            },
            {
              "key" : "MD",
              "doc_count" : 28,
              "sum_balance" : {
                "value" : 732523.0
              },
              "average_balance" : {
                "value" : 26161.535714285714
              }
            },
            {
              "key" : "ID",
              "doc_count" : 27,
              "sum_balance" : {
                "value" : 657957.0
              },
              "average_balance" : {
                "value" : 24368.777777777777
              }
            }
          ]
        }
      }
    }
    
    

    Group by state, sorted by average balance

    A terms aggregation orders buckets by document count, descending, by default. To order by something else, the SQL might be:

    select
      state, avg(balance) AS average_balance, count(*)
    from tbl_bank
    group by state
    order by average_balance desc
    limit 3
    

    In Elasticsearch:

    GET /bank/_search
    {
      "size": 0,
      "aggs": {
        "group_by_state": {
          "terms": {
            "field": "state.keyword",
            "order": {
              "average_balance": "desc"
            },
            "size": 3
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
    

    The top 3 is now quite different from before:

      "aggregations" : {
        "group_by_state" : {
          "doc_count_error_upper_bound" : -1,
          "sum_other_doc_count" : 983,
          "buckets" : [
            {
              "key" : "DE",
              "doc_count" : 2,
              "average_balance" : {
                "value" : 39040.5
              }
            },
            {
              "key" : "RI",
              "doc_count" : 5,
              "average_balance" : {
                "value" : 36035.4
              }
            },
            {
              "key" : "NE",
              "doc_count" : 10,
              "average_balance" : {
                "value" : 35648.8
              }
            }
          ]
        }
      }
    


    Original post: https://www.cnblogs.com/woshimrf/p/es7-start.html