  • Elasticsearch Notes

    A quick comparison of Elasticsearch and relational databases

    RDBMS     Elasticsearch
    Table     Index (Type)
    Row       Document
    Column    Field
    Schema    Mapping
    SQL       DSL
    ## Index information
    GET kibana_sample_data_ecommerce
    
    ## Total document count
    GET kibana_sample_data_ecommerce/_count
    
    ## _cat indices API
    ## Wildcard match
    GET /_cat/indices/kibana_*
    ## Sort by document count
    GET /_cat/indices?v&s=docs.count:desc
    ## Basic information about the index
    GET /_cat/indices/kibana_sample_data_ecommerce?v
    

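    The default cluster name is elasticsearch.

    Shards come in two kinds: Primary Shard and Replica Shard.

    The number of primary shards is set when the index is created and cannot be changed afterwards, except by reindexing.

    The number of replica shards can be adjusted dynamically.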
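
    Why the primary shard count is fixed becomes clearer with a small sketch of document routing. Elasticsearch picks the shard from a hash of the routing key (by default the document _id) modulo the number of primary shards. The code below is only an illustration: `pick_shard` is a hypothetical helper and `zlib.crc32` stands in for the Murmur3 hash that Elasticsearch uses.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key (by default the document _id) to a primary shard."""
    # Assumption: crc32 is a stand-in hash for this sketch, not the real one.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]
with_5 = {d: pick_shard(d, 5) for d in doc_ids}
with_6 = {d: pick_shard(d, 6) for d in doc_ids}

# Count the documents that would have to live on a different shard
# if the primary shard count changed from 5 to 6.
moved = sum(1 for d in doc_ids if with_5[d] != with_6[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

    Most documents land on a different shard once the divisor changes, which is why a new shard count requires reindexing. Replicas are plain copies of a primary, so their number does not enter the formula.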

    ## Cluster health
    GET _cluster/health
    
    GET _cat/nodes?v
    GET _cat/shards?v
    
    index                        shard prirep state   docs   store ip         node
    .apm-agent-configuration     0     p      STARTED    0    208b 172.18.0.2 12b52a46e43f
    .kibana_1                    0     p      STARTED   94 967.7kb 172.18.0.2 12b52a46e43f
    kibana_sample_data_ecommerce 0     p      STARTED 4675   4.5mb 172.18.0.2 12b52a46e43f
    .apm-custom-link             0     p      STARTED    0    208b 172.18.0.2 12b52a46e43f
    .kibana_task_manager_1       0     p      STARTED    5  55.2kb 172.18.0.2 12b52a46e43f
    

    Basic CRUD

    ## Auto-generated id
    POST my_index/_doc/
    {
      "user":"xiaoting",
      "comment":"you know for search"
    }
    
    ## User-specified id; repeated PUTs increment the version
    PUT my_index/_doc/2
    {
      "user":"xiaoting",
      "comment":"you know for search"
    }
    
    ## Read
    GET my_index/_doc/2
    
    ## Search
    GET my_index/_search
    {
      "query":{
        "match_all":{}
      }
    }
    
    ## Add a field to an existing document; with PUT every field must be supplied again, otherwise the omitted fields are lost
    POST my_index/_update/2
    {
      "doc":{
        "post_date":"2020-05-21"
      }
    }
    
    ## Delete
    DELETE my_index/_doc/2
    
    ## Batch read
    GET _mget
    {
      "docs": [
        {
          "_index": "my_index",
          "_id": 1
        },
        {
          "_index": "my_index",
          "_id": 2
        }
      ]
    }
    

    Inverted index

    Forward index: like a book's table of contents

    Inverted index: like the index pages at the back of a book
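
    The difference between the two is easy to see in a few lines of Python. This is a toy sketch: the three documents are made up for illustration and the text is split on whitespace only, with no real analysis.

```python
from collections import defaultdict

# Hypothetical sample documents: document id -> text
docs = {
    1: "you know for search",
    2: "you know for sure",
    3: "search for documents",
}

# Forward index: document id -> the terms it contains
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> ids of the documents that contain it
inverted_index = defaultdict(list)
for doc_id, terms in forward_index.items():
    for term in sorted(set(terms)):
        inverted_index[term].append(doc_id)

print(inverted_index["search"])  # [1, 3]
print(inverted_index["know"])    # [1, 2]
```

    Looking up a term is a single dictionary access in the inverted index, whereas the forward index would have to be scanned document by document.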

    Analyzers (Analysis)

    An analyzer is made up of three parts:

    Character Filters, Tokenizer, Token Filters

    ## Analyze text with a named analyzer
    GET /_analyze
    {
      "analyzer": "standard",
      "text": "liuchenglong is a student"
    }
    
    ## Analyze against a field of an index, to see which tokens that field's analyzer produces
    GET my_index/_analyze
    {
      "field": "user",
      "text": "xiaoting"
    }
    
    ## Analyze with a custom combination of tokenizer and filters
    GET /_analyze
    {
      "tokenizer": "standard",
      "filter": [
        "lowercase"
      ],
      "text": "liuchenglong is a student"
    }
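
    The three stages can be pictured as a small function pipeline. The sketch below only imitates the idea: `html_strip`, `standard_like_tokenizer` and `lowercase_filter` are hypothetical Python stand-ins, and their regular expressions are far simpler than the real Elasticsearch components of similar names.

```python
import re


def html_strip(text):
    """Character filter: works on the raw string before tokenization."""
    return re.sub(r"<[^>]+>", "", text)


def standard_like_tokenizer(text):
    """Tokenizer: splits the string into tokens (here: runs of word characters)."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Token filter: transforms the token stream."""
    return [t.lower() for t in tokens]


def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>Liuchenglong is a Student</b>",
    char_filters=[html_strip],
    tokenizer=standard_like_tokenizer,
    token_filters=[lowercase_filter],
)
print(tokens)  # ['liuchenglong', 'is', 'a', 'student']
```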
    

    Standard Analyzer is the default analyzer.

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "Liuchenglong in the house"
    }
    
    GET /_analyze
    {
      "analyzer": "simple",
      "text": "Liuchenglong in the house"
    }
    
    GET /_analyze
    {
      "analyzer": "whitespace",
      "text": "Liuchenglong in the house"
    }
    
    GET /_analyze
    {
      "analyzer": "stop",
      "text": "Liuchenglong in the house"
    }
    
    GET /_analyze
    {
      "analyzer": "keyword",
      "text": "Liuchenglong in the house"
    }
    
    GET /_analyze
    {
      "analyzer": "pattern",
      "text": "Liuchenglong in the house"
    }
    
    GET /_analyze
    {
      "analyzer": "english",
      "text": "Liuchenglong in the house"
    }
    
    ## ik, a Chinese analyzer plugin (must be downloaded and installed separately)
    GET /_analyze
    {
      "analyzer": "ik_max_word",
      "text": "江苏省无锡市滨湖区溪北新村"
    }
    
    GET /_analyze
    {
      "analyzer": "ik_smart",
      "text": "江苏省无锡市滨湖区溪北新村"
    }
    

    Search API

    1. URL Search: use q to pass the query string

    2. Request Body Search: use GET or POST and write the query in the request body with the Elasticsearch DSL

    /_search
    /index1/_search
    /index1,index2/_search
    /index*/_search
    

    URL Search

    ## q is the query text, df is the default field to search
    GET my_index/_search?q=chenglong&df=user
    GET my_index/_search?q=user:chenglong
    
    ## Add "profile": true to see how the query was executed
    GET my_index/_search?q=chenglong&df=user
    {
      "profile": true
    }
    
    ## PhraseQuery
    GET my_index/_search?q=comment:"you know"
    ## BooleanQuery
    GET my_index/_search?q=comment:you know
    ## Term queries: group the terms with ()
    GET my_index/_search?q=comment:(you know)
    ## "comment:you comment:and comment:know" (lowercase and is treated as a term; the operators are uppercase AND, OR, NOT)
    GET my_index/_search?q=comment:(you and know)
    ## "comment:you comment:not comment:know"
    GET my_index/_search?q=comment:(you not know)
    ## "comment:you +comment:know"   %2B is the URL-encoded + sign
    GET my_index/_search?q=comment:(you %2Bknow)
    ## Range query
    GET my_index/_search?q=year:>2020
    ## Wildcard query
    GET my_index/_search?q=user:ch*
    ## Fuzzy match; this matches chenglong
    GET my_index/_search?q=user:chengleng~1
    ## Matches you know for search
    GET my_index/_search?q=comment:"you for"~2
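
    The number after ~ in a fuzzy term is a maximum edit distance. The sketch below computes the classic Levenshtein distance to show why chengleng~1 reaches chenglong. It is only an approximation of what Elasticsearch does: as far as I know, fuzzy matching there also counts a swap of two adjacent characters as a single edit by default, which this function does not.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("chengleng", "chenglong"))  # 1
```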
    

    Request Body Search

    ## Pagination
    GET my_index/_search
    {
      "query": {
        "match_all": {}
      },
      "from": 0,
      "size": 20
    }
    
    ## Sort by a given field
    GET my_index/_search
    {
      "query": {
        "match_all": {}
      },
      "sort": [
        {"_score": {"order": "desc"}}
      ]
    }
    
    ## Return only the specified fields
    GET my_index/_search
    {
      "query": {
        "match_all": {}
      },
      "_source": ["user"]
    }
    
    ## matchQuery TermQuery
    GET my_index/_search
    {
      "query": {
        "match": {
          "user":"Chenglong"
        }
      }
    }
    
    ## Specify the operator
    GET my_index/_search
    {
      "query": {
        "match": {
          "user":{
            "query": "Chenglong",
            "operator": "and"
          }
        }
      }
    }
    
    ## match_phrase accepts a slop, the number of positions the terms may be moved apart; the query below matches you know for search
    GET my_index/_search
    {
      "query": {
        "match_phrase": {
          "comment":{
            "query": "you for",
            "slop": 1
          }
        }
      }
    }
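
    A simplified model of slop helps explain why "you for" needs a slop of 1 here. In the sketch below, `phrase_matches` is a hypothetical helper: it shifts each matched position back by the term's offset in the query, and the spread of the shifted positions is the slop required. Real sloppy-phrase matching handles repeated terms and scoring as well, which this ignores.

```python
from itertools import product


def phrase_matches(doc_tokens, query_tokens, slop=0):
    """Return True if the query terms occur within `slop` position moves."""
    positions = []
    for term in query_tokens:
        found = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not found:
            return False
        positions.append(found)
    for combo in product(*positions):
        # For an exact phrase every shifted position is identical (spread 0).
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


doc = "you know for search".split()
print(phrase_matches(doc, ["you", "for"], slop=0))  # False
print(phrase_matches(doc, ["you", "for"], slop=1))  # True
```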
    

    Script fields

    GET my_index/_search
    {
      "query": {
        "match_all": {}
      },
      "script_fields": {
        "userName": {
          "script": {
            "lang": "painless",
            "source": "doc['user.keyword'].value + 's'"
          }
        }
      }
    }
    

    Mapping

    Roughly analogous to a schema definition in a database.

    • Simple types

    Text / Keyword

    Date

    Integer / Floating point

    Boolean

    IPv4 & IPv6

    • Complex types: objects and nested objects

    Object type / Nested type

    • Special types

    geo_point & geo_shape / percolator

    Dynamic Mapping

    When a document is written and the index does not exist yet, the index is created automatically.

    ## View the mapping
    GET my_index/_mapping
    

    Once a field exists its type cannot be changed; the index has to be rebuilt with the Reindex API.

    ## dynamic can be set when creating the index: true (the default), false, or strict
    PUT movies
    {
      "mappings": {
        "dynamic": "strict"
      }
    }
    

    Custom Mapping

    ## Create an index in which mobile is not indexed
    PUT movies
    {
      "mappings": {
        "properties": {
          "firstName": {
            "type": "text"
          },
          "lastName": {
            "type": "text"
          },
          "mobile": {
            "type": "text",
            "index": false
          }
        }
      }
    }
    
    ## Insert data
    PUT movies/_doc/1
    {
      "firstName": "Liu",
      "lastName": "Chenglong",
      "mobile": "1234567890"
    }
    
    ## Searching on the field fails with an error
    ## failed to create query: Cannot search on field [mobile] since it is not indexed.
    POST /movies/_search
    {
      "query": {
        "match": {
          "mobile": "123"
        }
      }
    }
    
    ## null_value (delete the existing movies index first, since PUT movies fails if it already exists)
    PUT movies
    {
      "mappings": {
        "properties": {
          "firstName": {
            "type": "text"
          },
          "lastName": {
            "type": "text"
          },
          "mobile": {
            "type": "keyword",
            "null_value": "NULL"
          }
        }
      }
    }
    
    PUT movies/_doc/1
    {
      "firstName": "Liu",
      "lastName": "Chenglong",
      "mobile": null
    }
    
    PUT movies/_doc/2
    {
      "firstName": "Liu",
      "lastName": "Chenglong2"
    }
    
    ## Finds documents whose mobile is null, but not documents that have no mobile field at all
    POST /movies/_search
    {
      "query": {
        "match": {
          "mobile": "NULL"
        }
      }
    }
    
    ## copy_to (again, delete the existing movies index first)
    PUT movies
    {
      "mappings": {
        "properties": {
          "firstName": {
            "type": "text",
            "copy_to": "fullName"
          },
          "lastName": {
            "type": "text",
            "copy_to": "fullName"
          }
        }
      }
    }
    
    PUT movies/_doc/1
    {
      "firstName": "Liu",
      "lastName": "Chenglong"
    }
    
    ## fullName can be queried directly even though the documents do not contain this field
    ## _source does not contain fullName either
    POST movies/_search
    {
      "query": {
        "match": {
          "fullName": "chenglong"
        }
      }
    }
    

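    Elasticsearch has no dedicated array type: any field can hold several values of the same type, so a field mapped as text also accepts an array of strings.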

    PUT movies/_doc/1
    {
      "firstName": "Liu",
      "lastName": "Chenglong"
    }
    
    PUT movies/_doc/3
    {
      "firstName": "Liu",
      "lastName": ["Chenglong"]
    }
    

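    Multi-fields

    • Exact matching on names

    Add a keyword sub-field

    • Using different analyzers

    Exact Value (no analysis needed)

    Includes dates, numbers, and specific strings (Apple Store)

    Full Text

    text fields in Elasticsearch

    Character Filters

    Process the text before the Tokenizer, for example by adding, removing, or replacing characters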

    ## Strips HTML tags from the text; useful for data collected by web crawlers
    GET _analyze
    {
      "tokenizer": "keyword",
      "char_filter": [
        "html_strip"
      ],
      "text": "<b>hello world</b>"
    }
    
    ## Replace characters
    GET _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        {
          "type": "mapping",
          "mappings": [
            "- => _"
          ]
        }
      ],
      "text": "hello-world"
    }
    
    ## Tokenize by path
    GET _analyze
    {
      "tokenizer": "path_hierarchy",
      "text": "user/local/nginx/conf"
    }
    
    ## Split on whitespace and filter out stop words
    ## Only You and house. remain
    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["stop"], 
      "text": "You are in the house."
    }
    
    ## Adding a lowercase filter turns the tokens into lowercase
    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        "stop",
        "lowercase"
      ],
      "text": "You are in the house."
    }
    

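    Aggregations

    Bucket: a set of documents that meet some criterion

    Metric: mathematical computations

    Pipeline: aggregates the results of other aggregations

    Matrix: operates on multiple fields and returns a result matrix

    Bucket is somewhat like GROUP BY in SQL

    Metric is somewhat like the aggregate functions in SQL

    ## Count by gender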
    GET kibana_sample_data_ecommerce/_search
    {
      "size": 0,
      "aggs": {
        "flight_dest": {
          "terms": {
            "field": "customer_gender"
          }
        }
      }
    }
    
    ## Result
    "buckets" : [
      {
        "key" : "FEMALE",
        "doc_count" : 2433
      },
      {
        "key" : "MALE",
        "doc_count" : 2242
      }
    ]
    
    ## Aggregate further within each bucket
    GET kibana_sample_data_ecommerce/_search
    {
      "size": 0,
      "aggs": {
        "flight_dest": {
          "terms": {
            "field": "day_of_week"
          },
          "aggs": {
            "avg_price": {
              "avg": {
                "field": "products.base_price"
              }
            }
          }
        }
      }
    }
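
    The bucket-then-metric idea maps directly onto a group-by followed by an average. The sketch below reproduces the shape of a terms aggregation with a nested avg in plain Python. The `orders` list is invented sample data with one price per order, unlike the real sample index where each order can carry several product prices.

```python
from collections import defaultdict

# Hypothetical, simplified orders (one price each)
orders = [
    {"day_of_week": "Monday", "base_price": 20.0},
    {"day_of_week": "Monday", "base_price": 30.0},
    {"day_of_week": "Tuesday", "base_price": 12.0},
]

# Bucket step: group documents by the value of day_of_week
buckets = defaultdict(list)
for order in orders:
    buckets[order["day_of_week"]].append(order["base_price"])

# Metric step: compute an average inside every bucket
result = [
    {"key": key, "doc_count": len(prices), "avg_price": sum(prices) / len(prices)}
    for key, prices in buckets.items()
]
# A terms aggregation orders its buckets by doc_count, largest first, by default
result.sort(key=lambda bucket: bucket["doc_count"], reverse=True)

for bucket in result:
    print(bucket)
```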
    

    Queries

    Term is the smallest unit that carries meaning

    ## Add some documents
    POST /product/_doc/1
    {
      "productId":"XHDK-12-#f",
      "desc":"iPhone"
    }
    POST /product/_doc/2
    {
      "productId":"BHDK-22-#f",
      "desc":"iPad"
    }
    POST /product/_doc/3
    {
      "productId":"CHDK-32-#f",
      "desc":"MBP"
    }
    
    ## term does not analyze the search input, but the indexed text was analyzed (iPhone => iphone),
    ## so this query returns nothing; use "value": "iphone" to get a match
    POST /product/_search
    {
      "query": {
        "term": {
          "desc": {
            "value": "iPhone"
          }
        }
      }
    }
    
    ## This also finds the document
    POST /product/_search
    {
      "query": {
        "term": {
          "desc.keyword": {
            "value": "iPhone"
          }
        }
      }
    }
    
    ## Analysis output
    POST /_analyze
    {
      "analyzer": "standard",
      "text": ["iPhone"]
    }
    
    {
      "tokens" : [
        {
          "token" : "iphone",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "<ALPHANUM>",
          "position" : 0
        }
      ]
    }
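
    The behaviour above can be imitated with two small dictionaries, one for the analyzed text field and one for its keyword sub-field. This is a toy model: `standard_like_analyze`, `term_query` and `match_query` are hypothetical helpers, not Elasticsearch APIs.

```python
import re


def standard_like_analyze(text):
    """Very rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


docs = {1: "iPhone", 2: "iPad", 3: "MBP"}

text_index = {}     # desc: analyzed token -> document ids
keyword_index = {}  # desc.keyword: exact original value -> document ids
for doc_id, desc in docs.items():
    for token in standard_like_analyze(desc):
        text_index.setdefault(token, set()).add(doc_id)
    keyword_index.setdefault(desc, set()).add(doc_id)


def term_query(index, value):
    """term: look the value up exactly as given, without analysis."""
    return sorted(index.get(value, set()))


def match_query(index, text):
    """match: analyze the input first, then look up each resulting token."""
    hits = set()
    for token in standard_like_analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)


print(term_query(text_index, "iPhone"))     # []
print(term_query(text_index, "iphone"))     # [1]
print(term_query(keyword_index, "iPhone"))  # [1]
print(match_query(text_index, "iPhone"))    # [1]
```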
    
    ## Wrapping the query in a filter skips score calculation and avoids unnecessary overhead
    ## Filters can make effective use of caching, which speeds up repeated queries
    POST /product/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "term": {
              "desc.keyword": "iPhone"
            }
          },
          "boost": 1.2
        }
      }
    }
    

    Match Query / Match Phrase Query / Query String Query

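    Text is analyzed both at index time and at search time; at search time the query text is analyzed first and turned into a list of terms to look up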

    POST movies/_search
    {
      "query": {
        "match": {
          "name": "chenglong"
        }
      }
    }
    

    Structured search

    Dates, booleans, and numbers are all structured data

    They can be queried with Term and Prefix queries

    ## Add some data
    POST /product/_bulk
    { "index":{"_id":1}}
    {"price":10,"avaliable":true,"date":"2020-05-22","productId":"XXX-1","tag":"one"}
    { "index":{"_id":2}}
    {"price":20,"avaliable":false,"date":"2019-05-22","productId":"XXX-2","tag":["one","two"]}
    { "index":{"_id":3}}
    {"price":30,"avaliable":false,"productId":"XXX-3"}
    
    ## term query on a boolean
    POST /product/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "term": {
              "avaliable": true
            }
          }
        }
      }
    }
    
    ## range query on a number
    POST /product/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "range": {
              "price": {
                "gte": 10,
                "lte": 20
              }
            }
          }
        }
      }
    }
    
    ## range query on a date
    ## Date math units: y year, M month, w week, d day, H/h hour, m minute, s second
    POST /product/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "range": {
              "date": {
                "gte": "now-1y"
              }
            }
          }
        }
      }
    }
    
    ## Use exists to find documents that have the field
    POST /product/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "exists": {
              "field": "date"
            }
          }
        }
      }
    }
    
    ## On a multi-valued field, term means "contains", not "equals exactly"
    ## This returns both the document tagged one and the document tagged one, two
    POST /product/_search
    {
      "query": {
        "constant_score": {
          "filter": {
            "term": {
              "tag.keyword": "one"
            }
          }
        }
      }
    }
    
    ## To get only the document tagged one,
    ## add a tag_count field and combine it with a bool query
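
    One way to express that idea is sketched below as a request body built in Python. It assumes every document was indexed with an extra numeric tag_count field holding the number of tags. That field is not part of the sample data above and would have to be filled in by the application. `only_tag_query` is a hypothetical helper name.

```python
import json


def only_tag_query(tag: str) -> dict:
    """Body for: documents that contain `tag` and have exactly one tag."""
    return {
        "query": {
            "bool": {
                # filter clauses must match but do not contribute to the score
                "filter": [
                    {"term": {"tag.keyword": tag}},
                    {"term": {"tag_count": 1}},  # assumed helper field
                ]
            }
        }
    }


# The printed JSON can be used as the body of POST /product/_search
print(json.dumps(only_tag_query("one"), indent=2))
```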
    

    Relevance scoring

    TF-IDF

    BM25

    Add "explain": true to a query to see in the results how each score was computed
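
    For orientation, here is the textbook form of the BM25 score for a single term, using k1 = 1.2 and b = 0.75 (the default parameter values in Elasticsearch). It is a sketch of the formula only. The numbers in a real explain output may differ in constant factors, and all inputs below are made up.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, num_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term in one document."""
    # Rare terms get a higher inverse document frequency
    idf = math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer-than-average documents are penalised through length normalisation
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * freq * (k1 + 1) / (freq + norm)


# Same term frequency, different document lengths (hypothetical numbers)
short_doc = bm25_term_score(freq=1, doc_len=3, avg_doc_len=6, num_docs=2, docs_with_term=2)
long_doc = bm25_term_score(freq=1, doc_len=9, avg_doc_len=6, num_docs=2, docs_with_term=2)
print(short_doc > long_doc)  # True
```

    Unlike plain TF-IDF, the term-frequency part saturates: ten occurrences of a term score less than ten times a single occurrence.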

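    bool Query

    must: must match, contributes to the score

    should: optional match, contributes to the score

    must_not: must not match

    filter: must match, does not contribute to the score

    bool queries can be nested

    Changing the nesting structure affects the score

    ## boost adjusts the score of a clause
    ## Changing the boost of the tag and price clauses changes the order of the results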
    POST /product/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "tag": {
                  "query": "one",
                  "boost": 1
                }
              }
            },
            {
              "match": {
                "price": {
                  "query": "30",
                  "boost": 1
                }
              }
            }
          ]
        }
      }
    }
    
    ## boosting keeps documents that match positive and lowers the score of those that also match negative
    POST /product/_search
    {
      "query": {
        "boosting": {
          "positive": {
            "match": {
              "tag": "one"
            }
          },
          "negative": {
             "match": {
              "tag": "two"
            }
          },
          "negative_boost": 0.2
        }
      }
    }
    

    Single query string, multiple fields

    POST /product/_bulk
    { "index":{"_id":1}}
    {"title":"Quick brown rabbits","body":"Brown rabbits are commonly seen"}
    { "index":{"_id":2}}
    {"title":"Keeping pets healthy","body":"My quick brown fox eats rabbits on a regular basis"}
    
    POST /product/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "title": "Brown fox"
              }
            },
            {
              "match": {
                "body": "Brown fox"
              }
            }
          ]
        }
      }
    }
    
    POST /product/_search
    {
      "query": {
        "dis_max": {
          "queries": [
            {
              "match": {
                "title": "Quick fox"
              }
            },
            {
              "match": {
                "body": "Quick fox"
              }
            }
          ]
        }
      }
    }
    
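    ## If several documents come back with the same score, a tie_breaker coefficient can be added to tell them apart
    ## tie_breaker is a float between 0 and 1
    ## 0 means only the best-matching clause counts
    ## 1 means all clauses are equally important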
    POST /product/_search
    {
      "query": {
        "dis_max": {
          "queries": [
            {
              "match": {
                "title": "Quick pets"
              }
            },
            {
              "match": {
                "body": "Quick pets"
              }
            }
          ],
          "tie_breaker": 0.7
        }
      }
    }
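
    The difference between a bool should and dis_max comes down to how the per-clause scores are combined. The sketch below uses invented clause scores (2.0 for title, 1.0 for body). The function names are hypothetical and the bool case is reduced to a plain sum of the matching clauses.

```python
def bool_should_score(clause_scores):
    """bool/should: the scores of all matching clauses are added up."""
    return sum(clause_scores)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """dis_max: best clause score plus tie_breaker times the other scores."""
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


scores = [2.0, 1.0]  # hypothetical scores of the title and body clauses

print(bool_should_score(scores))                # 3.0
print(dis_max_score(scores))                    # 2.0
print(dis_max_score(scores, tie_breaker=0.5))   # 2.5
print(dis_max_score(scores, tie_breaker=1.0))   # 3.0
```

    With tie_breaker at 1 the result equals the plain sum, and with 0 only the best field counts, matching the two extremes described in the comments above.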
    

    multi_match query

    //LCLTODO I still don't fully understand this

    POST /product/_search
    {
      "query": {
        "multi_match": {
          "query": "brown",
          "fields": ["title","body"]
        }
      }
    }
    

    Chinese analyzers

    hanlp

    icu

    ik

    pinyin

    Search Template

    Decoupling: the query definition is kept separate from the code that runs it

    ## Create a search template
    POST _scripts/queryProduct
    {
      "script": {
        "lang": "mustache",
        "source": {
          "query": {
            "multi_match": {
              "query": "{{q}}",
              "fields": [
                "title"
              ]
            }
          }
        }
      }
    }
    
    GET _scripts/queryProduct
    
    ## Search using the template
    POST product/_search/template
    {
      "id":"queryProduct",
      "params": {
        "q":"pets"
      }
    }
    

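    Function Score Query

    After a query finishes, every matching document can be re-scored by a series of functions, and the results are sorted by the new score.

    The built-in scoring functions:

    1. Weight: sets a simple, non-normalized weight for each document

    2. Field Value Factor: uses the value of a field to modify _score

    3. Random Score

    4. Decay functions: take the value of a field as the reference; the closer it is to a given value, the higher the score

    5. Script Score: a custom script takes full control of the scoring logic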

    PUT shop/_doc/1
    {
      "title": "Apple pie",
      "price": 8
    }
    
    PUT shop/_doc/2
    {
      "title": "Orange pie",
      "price": 3
    }
    
    PUT shop/_doc/3
    {
      "title": "Watermelon pie",
      "price": 6
    }
    
    POST /shop/_search
    {
      "query": {
        "function_score": {
          "query": {
            "multi_match": {
              "query": "pie",
              "fields": "title"
            }
          },
          "field_value_factor": {
            "field": "price"
          }
        }
      }
    }
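
    With the default settings (factor 1, no modifier, boost_mode multiply) field_value_factor simply multiplies the query score by the field value. The sketch below models that, plus two of the available modifiers. It assumes, purely for illustration, that all three pies matched with the same query score of 1.0.

```python
import math

# A subset of the modifiers; log1p is the common (base 10) logarithm of 1 + value
MODIFIERS = {
    "none": lambda v: v,
    "log1p": lambda v: math.log10(1 + v),
    "sqrt": math.sqrt,
}


def field_value_factor_score(query_score, field_value, factor=1.0, modifier="none"):
    """New score under boost_mode multiply: query score times the function value."""
    return query_score * MODIFIERS[modifier](factor * field_value)


prices = {"Apple pie": 8, "Orange pie": 3, "Watermelon pie": 6}
query_score = 1.0  # assumed identical for every hit

ranked = sorted(
    prices,
    key=lambda title: field_value_factor_score(query_score, prices[title]),
    reverse=True,
)
print(ranked)  # ['Apple pie', 'Watermelon pie', 'Orange pie']
```

    Because the raw price dominates the result so strongly, a modifier such as log1p or sqrt is a common way to dampen its influence.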
    
  • Original article: https://www.cnblogs.com/manastudent/p/12952528.html