  • Elasticsearch from Basics to Advanced (8) Search Engine: Mapping, Exact Match vs. Full-Text Search, Analyzers, and Mapping Summary

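    First, a brief description of what a mapping is.

    A mapping is the data structure and related configuration that is created, automatically or manually, for a type in an index.
    With dynamic mapping, Elasticsearch automatically creates the index, the type, and the mapping for that type. The mapping records the data type of each field, along with settings such as how the field is analyzed.

    When we insert a few documents, ES automatically creates the index for us: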

    PUT /website/article/1
    {
      "post_date": "2019-08-21",
      "title": "my first article",
      "content": "this is my first article in this website",
      "author_id": 11400
    }
    
    PUT /website/article/2
    {
      "post_date": "2019-08-22",
      "title": "my second article",
      "content": "this is my second article in this website",
      "author_id": 11400
    }
    
    PUT /website/article/3
    {
      "post_date": "2019-08-23",
      "title": "my third article",
      "content": "this is my third article in this website",
      "author_id": 11400
    }

    View the mapping:

    GET /website/_mapping
    
    {
      "website": {
        "mappings": {
          "article": {
            "properties": {
              "author_id": {
                "type": "long"
              },
              "content": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "post_date": {
                "type": "date"
              },
              "title": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }

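    The mapping above was generated automatically when the documents were inserted; a mapping can also be created manually. Either way, the data structure and related configuration created for a type in an index is called a mapping.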

    Try a few searches:

    GET /website/article/_search?q=2019                      // 3 results
    GET /website/article/_search?q=2019-08-21                // 3 results
    GET /website/article/_search?q=post_date:2019-08-21      // 1 result
    GET /website/article/_search?q=post_date:2019            // 0 results

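    Why are the results inconsistent? When ES built the mapping automatically, it assigned different data types to different fields, and different data types are analyzed and searched differently. That is why searching the _all field and searching the post_date field behave so differently.
    Below is a manually created mapping.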

    PUT /test_mapping
    {
      "mappings" : {
        "properties" : {
          "author_id" : {
            "type" : "long"
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "post_date" : {
            "type" : "date"
          },
          "title" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }

    Exact match vs. full-text search

    exact value

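    The whole value of a field must match for the document to be returned.
    Example:

    GET /website/article/_search?q=post_date:2019-08-21      // 1 result
    GET /website/article/_search?q=post_date:2019            // 0 results

    With an exact value, you must search for the complete value 2019-08-21 to find the document.
    If you enter just 21, nothing is found.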

    full text

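    Full text differs from exact value: rather than matching only one complete value, the value is split into words (tokenized) before matching, and matches can also be made through abbreviations, tense, letter case, synonyms, and so on.
    Example:

    GET /website/article/_search?q=2019                      // 3 results
    GET /website/article/_search?q=2019-08-21                // 3 results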

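    Core principle of the inverted index

    The following shows, in simplified form, how an inverted index is built. In practice the process is far more complex.
    doc1: I really liked my small dogs, and I think my mom also liked them.
    doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

    Tokenize the text and build a first version of the inverted index: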

    word    doc1    doc2
    I        *        *
    really   *
    liked    *        *
    my       *        *
    small    *
    dogs     *        *
    and      *
    think    *
    mom      *        *
    also     *        
    them     *
    He                *
    never             *
    any               *
    so                *
    hope              *
    that              *
    will              *
    not               *
    expect            *
    me                *
    to                *
    him               *
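    The table above can be reproduced with a few lines of Python. This is only a sketch of the idea, not how Elasticsearch builds its index: the function names (tokenize, build_index, search) are made up for illustration, and the tokenizer simply extracts runs of letters without any normalization.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

def tokenize(text):
    """Extract runs of letters; no lowercasing, stemming or synonyms."""
    return re.findall(r"[A-Za-z]+", text)

def build_index(docs):
    """Map each distinct word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain at least one word of the query."""
    hits = set()
    for word in tokenize(query):
        hits |= index.get(word, set())
    return hits

index = build_index(DOCS)
print(sorted(index["liked"]))                   # ['doc1', 'doc2']
print(sorted(index["small"]))                   # ['doc1']
print(search(index, "mother like little dog"))  # set()
```

    Because the index stores the words exactly as they appear in the text, none of the four query words is present as a key, so the search comes back empty.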

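    Searching for mother like little dog against this index returns nothing, because none of the query terms is in it:
    mother
    like
    little
    dog
    This is certainly not the result we want. For example, mother and mom mean the same thing, yet the search finds nothing. ES can still return these documents, because when it builds the inverted index it also processes each token produced by tokenization, to raise the chance that later searches find related documents: converting tense, converting between singular and plural, mapping synonyms, and converting letter case. This process is called normalization. Which conversions actually run depends on the analyzer configured for the field.
    mother -> mom
    liked -> like
    small -> little
    dogs -> dog
    The inverted index is then rebuilt as follows: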

    word    doc1    doc2
    I        *        *
    really   *
    like     *        *
    my       *        *
    little   *
    dog      *        *
    and      *
    think    *
    mom      *        *
    also     *        
    them     *
    He                *
    never             *
    any               *
    so                *
    hope              *
    that              *
    will              *
    not               *
    expect            *
    me                *
    to                *
    him               *

    The query mother like little dog is tokenized and normalized in the same way:
    mother -> mom
    like -> like
    little -> little
    dog -> dog
    Both doc1 and doc2 are now found:
    doc1: I really liked my small dogs, and I think my mom also liked them.
    doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
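    The effect of normalization can be sketched in Python as well. The NORMALIZE table below is a hypothetical stand-in built from the four conversions in the walkthrough; a real analyzer would use lowercasing, stemming and synonym filters rather than a fixed dictionary. The key point it illustrates is that the same normalization is applied both when indexing and when searching.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Toy normalization table (assumption: only the conversions from the walkthrough).
NORMALIZE = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}

def analyze(text):
    """Tokenize into runs of letters, then normalize each token."""
    return [NORMALIZE.get(word, word) for word in re.findall(r"[A-Za-z]+", text)]

def build_index(docs):
    """Map each normalized term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Analyze the query like the documents, then collect matching ids."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits

index = build_index(DOCS)
print(analyze("mother like little dog"))                # ['mom', 'like', 'little', 'dog']
print(sorted(search(index, "mother like little dog")))  # ['doc1', 'doc2']
```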

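    Analyzers

    Tokenization plus normalization (to improve recall)

    Given a sentence, an analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion).
    recall: increasing the number of relevant results that a search is able to find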

    • character filter: pre-processes the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and converting & to and (I&you --> I and you)
    • tokenizer: splits the text into tokens, e.g. hello you and me --> hello, you, and, me
    • token filter: lowercase, stop words, synonyms, e.g. dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

    The analyzer is important: it runs the text through all of these steps, and only the final result is used to build the inverted index.
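    The three stages can be chained as in the following Python sketch. It is an illustration of the pipeline only; the stop-word set and the REPLACEMENTS table are assumptions taken from the examples in the list above, and the regular expressions are much cruder than the real character filters and tokenizers.

```python
import re

STOP_WORDS = {"a", "an", "the"}
# Hypothetical synonym/stem table built from the examples in the list above.
REPLACEMENTS = {"dogs": "dog", "liked": "like", "mother": "mom", "small": "little"}

def char_filter(text):
    """Pre-process raw text: replace HTML tags with spaces and spell out '&'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")

def tokenizer(text):
    """Split the filtered text into tokens made of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def token_filter(tokens):
    """Lowercase each token, drop stop words, then apply the replacement table."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(REPLACEMENTS.get(token, token))
    return result

def analyze(text):
    """character filter -> tokenizer -> token filter"""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<span>Tom</span> liked the small dogs"))  # ['tom', 'like', 'little', 'dog']
print(analyze("I&you"))                                  # ['i', 'and', 'you']
```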

    Built-in analyzers:

    Text to analyze: Set the shape to semi-transparent by calling set_trans(5)
    
    standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
    simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
    whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
    language analyzer (an analyzer for a specific language, e.g. english): set, shape, semi, transpar, call, set_tran, 5
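    To get a feel for why the outputs differ, the first three analyzers can be roughly imitated with regular expressions. These are approximations written for this one sample sentence, not the real implementations: in particular, standard_like_analyzer is a hypothetical name and only mimics the standard analyzer's Unicode text segmentation with runs of word characters.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def simple_analyzer(text):
    """Split on every non-letter character and lowercase the pieces."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]

def standard_like_analyzer(text):
    """Rough stand-in: runs of letters, digits and underscores, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]

print(standard_like_analyzer(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set_trans', '5']
print(simple_analyzer(TEXT))
# ['set', 'the', 'shape', 'to', 'semi', 'transparent', 'by', 'calling', 'set', 'trans']
print(whitespace_analyzer(TEXT))
# ['Set', 'the', 'shape', 'to', 'semi-transparent', 'by', 'calling', 'set_trans(5)']
```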

    Explaining the puzzle left over from the mapping example

    GET /_search?q=2019

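    This searches the _all field: all fields of a document are concatenated into one long string, which is then tokenized. For doc2, for example:

    2019-08-22 my second article this is my second article in this website 11400

            doc1        doc2        doc3
    2019      *          *           *
    08        *          *           *
    21        *
    22                   *
    23                               *

    Searching _all for 2019 therefore matches all 3 documents.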

    GET /_search?q=post_date:2019-08-21

    A date field is indexed as an exact value, so only the complete value matches:

                 doc1        doc2        doc3
    2019-08-21    *        
    2019-08-22                 *         
    2019-08-23                             *
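    The four result counts from the earlier searches can be reproduced with a small simulation. It is a sketch under simplifying assumptions: search_all and search_post_date are made-up helpers, the "full text" side just splits the concatenated field values into word tokens, and the "exact value" side compares the whole string.

```python
import re

DOCS = {
    1: {"post_date": "2019-08-21", "title": "my first article",
        "content": "this is my first article in this website", "author_id": 11400},
    2: {"post_date": "2019-08-22", "title": "my second article",
        "content": "this is my second article in this website", "author_id": 11400},
    3: {"post_date": "2019-08-23", "title": "my third article",
        "content": "this is my third article in this website", "author_id": 11400},
}

def tokens(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def search_all(query):
    """Full-text style: match if any query token occurs in the concatenated fields."""
    wanted = tokens(query)
    return [doc_id for doc_id, doc in DOCS.items()
            if wanted & tokens(" ".join(str(value) for value in doc.values()))]

def search_post_date(value):
    """Exact-value style: the whole value must equal the stored date."""
    return [doc_id for doc_id, doc in DOCS.items() if doc["post_date"] == value]

print(len(search_all("2019")))              # 3
print(len(search_all("2019-08-21")))        # 3
print(len(search_post_date("2019-08-21")))  # 1
print(len(search_post_date("2019")))        # 0
```

    The second query returns 3 documents because 2019-08-21 is itself split into 2019, 08 and 21, and every document contains 2019.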

    Testing an analyzer

    Syntax:

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "Text to analyze"
    }

    {
      "tokens": [
        {
          "token": "text",
          "start_offset": 0,
          "end_offset": 4,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "to",
          "start_offset": 5,
          "end_offset": 7,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "analyze",
          "start_offset": 8,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 2
        }
      ]
    }
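    To make the fields of this response easier to read, the following Python sketch produces the same structure for the sample text. It does not call Elasticsearch: analyze_like_standard is a hypothetical helper that approximates the standard analyzer with a regular expression and always reports the type as <ALPHANUM>, which is an assumption that only holds for purely alphabetic tokens like the ones here.

```python
import re

def analyze_like_standard(text):
    """Return tokens with character offsets and positions, in the response's shape."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),   # the term as it would be indexed
            "start_offset": match.start(),    # index of the first character
            "end_offset": match.end(),        # index just past the last character
            "type": "<ALPHANUM>",             # assumption: alphabetic tokens only
            "position": position,             # ordinal of the token in the text
        })
    return {"tokens": tokens}

result = analyze_like_standard("Text to analyze")
for tok in result["tokens"]:
    print(tok["token"], tok["start_offset"], tok["end_offset"], tok["position"])
# text 0 4 0
# to 5 7 1
# analyze 8 15 2
```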

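    Further summary of mapping

    1. When you insert data directly into ES, it automatically creates the index, along with the type and its mapping.
    2. The mapping automatically defines the data type of each field.
    3. Depending on the data type (for example text vs. date), a field is treated either as an exact value or as full text.
    4. For an exact value, the whole value is put into the inverted index as a single term, without being split. Full text goes through analysis (tokenization and normalization such as tense, synonym, and case conversion) before it is put into the inverted index.
    5. At search time, whether a field is an exact value or full text determines how the search behaves, and this is kept consistent with how the inverted index was built: an exact value field is matched against the whole value, while for a full text field the query is tokenized and normalized as well before the inverted index is searched.
    6. You can use ES dynamic mapping to create the mapping automatically, including the data types. Alternatively, you can create the mapping for the index and type manually in advance and configure each field yourself, including its data type, indexing behavior, analyzer, and so on.

    In essence, a mapping is the metadata of a type in an index; it determines the data types, how the inverted index is built, and how searches behave.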

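    Core mapping data types and dynamic mapping

    • Core data types
      string / text: string type
      byte: byte type
      short: short integer type
      integer: integer type
      long: long integer type
      float: floating-point type
      boolean: boolean type
      date: date/time type

      There are also more complex types, such as arrays and object, but under the hood their values are flattened into fields of the core types above.

    • dynamic mapping
      true or false -> boolean
      123 -> long
      123.45 -> float
      2017-01-01 -> date
      "hello world" -> string / text
    • View the mapping

      Syntax: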
      GET /{index}/_mapping
      GET /{index}/_mapping/{type}
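    The dynamic mapping rules listed above can be expressed as a small type-inference function. This is a simplified sketch, not Elasticsearch's detection logic: infer_field_type and infer_mapping are hypothetical names, and the date check assumes a plain yyyy-MM-dd string, whereas the real date detection depends on the configured date formats.

```python
import re

# Assumption: only plain yyyy-MM-dd strings are treated as dates in this sketch.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def infer_field_type(value):
    """Guess a field type from a JSON value, following the rules listed above."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "text"
    raise TypeError(f"unsupported value in this sketch: {value!r}")

def infer_mapping(doc):
    """Build a properties dict for a flat document."""
    return {field: {"type": infer_field_type(value)} for field, value in doc.items()}

article = {
    "post_date": "2019-08-21",
    "title": "my first article",
    "content": "this is my first article in this website",
    "author_id": 11400,
}
print(infer_mapping(article))
# {'post_date': {'type': 'date'}, 'title': {'type': 'text'},
#  'content': {'type': 'text'}, 'author_id': {'type': 'long'}}
```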

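    Creating and modifying a mapping manually, and controlling whether a string field is analyzed

    Note: you can define the mapping when you create the index, or add a mapping for a new field later, but you cannot update the mapping of an existing field.

    • "analyzer": "standard": the field is analyzed (tokenized) with the standard analyzer
    • date: date
    • keyword: not analyzed
    # Create the index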
    PUT /website
    {
      "mappings": {
        "properties": {
          "author_id": {
            "type": "long"
          },
          "title": {
            "type": "text",
            "analyzer": "standard"
          },
          "content": {
            "type": "text"
          },
          "post_date": {
            "type": "date"
          },
          "publisher_id": {
            "type": "keyword"
          }
        }
      }
    }
    
    
    # Try to modify an existing field's mapping (rejected: the index already exists)
    PUT /website
    {
      "mappings": {
        "properties": {
          "author_id": {
            "type": "text"
          }
        }
      }
    }
    
    {
      "error": {
        "root_cause": [
          {
            "type": "resource_already_exists_exception",
            "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
            "index_uuid": "5xLohnJITHqCwRYInmBFmA",
            "index": "website"
          }
        ],
        "type": "resource_already_exists_exception",
        "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
        "index_uuid": "5xLohnJITHqCwRYInmBFmA",
        "index": "website"
      },
      "status": 400
    }
    
    
    # Add a new field to the mapping
    PUT /website/_mapping
    {
      "properties": {
        "new_field": {
          "type": "text"
        }
      }
    }
    
    {
      "acknowledged" : true
    }
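    The rule stated in the note (new fields may be added, existing fields may not be changed) can be modelled in a few lines of Python. This is a deliberately simplified toy model, not Elasticsearch code: put_mapping and MappingConflictError are hypothetical names, and the check rejects any change to an existing field definition.

```python
class MappingConflictError(Exception):
    """Raised when a request tries to change an existing field's definition."""

def put_mapping(current, new_properties):
    """Return a merged copy of `current`, accepting new fields only."""
    merged = dict(current)
    for field, definition in new_properties.items():
        if field in merged and merged[field] != definition:
            raise MappingConflictError(
                f"cannot change mapping of existing field [{field}]")
        merged[field] = definition
    return merged

mapping = {
    "author_id": {"type": "long"},
    "title": {"type": "text", "analyzer": "standard"},
}

# Adding a field that does not exist yet is accepted.
mapping = put_mapping(mapping, {"new_field": {"type": "text"}})
print(sorted(mapping))  # ['author_id', 'new_field', 'title']

# Changing the type of an existing field is rejected.
try:
    put_mapping(mapping, {"author_id": {"type": "text"}})
except MappingConflictError as exc:
    print("rejected:", exc)
```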

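    Complex mapping types and the underlying structure of object data

    1. multivalue field
      {
          "tags": ["tag1", "tag2"]
      }

      A multivalue field is indexed the same way as a single string value; the values in one array must not mix data types.

    2. empty field
      null, [], [null]
    3. object field
      Sample data: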
      PUT /company/employee/1
      {
        "address": {
          "country": "china",
          "province": "guangdong",
          "city": "guangzhou"
        },
        "name": "jack",
        "age": 27,
        "join_date": "2017-01-01"
      }

      View the mapping:

      GET /company/_mapping/employee
      {
        "company": {
          "mappings": {
            "employee": {
              "properties": {
                "address": {
                  "properties": {
                    "city": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    },
                    "country": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    },
                    "province": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    }
                  }
                },
                "age": {
                  "type": "long"
                },
                "join_date": {
                  "type": "date"
                },
                "name": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            }
          }
        }
      }

      How an object field is stored under the hood

      {
        "address": {
          "country": "china",
          "province": "guangdong",
          "city": "guangzhou"
        },
        "name": "jack",
        "age": 27,
        "join_date": "2017-01-01"
      }

      ↓↓↓↓

      {
          "name":             [jack],
          "age":              [27],
          "join_date":        [2017-01-01],
          "address.country":  [china],
          "address.province": [guangdong],
          "address.city":     [guangzhou]
      }

      An array of objects is flattened in the same way:

      {
          "authors": [
              { "age": 26, "name": "Jack White"},
              { "age": 55, "name": "Tom Jones"},
              { "age": 39, "name": "Kitty Smith"}
          ]
      }

      ↓↓↓↓

      {
          "authors.age":    [26, 55, 39],
          "authors.name":   [jack, white, tom, jones, kitty, smith]
      }
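      The flattening shown above can be sketched as a short recursive function. This illustrates the resulting structure only: flatten is a hypothetical helper, and unlike the listing above it keeps string values whole instead of analyzing them into lowercase tokens.

```python
from collections import defaultdict

def flatten(doc, prefix="", out=None):
    """Flatten nested objects into dotted field names, each holding a list of values."""
    if out is None:
        out = defaultdict(list)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        # Treat a single value like a one-element array.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                flatten(item, prefix=f"{path}.", out=out)
            else:
                out[path].append(item)
    return dict(out)

employee = {
    "address": {"country": "china", "province": "guangdong", "city": "guangzhou"},
    "name": "jack",
    "age": 27,
    "join_date": "2017-01-01",
}
book = {
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
}

print(flatten(employee)["address.city"])  # ['guangzhou']
print(flatten(book))
# {'authors.age': [26, 55, 39], 'authors.name': ['Jack White', 'Tom Jones', 'Kitty Smith']}
```

      In the flattened form, the ages and names of the three authors end up in two independent lists, so the pairing between one author's age and name is no longer recorded.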
  • Original article: https://www.cnblogs.com/wyt007/p/11396510.html