  • Elasticsearch from the Ground Up (8) Search Engine: Mapping, Exact Match vs. Full-Text Search, Analyzers, and a Mapping Summary

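    First, a brief description of what a mapping is.

    A mapping is the data structure and related configuration that is created, automatically or manually, for a type in an index.
    With dynamic mapping, Elasticsearch automatically creates the index, the type, and the type's mapping. The mapping records the data type of each field and settings such as how the field is analyzed.

    When we insert a few documents, ES creates the index for us automatically: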

    PUT /website/article/1
    {
      "post_date": "2019-08-21",
      "title": "my first article",
      "content": "this is my first article in this website",
      "author_id": 11400
    }
    
    PUT /website/article/2
    {
      "post_date": "2019-08-22",
      "title": "my second article",
      "content": "this is my second article in this website",
      "author_id": 11400
    }
    
    PUT /website/article/3
    {
      "post_date": "2019-08-23",
      "title": "my third article",
      "content": "this is my third article in this website",
      "author_id": 11400
    }

    View the mapping:

    GET /website/_mapping
    
    {
      "website": {
        "mappings": {
          "article": {
            "properties": {
              "author_id": {
                "type": "long"
              },
              "content": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "post_date": {
                "type": "date"
              },
              "title": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }

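    The mapping above was generated automatically when the documents were inserted. A mapping can also be created manually.

    Try various searches

    GET /website/article/_search?q=2019            // 3 results
    GET /website/article/_search?q=2019-08-21            // 3 results
    GET /website/article/_search?q=post_date:2019-08-21       // 1 result
    GET /website/article/_search?q=post_date:2019         // 0 results

    The results differ because, when ES built the mapping automatically, it assigned different data types to different fields, and different data types are analyzed and searched differently. A query without a field name searches the _all field, which is why searching _all and searching the post_date field behave so differently.
    Below is a manually created mapping.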

    PUT /test_mapping
    {
      "mappings" : {
        "properties" : {
          "author_id" : {
            "type" : "long"
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "post_date" : {
            "type" : "date"
          },
          "title" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
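
    The repeated text-plus-keyword definition in the mapping above can also be generated programmatically. The following is a minimal sketch using only the Python standard library; the helper names text_with_keyword and build_mapping are hypothetical and not part of any Elasticsearch client. It only builds the request body and does not send it to a cluster.

```python
import json


def text_with_keyword(ignore_above=256):
    """Return a text field definition with a keyword sub-field."""
    return {
        "type": "text",
        "fields": {
            "keyword": {"type": "keyword", "ignore_above": ignore_above}
        },
    }


def build_mapping():
    """Build the request body used for PUT /test_mapping in the article."""
    return {
        "mappings": {
            "properties": {
                "author_id": {"type": "long"},
                "content": text_with_keyword(),
                "post_date": {"type": "date"},
                "title": text_with_keyword(),
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body; sending it to a cluster is left out on purpose.
    print(json.dumps(build_mapping(), indent=2))
```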

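    Exact match vs. full-text search

    exact value

    The field must match the query in its entirety for the document to be returned.
    Example:

    GET /website/article/_search?q=post_date:2019-08-21       // 1 result
    GET /website/article/_search?q=post_date:2019         // 0 results

    With an exact value, you must search for 2019-08-21 to find the document.
    Searching for just 21 finds nothing.

    full text

    Full text differs from exact value: instead of matching only one complete value, the value is split into words (tokenized) and matched word by word, and matches can also be made across abbreviations, tenses, letter case, synonyms, and so on.
    Example:

    GET /website/article/_search?q=2019            // 3 results
    GET /website/article/_search?q=2019-08-21            // 3 results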

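    Core principle of the inverted index

    The following shows a simplified version of how an inverted index is built. In practice the process is far more complex.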
    doc1: I really liked my small dogs, and I think my mom also liked them.
    doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.

    Tokenize the text and build a first version of the inverted index:

    word    doc1    doc2
    I        *        *
    really   *
    liked    *        *
    my       *        *
    small    *
    dogs     *        *
    and      *
    think    *
    mom      *        *
    also     *        
    them     *
    He                *
    never             *
    any               *
    so                *
    hope              *
    that              *
    will              *
    not               *
    expect            *
    me                *
    to                *
    him               *

    Searching for mother like little dog returns no results, because none of these terms is in the index:
    mother
    like 
    little
    dog
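    This is clearly not what we want. For example, mother and mom mean the same thing, yet the search finds nothing. A quick test shows that ES can in fact find these documents. When ES builds the inverted index, it performs one more step: it processes each token so that later searches are more likely to find related documents. This includes converting tenses, singular and plural forms, synonyms, and letter case. The process is called normalization.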
    mother-> mom
    liked -> like
    small -> little
    dogs -> dog
    Rebuilding the inverted index with normalized terms gives:

    word    doc1    doc2
    I        *        *
    really   *
    like     *        *
    my       *        *
    little   *
    dog      *        *
    and      *
    think    *
    mom      *        *
    also     *        
    them     *
    He                *
    never             *
    any               *
    so                *
    hope              *
    that              *
    will              *
    not               *
    expect            *
    me                *
    to                *
    him               *

    Query: mother like little dog, after tokenization and normalization
    mother -> mom
    like -> like
    little -> little
    dog -> dog
    Both doc1 and doc2 are returned:
    doc1: I really liked my small dogs, and I think my mom also liked them.
    doc2: He never liked any dogs, so I hope that my mom will not expect me to liked him.
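
    The two index builds above can be reproduced with a small Python sketch. It is a toy model only: the NORMALIZE table is a fixed dictionary taken from the article's example (an assumption for illustration; real analyzers use stemmers and synonym filters), and all function names are hypothetical.

```python
import re
from collections import defaultdict

DOCS = {
    "doc1": "I really liked my small dogs, and I think my mom also liked them.",
    "doc2": "He never liked any dogs, so I hope that my mom will not expect me to liked him.",
}

# Toy normalization table from the article's example (not a real stemmer).
NORMALIZE = {"mother": "mom", "liked": "like", "small": "little", "dogs": "dog"}


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Map a token to its normalized form, if the table has one."""
    return NORMALIZE.get(token, token)


def build_index(docs, use_normalization):
    """Build a mapping of term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            term = normalize(token) if use_normalization else token
            index[term].add(doc_id)
    return index


def search(index, query, use_normalization):
    """Return ids of documents containing at least one query term."""
    hits = set()
    for token in tokenize(query):
        term = normalize(token) if use_normalization else token
        hits |= index.get(term, set())
    return sorted(hits)


if __name__ == "__main__":
    query = "mother like little dog"
    print(search(build_index(DOCS, False), query, False))
    print(search(build_index(DOCS, True), query, True))
```

    The important design point is that the same normalization is applied both when indexing and when searching; otherwise the query terms and the indexed terms would not line up.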

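    Analyzers

    Tokenization and normalization (which improves recall)

    An analyzer takes a sentence, splits it into individual words, and normalizes each word (tense conversion, singular/plural conversion).
    Recall: how many of the relevant documents a search manages to find. Normalization raises recall by letting more related documents match.

    • character filter: preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you)
    • tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
    • token filter: lowercasing, stop-word removal, synonyms, and so on: dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

    The analyzer matters: a piece of text goes through all of these steps, and only the final result is used to build the inverted index.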
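
    The three stages above can be sketched as a small pipeline. This is an illustration under stated assumptions, not how Elasticsearch implements analysis: the stop-word, synonym, and stem tables are tiny hand-written dictionaries, the tokenizer only splits on whitespace, and every function name is hypothetical.

```python
import re

STOP_WORDS = {"a", "an", "the"}
SYNONYMS = {"mother": "mom", "small": "little"}
STEMS = {"dogs": "dog", "liked": "like"}  # toy stand-in for a real stemmer


def char_filter(text):
    """Strip HTML tags and expand '&' to 'and' before tokenizing."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split on whitespace only (punctuation is not handled here)."""
    return text.split()


def token_filter(tokens):
    """Lowercase, drop stop words, then apply toy stemming and synonyms."""
    out = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        token = STEMS.get(token, token)
        token = SYNONYMS.get(token, token)
        out.append(token)
    return out


def analyze(text):
    """Run the stages in order: char filter, tokenizer, token filter."""
    return token_filter(tokenizer(char_filter(text)))


if __name__ == "__main__":
    print(analyze("<span>hello</span>"))
    print(analyze("I&you"))
    print(analyze("Tom liked the small dogs"))
```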

    Built-in analyzers:

    Text to analyze: Set the shape to semi-transparent by calling set_trans(5)
    
    standard analyzer (the default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5
    simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
    whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
    language analyzer (an analyzer for a specific language, for example english): set, shape, semi, transpar, call, set_tran, 5
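
    Two of these analyzers are simple enough to approximate in a few lines of Python, which makes the differences in the output easier to see. The functions below are rough approximations for ASCII text only (an assumption), with hypothetical names; the standard and language analyzers depend on Unicode segmentation and stemming rules and are not reproduced here.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def simple_like(text):
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


def whitespace_like(text):
    """Approximate the whitespace analyzer: split on whitespace only."""
    return text.split()


if __name__ == "__main__":
    print(simple_like(TEXT))
    print(whitespace_like(TEXT))
```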

    Resolving the open question from the mapping introduction example

    GET /_search?q=2019

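    A query without a field name searches the _all field. All fields of a document are concatenated into one long string, which is then analyzed:

    2019-08-22 my second article this is my second article in this website 11400

            doc1        doc2        doc3
    2019      *          *           *
    08        *          *           *
    21        *
    22                   *
    23                               *

    Searching _all for 2019 therefore matches all 3 documents. A search for 2019-08-21 is likewise split into 2019, 08, and 21, and every document contains 2019 and 08, so it also matches all 3 documents.

    GET /_search?q=post_date:2019-08-21

    A date field is indexed as an exact value:

                 doc1        doc2        doc3
    2019-08-21    *
    2019-08-22                 *
    2019-08-23                             *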
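
    The four search results reported earlier in the article (3, 3, 1, and 0 hits) can be reproduced with a toy model of this explanation. The sketch below assumes the three sample articles from the start of the article; it mirrors the article's reasoning only and is not how Elasticsearch stores date fields internally. The function names are hypothetical.

```python
import re

ARTICLES = {
    1: {"post_date": "2019-08-21", "title": "my first article",
        "content": "this is my first article in this website",
        "author_id": 11400},
    2: {"post_date": "2019-08-22", "title": "my second article",
        "content": "this is my second article in this website",
        "author_id": 11400},
    3: {"post_date": "2019-08-23", "title": "my third article",
        "content": "this is my third article in this website",
        "author_id": 11400},
}


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search_all(docs, query):
    """Full-text style: hit if any query token occurs in the joined fields."""
    wanted = set(tokenize(query))
    hits = []
    for doc_id, doc in docs.items():
        all_text = " ".join(str(value) for value in doc.values())
        if wanted & set(tokenize(all_text)):
            hits.append(doc_id)
    return hits


def search_exact(docs, field, value):
    """Exact-value style: hit only if the whole field value equals the query."""
    return [doc_id for doc_id, doc in docs.items() if str(doc[field]) == value]


if __name__ == "__main__":
    print(search_all(ARTICLES, "2019"))
    print(search_all(ARTICLES, "2019-08-21"))
    print(search_exact(ARTICLES, "post_date", "2019-08-21"))
    print(search_exact(ARTICLES, "post_date", "2019"))
```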

    Testing an analyzer

    Syntax:

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "Text to analyze"
    }

    {
      "tokens": [
        {
          "token": "text",
          "start_offset": 0,
          "end_offset": 4,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "to",
          "start_offset": 5,
          "end_offset": 7,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "analyze",
          "start_offset": 8,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 2
        }
      ]
    }
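
    The same request can be issued from a script. The following is a minimal sketch using only the Python standard library. The base URL http://localhost:9200 is an assumption about where the cluster is listening, and the helper names are hypothetical. The request is sent with POST, which the _analyze endpoint accepts in addition to GET.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="standard"):
    """Build the JSON body for the _analyze API."""
    return {"analyzer": analyzer, "text": text}


def extract_tokens(response):
    """Return the token strings of an _analyze response in position order."""
    tokens = sorted(response["tokens"], key=lambda item: item["position"])
    return [item["token"] for item in tokens]


def analyze(text, analyzer="standard", base_url="http://localhost:9200"):
    """Call _analyze on the cluster at base_url (assumed address)."""
    body = json.dumps(build_analyze_request(text, analyzer)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_tokens(json.load(reply))
```

    With a cluster running at the assumed address, analyze("Text to analyze") would return the token strings from a response shaped like the one shown above.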

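    Further summary of mapping

    1. When data is inserted directly into ES, ES automatically creates the index, along with the type and its mapping
    2. The mapping automatically defines the data type of each field
    3. Depending on the data type (for example text vs. date), a field is treated either as an exact value or as full text
    4. An exact value is added to the inverted index as a single term, with the whole value left unsplit; full text goes through analysis (tokenization and normalization such as tense, synonym, and case conversion) before it is added to the inverted index
    5. At search time, whether a field is an exact value or full text also determines how it is searched, consistent with how it was indexed: an exact-value search matches against the whole value, while a full-text search analyzes and normalizes the query string before looking it up in the inverted index
    6. You can rely on ES dynamic mapping to build the mapping automatically, including the data types; or you can create the mapping for the index and type manually in advance and configure each field yourself, including its data type, indexing behavior, and analyzer

    In essence, a mapping is the metadata of an index's type: it determines the data types, how the inverted index is built, and how searches behave.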

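    Core mapping data types and dynamic mapping

    • Core data types
      text (called string before 5.x): string type
      byte: byte type
      short: short integer type
      integer: integer type
      long: long integer type
      float: floating-point type
      boolean: boolean type
      date: date type

      There are also more complex types such as arrays and object. Underneath, they are flattened into fields of the core types above.

    • dynamic mapping
      true or false -> boolean
      123 -> long
      123.45 -> float
      2017-01-01 -> date
      "hello world" -> text (string before 5.x)
    • Viewing a mapping

      Syntax:
      GET /{index}/_mapping
      GET /{index}/_mapping/{type}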
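
      The dynamic mapping rules listed above can be written down as a small type-inference function. This is a simplified sketch: the date check only recognizes the yyyy-MM-dd form used in this article (an assumption; real date detection accepts more formats), the keyword sub-field that ES adds to text fields is left out, and the function names are hypothetical.

```python
import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # simplified date detection


def infer_type(value):
    """Guess the field type that dynamic mapping would pick for a JSON value."""
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "text"
    raise TypeError(f"unsupported value: {value!r}")


def infer_mapping(doc):
    """Infer a flat properties dict for a document with scalar fields."""
    return {field: {"type": infer_type(value)} for field, value in doc.items()}


if __name__ == "__main__":
    article = {
        "post_date": "2019-08-21",
        "title": "my first article",
        "content": "this is my first article in this website",
        "author_id": 11400,
    }
    print(infer_mapping(article))
```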

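    Creating and modifying a mapping manually, and controlling whether a string field is analyzed

    Note: a mapping can only be created manually when the index is created, or extended later by adding a mapping for a new field. The mapping of an existing field cannot be updated.

    • "analyzer": "standard": the field is analyzed with the standard analyzer
    • date: a date field
    • keyword: the field is not analyzed
    # Create the index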
    PUT /website
    {
      "mappings": {
        "properties": {
          "author_id": {
            "type": "long"
          },
          "title": {
            "type": "text",
            "analyzer": "standard"
          },
          "content": {
            "type": "text"
          },
          "post_date": {
            "type": "date"
          },
          "publisher_id": {
            "type": "keyword"
          }
        }
      }
    }
    
    
    # Try to change a field's mapping by sending PUT /website again (rejected because the index already exists)
    PUT /website
    {
      "mappings": {
        "properties": {
          "author_id": {
            "type": "text"
          }
        }
      }
    }
    
    {
      "error": {
        "root_cause": [
          {
            "type": "resource_already_exists_exception",
            "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
            "index_uuid": "5xLohnJITHqCwRYInmBFmA",
            "index": "website"
          }
        ],
        "type": "resource_already_exists_exception",
        "reason": "index [website/5xLohnJITHqCwRYInmBFmA] already exists",
        "index_uuid": "5xLohnJITHqCwRYInmBFmA",
        "index": "website"
      },
      "status": 400
    }
    
    
    # Add a new field to the mapping
    PUT /website/_mapping
    {
      "properties": {
        "new_field": {
          "type": "text"
        }
      }
    }
    
    {
      "acknowledged" : true
    }
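
    The rule stated in the note above (new fields may be added, existing fields may not be changed) can be modeled in a few lines. This is a toy model with a hypothetical function name, not Elasticsearch's actual merge logic, which is more detailed.

```python
def merge_properties(existing, incoming):
    """Return existing properties plus incoming ones, rejecting changes."""
    merged = dict(existing)  # copy so the caller's mapping is not modified
    for field, definition in incoming.items():
        if field in merged and merged[field] != definition:
            raise ValueError(
                f"cannot change mapping of existing field [{field}]"
            )
        merged[field] = definition
    return merged


if __name__ == "__main__":
    current = {"author_id": {"type": "long"}}
    print(merge_properties(current, {"new_field": {"type": "text"}}))
    try:
        merge_properties(current, {"author_id": {"type": "text"}})
    except ValueError as error:
        print(error)
```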

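    Complex mapping types and the underlying structure of object-type data

    1. multivalue field
      {
          "tags": ["tag1", "tag2"]
      }

      It is indexed the same way as a single string value. The values in the array must not mix data types.

    2. empty field
      null, [], [null]
    3. object field
      Initial data: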
      PUT /company/employee/1
      {
        "address": {
          "country": "china",
          "province": "guangdong",
          "city": "guangzhou"
        },
        "name": "jack",
        "age": 27,
        "join_date": "2017-01-01"
      }

      View the mapping:

      GET /company/_mapping/employee
      {
        "company": {
          "mappings": {
            "employee": {
              "properties": {
                "address": {
                  "properties": {
                    "city": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    },
                    "country": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    },
                    "province": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    }
                  }
                },
                "age": {
                  "type": "long"
                },
                "join_date": {
                  "type": "date"
                },
                "name": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            }
          }
        }
      }

      How an object field is stored underneath

      {
        "address": {
          "country": "china",
          "province": "guangdong",
          "city": "guangzhou"
        },
        "name": "jack",
        "age": 27,
        "join_date": "2017-01-01"
      }

      ↓↓↓↓

      {
          "name":            [jack],
          "age":          [27],
          "join_date":      [2017-01-01],
          "address.country":         [china],
          "address.province":   [guangdong],
          "address.city":  [guangzhou]
      }
      {
          "authors": [
              { "age": 26, "name": "Jack White"},
              { "age": 55, "name": "Tom Jones"},
              { "age": 39, "name": "Kitty Smith"}
          ]
      }

      ↓↓↓↓

      {
          "authors.age":    [26, 55, 39],
          "authors.name":   [jack, white, tom, jones, kitty, smith]
      }
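
      The flattening shown above can be sketched as a short recursive function. It is a minimal illustration with a hypothetical function name: it only produces the dotted field names and their value lists, and it does not analyze text values, so a name stays as "Jack White" instead of being split into jack and white. Null values and empty arrays contribute nothing, matching the empty-field cases listed earlier.

```python
def flatten(doc):
    """Flatten nested objects into dotted field names mapped to value lists."""
    flat = {}

    def add(path, value):
        if value is None:
            return  # null contributes no value
        if isinstance(value, dict):
            for key, child in value.items():
                add(f"{path}.{key}" if path else key, child)
        elif isinstance(value, list):
            for item in value:
                add(path, item)
        else:
            flat.setdefault(path, []).append(value)

    add("", doc)
    return flat


if __name__ == "__main__":
    employee = {
        "address": {"country": "china", "province": "guangdong",
                    "city": "guangzhou"},
        "name": "jack",
        "age": 27,
        "join_date": "2017-01-01",
    }
    print(flatten(employee))
```

      Because values from every object in an array end up in the same per-field list, the link between a particular age and a particular name is lost, as the authors example above shows.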
  • Original article: https://www.cnblogs.com/wyt007/p/11396510.html