  • Elasticsearch Scoring Mechanism in Detail

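    1. Scoring Mechanism in Detail

    1.1. Scoring Mechanism: TF/IDF

    1.1.1 Algorithm Overview

    Put simply, the relevance score algorithm computes how closely the text in an index matches the search text.

    Elasticsearch scores with the term frequency/inverse document frequency algorithm, TF/IDF for short: TF is term frequency and IDF is inverse document frequency. Since version 5.0 the default similarity is BM25, a refinement of TF/IDF, which is why the explain output in section 1.1.2 shows the parameters k1 and b.

    Term frequency: how many times each term of the search text appears in the field text. The more occurrences, the more relevant the document.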


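    Example, for the search request hello world:

    doc1 : hello you and me,and world is very good.

    doc2 : hello,how are you

    doc1 is more relevant: it contains both hello and world, while doc2 contains only hello.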

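    Inverse document frequency: how many documents across the whole index contain each term of the search text. The more common a term is, the less a match on it contributes to relevance.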


    Example, for the search request hello world:

    doc1 : hello ,today is very good

    doc2 : hi world ,how are you

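    Suppose the index holds 100 million documents, of which 1,000 contain hello and 100 contain world.

    doc2 is more relevant: world is the rarer term, so a match on it carries more weight.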
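
The arithmetic behind this example can be sketched as follows. This is a minimal illustration that assumes the IDF formula printed by the explain output in section 1.1.2, log(1 + (N - n + 0.5) / (n + 0.5)); the function and variable names are made up for the sketch.

```python
import math

def idf(total_docs: int, docs_with_term: int) -> float:
    """IDF in the form the explain API prints: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

N = 100_000_000            # documents in the index (from the example)
idf_hello = idf(N, 1000)   # hello appears in 1,000 documents
idf_world = idf(N, 100)    # world appears in 100 documents

print(f"idf(hello) = {idf_hello:.2f}")
print(f"idf(world) = {idf_world:.2f}")
```

The rarer term world gets the larger IDF, so with everything else equal the document that matches world scores higher.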

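    Field-length norm: the length of the field. The longer the field, the weaker the relevance of a match in it.

    Example, for the search request hello world:

    doc1 : {"title":"hello article","content":"balabalabal (10,000 words)"}

    doc2 : {"title":"my article","content":"balabalabal (10,000 words),world"}

    doc1 is more relevant: hello matches in its short title field, whereas world matches doc2 only inside a 10,000-word content field.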

    1.1.2 How _score Is Computed

    GET /book/_search?explain=true
    {
      "query": {
        "match": {
          "description": "java程序员"
        }
      }
    }
    

    Response:

    {
      "took" : 5,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 2.137549,
        "hits" : [
          {
            "_shard" : "[book][0]",
            "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
            "_index" : "book",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 2.137549,
            "_source" : {
              "name" : "spring开发基础",
              "description" : "spring 在java领域非常流行,java程序员都在用。",
              "studymodel" : "201001",
              "price" : 88.6,
              "timestamp" : "2019-08-24 19:11:35",
              "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
              "tags" : [
                "spring",
                "java"
              ]
            },
            "_explanation" : {
              "value" : 2.137549,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.7936629,
                  "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 0.7936629,
                      "description" : "score(freq=2.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.47000363,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 2,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 3,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.7675597,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 2.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 12.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 35.333332,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value" : 1.3438859,
                  "description" : "weight(description:程序员 in 0) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 1.3438859,
                      "description" : "score(freq=1.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.98082924,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 1,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 3,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.6227967,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 12.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 35.333332,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          },
          {
            "_shard" : "[book][0]",
            "_node" : "MDA45-r6SUGJ0ZyqyhTINA",
            "_index" : "book",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 0.57961315,
            "_source" : {
              "name" : "java编程思想",
              "description" : "java语言是世界第一编程语言,在软件开发领域使用人数最多。",
              "studymodel" : "201001",
              "price" : 68.6,
              "timestamp" : "2019-08-25 19:11:35",
              "pic" : "group1/M00/00/00/wKhlQFs6RCeAY0pHAAJx5ZjNDEM428.jpg",
              "tags" : [
                "java",
                "dev"
              ]
            },
            "_explanation" : {
              "value" : 0.57961315,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.57961315,
                  "description" : "weight(description:java in 0) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 0.57961315,
                      "description" : "score(freq=1.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.47000363,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 2,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 3,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.56055,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 19.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 35.333332,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          }
        ]
      }
    }
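
To check how the pieces of the explanation fit together, the scores can be recomputed by hand. The sketch below simply plugs the values from the response above into the two formulas it prints; the boost of 2.2 is taken from the output as given, and the function names are hypothetical.

```python
import math

def bm25_idf(n: int, N: int) -> float:
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def term_score(boost, freq, n, N, dl, avgdl, k1=1.2, b=0.75):
    """Per-term score: product of boost, idf and tf, as in the explanation."""
    return boost * bm25_idf(n, N) * bm25_tf(freq, k1, b, dl, avgdl)

AVGDL = 35.333332  # average field length reported in the response

# Document 3: both terms match; the document score is the sum of the term scores.
java_doc3 = term_score(2.2, freq=2.0, n=2, N=3, dl=12.0, avgdl=AVGDL)
programmer_doc3 = term_score(2.2, freq=1.0, n=1, N=3, dl=12.0, avgdl=AVGDL)
doc3_score = java_doc3 + programmer_doc3

# Document 2: only the term java matches.
doc2_score = term_score(2.2, freq=1.0, n=2, N=3, dl=19.0, avgdl=AVGDL)

print(round(doc3_score, 4), round(doc2_score, 4))
```

Up to floating-point rounding, the results agree with the _score values 2.137549 and 0.57961315 in the response.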
    

    1.1.3 Analyzing How a Document Is Matched

    GET /book/_explain/3
    {
      "query": {
        "match": {
          "description": "java程序员"
        }
      }
    }
    

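    1.2. Doc values

    Searching relies on the inverted index. Sorting relies on a forward index, which lets Elasticsearch read each field of each document and sort on it. This forward index is what doc values are.

    At index time Elasticsearch builds both: the inverted index for searching, and the forward index (doc values) for sorting, aggregations, filtering, and similar operations.

    Doc values are stored on disk. When there is enough memory, the OS caches them in memory automatically and performance stays high; when there is not, the OS evicts them from its cache and they are read from disk again.

    Inverted index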

    doc1: hello world you and me

    doc2: hi, world, how are you

    term   doc1  doc2
    hello  *
    world  *     *
    you    *     *
    and    *
    me     *
    hi           *
    how          *
    are          *

    At search time:

    hello you --> hello, you

    hello --> doc1

    you --> doc1,doc2

    doc1: hello world you and me

    doc2: hi, world, how are you

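    This is where sort by runs into trouble: the inverted index maps a term to documents, not a document to its field values.

    Forward index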

    doc1: { "name": "jack", "age": 27 }

    doc2: { "name": "tom", "age": 30 }

    document  name  age
    doc1      jack  27
    doc2      tom   30
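
The two structures can be sketched in a few lines of Python. This is a toy model only: plain dictionaries stand in for the on-disk index files, the tokenizer is a simple regular expression, and the name and age columns are attached to the two text documents purely for illustration.

```python
import re
from collections import defaultdict

docs = {
    "doc1": "hello world you and me",
    "doc2": "hi, world, how are you",
}

# Inverted index: term -> set of documents containing it.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in re.findall(r"\w+", text.lower()):
        inverted[term].add(doc_id)

def search(query: str) -> set:
    """Return every document that contains at least one query term."""
    hits = set()
    for term in query.lower().split():
        hits |= inverted.get(term, set())
    return hits

# Forward index (doc values): one column per field, document -> value.
doc_values = {
    "name": {"doc1": "jack", "doc2": "tom"},
    "age": {"doc1": 27, "doc2": 30},
}

hits = search("hello you")
# Sorting needs the value of a field for a given document: a forward lookup.
by_age_desc = sorted(hits, key=lambda doc_id: doc_values["age"][doc_id], reverse=True)
print(by_age_desc)
```

Matching uses only the inverted index, while the sort step uses only the doc values column, which mirrors the division of labor described above.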

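    1.3. Query phase

    1.3.1 Query phase

    (1) The search request is sent to a coordinating node, which builds a priority queue of length from + size (10 by default, since from defaults to 0 and size to 10).

    (2) The coordinating node forwards the request to one copy (primary or replica) of every shard. Each shard searches locally and builds its own priority queue.

    (3) Each shard returns its priority queue to the coordinating node, which merges them into a global priority queue.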
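
The three steps above can be sketched as a scatter-gather over in-memory lists. This is a simplified model, not Elasticsearch code: each shard is a list of hypothetical (score, doc_id) pairs, and heapq.nlargest plays the role of the priority queue.

```python
import heapq

def shard_top_hits(shard_hits, from_, size):
    """Local priority queue: a shard keeps its best from_ + size hits."""
    return heapq.nlargest(from_ + size, shard_hits, key=lambda hit: hit[0])

def query_phase(shards, from_=0, size=10):
    # Step 2: every shard searches locally.
    local_queues = [shard_top_hits(hits, from_, size) for hits in shards]
    # Step 3: the coordinating node merges the local queues into a global one.
    all_hits = (hit for queue in local_queues for hit in queue)
    global_queue = heapq.nlargest(from_ + size, all_hits, key=lambda hit: hit[0])
    # Only the requested page survives the merge.
    return global_queue[from_:from_ + size]

shards = [
    [(0.9, "a"), (0.2, "b"), (0.5, "c")],
    [(0.8, "d"), (0.7, "e")],
    [(0.1, "f"), (0.6, "g")],
]

page = query_phase(shards, from_=2, size=2)
print(page)
```

Each shard has to return from + size hits rather than just size, because the coordinating node cannot know in advance which shard holds the documents for the requested page.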

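    1.3.2 How replica shards increase search throughput

    A single request must reach one copy (primary or replica) of every shard. If each shard has several replicas, concurrent search requests can be served by different copies at the same time.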

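    1.4. Fetch phase

    1.4.1 Fetch phase workflow

    (1) Once the coordinating node has built the global priority queue, it sends mget requests to the shards to fetch the corresponding documents.

    (2) Each shard returns its documents to the coordinating node.

    (3) The coordinating node merges the documents and returns the result to the client.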
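
A rough sketch of this workflow, again with in-memory stand-ins: SHARD_STORE and multi_get are hypothetical placeholders for the shards and for one mget request, and the list of winners is assumed to come out of the query phase.

```python
from collections import defaultdict

# Hypothetical per-shard document stores: shard number -> doc ID -> document.
SHARD_STORE = {
    0: {"a": {"title": "doc a"}, "c": {"title": "doc c"}},
    1: {"e": {"title": "doc e"}},
    2: {"g": {"title": "doc g"}},
}

def multi_get(shard, doc_ids):
    """Stand-in for one mget request sent to a single shard."""
    return {doc_id: SHARD_STORE[shard][doc_id] for doc_id in doc_ids}

def fetch_phase(winners):
    """winners: (doc_id, shard) pairs in the global order from the query phase."""
    # Step 1: group the winning IDs so each shard gets a single request.
    ids_by_shard = defaultdict(list)
    for doc_id, shard in winners:
        ids_by_shard[shard].append(doc_id)

    # Step 2: each shard returns its documents.
    fetched = {}
    for shard, doc_ids in ids_by_shard.items():
        fetched.update(multi_get(shard, doc_ids))

    # Step 3: reassemble in ranking order before replying to the client.
    return [fetched[doc_id] for doc_id, _ in winners]

results = fetch_phase([("e", 1), ("g", 2), ("a", 0)])
print(results)
```

Splitting the work this way means full documents travel over the network only for the hits that made it onto the final page.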

    1.4.2 A search without from and size returns the top 10 hits by default, sorted by _score.

    1.5. Summary of search parameters

    1. preference

    Determines which shard copies are used to run the search.

    Values include _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, and _shards:2,3. Note that _primary and _primary_first were removed in Elasticsearch 7.0.

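    The bouncing results problem: two documents have the same value in the sort field, and different shard copies may order them differently. Because requests are sent round-robin to different replica shards, the order of the results on the page changes from one request to the next. These are bouncing results.

    At search time, requests are distributed round-robin across the copies of each shard (replicas and the primary), but different copies may order the documents differently.

    The fix is to set preference to an arbitrary string such as the user_id, so that every search by a given user is executed on the same shard copies and that user no longer sees bouncing results.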
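
Why a fixed preference string helps can be sketched as follows. This is an illustrative model only: the copy names are made up, and zlib.crc32 is a stand-in hash, not the selection logic Elasticsearch actually uses.

```python
import itertools
import zlib

# Hypothetical copies of one shard.
COPIES = ["primary", "replica-1", "replica-2"]

_round_robin = itertools.cycle(COPIES)

def pick_round_robin() -> str:
    """Without a preference: successive requests rotate through the copies."""
    return next(_round_robin)

def pick_by_preference(preference: str) -> str:
    """With a custom preference string: the same string maps to the same copy."""
    # Assumption: any stable hash works for the illustration.
    return COPIES[zlib.crc32(preference.encode("utf-8")) % len(COPIES)]

rotating = [pick_round_robin() for _ in range(3)]
pinned = [pick_by_preference("user_42") for _ in range(3)]
print(rotating)
print(pinned)
```

With rotation, three identical searches hit three different copies and may come back in three different orders; with the pinned choice, the same user always reads from the same copy.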

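    2. timeout

    Bounds the time spent on a query: when the limit is reached, the partial results collected so far are returned, so that a query does not run too long.

    3. routing

    Document routing. By default a document is routed by its _id. With routing=user_id, all data belonging to the same user lands on the same shard.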
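
The effect of routing can be sketched with the simplified rule shard = hash(routing) % number_of_primary_shards. The sketch makes several assumptions: zlib.crc32 stands in for the real hash function, the shard count and the documents are invented, and the actual formula in recent Elasticsearch versions has extra terms that are left out here.

```python
import zlib
from collections import defaultdict

NUM_PRIMARY_SHARDS = 3  # hypothetical index setting

def shard_for(routing: str) -> int:
    """Simplified routing rule: hash the routing value, modulo the shard count."""
    return zlib.crc32(routing.encode("utf-8")) % NUM_PRIMARY_SHARDS

docs = [
    {"_id": "1", "user_id": "u1"},
    {"_id": "2", "user_id": "u2"},
    {"_id": "3", "user_id": "u1"},
    {"_id": "4", "user_id": "u1"},
]

def place(docs, routing_field="_id"):
    """Map shard number -> list of document IDs, routing on the given field."""
    placement = defaultdict(list)
    for doc in docs:
        placement[shard_for(doc[routing_field])].append(doc["_id"])
    return dict(placement)

by_id = place(docs)                              # default: routed by _id
by_user = place(docs, routing_field="user_id")   # routing=user_id
print(by_id)
print(by_user)
```

With routing on user_id, documents 1, 3 and 4 always share a shard, so a search for that user's data can be sent to a single shard.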

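    4. search_type

    Default: query_then_fetch

    dfs_query_then_fetch: can improve the accuracy of relevance sorting, because term statistics are first gathered from all shards and scoring then uses global rather than per-shard frequencies.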
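
The difference between the two search types comes down to which document counts feed the IDF. The sketch below uses invented per-shard statistics and the IDF formula from the explain output in section 1.1.2 to show how far local and global values can drift apart.

```python
import math

def idf(n: int, N: int) -> float:
    """idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Hypothetical statistics for one term on two shards of the same index.
shard_stats = [
    {"docs": 1000, "docs_with_term": 10},
    {"docs": 1000, "docs_with_term": 500},
]

# query_then_fetch: each shard scores with its own local statistics.
local_idfs = [idf(s["docs_with_term"], s["docs"]) for s in shard_stats]

# dfs_query_then_fetch: statistics are summed across shards before scoring.
global_idf = idf(
    sum(s["docs_with_term"] for s in shard_stats),
    sum(s["docs"] for s in shard_stats),
)

print([round(v, 3) for v in local_idfs], round(global_idf, 3))
```

With local statistics the same term is worth much more on the first shard than on the second, so scores from different shards are not directly comparable; the extra round trip buys a single, consistent IDF.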

  • Original article: https://www.cnblogs.com/dalianpai/p/13914352.html