  • Elasticsearch scoring detailed explanation

    Score computation mechanism

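    I have been learning Elasticsearch recently, and I am curious about how it computes the score of the documents it retrieves.

    The official Elasticsearch documentation gives the formula for computing a document's score, but its explanation of queryNorm is very unclear. I particularly wanted to know how query-time boosting on different fields is folded into the score, so I spent some time studying how each step of the score is computed. The detailed queryNorm calculation is given below.

    This article mainly follows the score formula given in Lucene’s Practical Scoring Function.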

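    Important background to keep in mind

    Keep in mind that Elasticsearch computes each document's score per shard. That is, tf, idf and norm are computed with the shard, not the index, as the basic unit. This follows from how Elasticsearch builds indexes: an index can be split into several shards, and each shard may live on a different server, so scoring per shard is reasonable. If an index has multiple shards, the search runs on every shard, the matching documents are scored within each shard, and finally the results from all shards are re-sorted by score.

    Note also that even within a shard, the score is computed separately on each field of a document, and the per-field scores are then added up to give the document's final score.

    (In fact, Elasticsearch builds a separate index for each field.)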

    Score Equation

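    Lucene’s Practical Scoring Function gives the following score formula:

    $$\mathrm{score}(q,d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big) \qquad (1)$$

    Most of this formula is simple and easy to understand. The most complicated part is queryNorm, whose detailed calculation is given later.

    score(q,d) is computed on each field separately and then summed (this also depends on how you ask Elasticsearch to combine the fields).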

    • Term frequency
      tf(t in d) = √frequency
      The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.
      In practice, tf is counted within a field.

    • Inverse document frequency
      idf(t) = 1 + log ( numDocs / (docFreq + 1))

      The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index, divided by the number of documents that contain the term.
      IDF is likewise computed per field.

    • Field-length norm
      norm(d) = 1 / √numTerms

      The field-length norm (norm) is the inverse square root of the number of terms in the field.

    These three features are the most commonly used ones in text retrieval. Elasticsearch multiplies them to get a single feature per token: tf * idf * norm
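
The three formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Lucene's code: the function names are mine, the logarithm is assumed to be the natural log, and the result ignores the precision lost when the norm is stored.

```python
import math

def tf(freq):
    """Term frequency: square root of the raw count in the field."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency, using the natural logarithm."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    """Field-length norm: inverse square root of the number of terms."""
    return 1.0 / math.sqrt(num_terms)

def field_weight(freq, num_docs, doc_freq, num_terms):
    """Product tf * idf * norm for one term in one field of one document."""
    return tf(freq) * idf(num_docs, doc_freq) * field_norm(num_terms)

# Illustrative numbers: a term that occurs once in a 3-term field
# and is present in both documents of a 2-document shard.
example_weight = field_weight(freq=1, num_docs=2, doc_freq=2, num_terms=3)
print(example_weight)
```

Shorter fields and rarer terms both push the weight up, which is the behaviour the three definitions describe.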

    Elasticsearch does not rely on the Vector Space Model alone, because computing document vectors is relatively expensive. Instead, it scores with a combination of the Boolean Model, the TF/IDF model and the Vector Space Model.

    Formula (1) shows that, for each token in the query, a score is computed on each field; the per-token scores are added up and then multiplied by queryNorm(q) and coord(q,d) to give the final score.

    Pay particular attention here: although the official formula multiplies each token's score by t.getBoost(), this is not how the computation is actually carried out.

    In practice, t.getBoost() is folded into the computation of queryNorm(q), and queryNorm(q) also incorporates query-time boosting.
    The official explanation of t.getBoost():
    In fact, reading the explain output is a little more complex than that. You won’t see the boost value or t.getBoost() mentioned in the explanation at all. Instead, the boost is rolled into the queryNorm that is applied to a particular term. Although we said that the queryNorm is the same for every term, you will see that the queryNorm for a boosted term is higher than the queryNorm for an unboosted term.

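    So in the actual computation, the document-side part of each token's score in each document (shown as fieldWeight in the explain output) is just: tf*idf*norm.
    Note also that idf(t)² never shows up as a single squared factor in the explain output. One idf(t) sits inside fieldWeight and the other inside queryWeight (idf(t) * queryNorm), so the product of the two still carries idf(t) twice.

    To emphasize: each token's tf, idf and norm are all computed per field.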

    Query Coordination

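    coord(q,d) in the formula is easy to understand. Roughly: if the query has three words, then the more of those words a matching document contains, the more relevant the document is.

    For example, suppose I search for "oracle database setup".
    If the title field of the matched doc1 contains only the two words "oracle database", then coord(q,d) for that field of the document is 2/3.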

    The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.

    Imagine that we have a query for quick brown fox, and that the weight for each term is 1.5. Without the coordination factor, the score would just be the sum of the weights of the terms in a document. For instance:

    Document with fox → score: 1.5
    Document with quick fox → score: 3.0
    Document with quick brown fox → score: 4.5
    The coordination factor multiplies the score by the number of matching terms in the document, and divides it by the total number of terms in the query. With the coordination factor, the scores would be as follows:

    Document with fox → score: 1.5 * 1 / 3 = 0.5
    Document with quick fox → score: 3.0 * 2 / 3 = 2.0
    Document with quick brown fox → score: 4.5 * 3 / 3 = 4.5
    The coordination factor results in the document that contains all three terms being much more relevant than the document that contains just two of them.
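
The quick brown fox example above can be reproduced with a short sketch. The helper names and the flat per-term weight of 1.5 are taken from the example or made up for illustration; real term weights would of course differ per term.

```python
def coord(matching_terms, total_terms):
    """Fraction of the query terms that the document contains."""
    return matching_terms / total_terms

def score_with_coord(doc_terms, query_terms, weight=1.5):
    """Sum a flat weight per matching term, then apply the coordination factor."""
    matched = [t for t in query_terms if t in doc_terms]
    raw_score = weight * len(matched)
    # Multiply by matched count and divide by query length, as in the text.
    return raw_score * len(matched) / len(query_terms)

query = ["quick", "brown", "fox"]
docs = [["fox"], ["quick", "fox"], ["quick", "brown", "fox"]]
scores = [score_with_coord(doc, query) for doc in docs]
for doc, score in zip(docs, scores):
    print(" ".join(doc), "->", score)
```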

    Query Normalization Factor

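    That leaves the hardest part, queryNorm(q). The official explanation is rather muddled:
    The query normalization factor (queryNorm) is an attempt to normalize a query so that the results from one query may be compared with the results of another.
    The intended benefit of queryNorm(q) is to put the scores of different queries into the same space, so that even the results of different queries can be compared directly.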

    Even though the intent of the query norm is to make results from different queries comparable, it doesn’t work very well. The only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries.

    This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved, but a typical implementation is as follows:
    queryNorm = 1 / √sumOfSquaredWeights
    The sumOfSquaredWeights is calculated by adding together the IDF of each term in the query, squared.
    The same query normalization factor is applied to every document, and you have no way of changing it. For all intents and purposes, it can be ignored.

    Following this official description, it is impossible to reproduce the queryNorm shown in Elasticsearch's explain output.

    So let's look at how Lucene defines it.

    queryNorm in Lucene

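    In TFIDFSimilarity, queryNorm is defined as follows:

    $$\mathrm{queryNorm}(q) = \mathrm{queryNorm}(\mathrm{sumOfSquaredWeights}) = \frac{1}{\mathrm{sumOfSquaredWeights}^{1/2}}$$

    $$\mathrm{sumOfSquaredWeights} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$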

    Well, that is clearer, but how are q.getBoost() and t.getBoost() obtained? There seems to be no further detail.
    Lucene offers only the following explanation:
    t.getBoost() is a search time boost of term t in the query q as specified in the query text (see query syntax), or as set by application calls to setBoost(). Notice that there is really no direct API for accessing a boost of one term in a multi term query, but rather multi terms are represented in a query as multi TermQuery objects, and so the boost of a term in the query is accessible by calling the sub-query getBoost().
    This explanation is basically no help.
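
Taken literally, the Lucene definition of queryNorm can be sketched as below. This is only a transcription of the two formulas into Python, not Lucene's code; the function names and the representation of a query as a list of (idf, term boost) pairs are my own assumptions.

```python
import math

def sum_of_squared_weights(query_boost, terms):
    """terms is a list of (idf, term_boost) pairs, one per query term."""
    return query_boost ** 2 * sum((idf * boost) ** 2 for idf, boost in terms)

def query_norm(query_boost, terms):
    """queryNorm = 1 / sqrt(sumOfSquaredWeights)."""
    return 1.0 / math.sqrt(sum_of_squared_weights(query_boost, terms))

# Illustrative values only: two unboosted terms, each with idf = 1.0.
unboosted = query_norm(1.0, [(1.0, 1.0), (1.0, 1.0)])
# Doubling one term's boost enlarges the sum and shrinks the norm.
boosted = query_norm(1.0, [(1.0, 2.0), (1.0, 1.0)])
print(unboosted, boosted)
```

The sketch makes the open question concrete: the formula needs a value for q.getBoost() and for every t.getBoost(), and the documentation does not say what those are for a multi-field query.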

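    Going back to my original question, what I mainly wanted to understand was how boost information is incorporated into queryNorm(q) once different fields are given different boosts.
    After a rather painful investigation, I worked out how the boost information ends up in queryNorm(q).

    For example, suppose we run the following query in Elasticsearch: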

    GET /test/news/_search?explain
    {
      "query": {
        "multi_match": {
          "query": "apple iphone6",
          "fields": ["title^3", "body^2"],
          "type": "most_fields"
        }
      }
    }

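    Here my documents have fields such as "title" and "body". I want to search on both the "title" and "body" fields, give the title field a boost of 3 and the body field a boost of 2, and have the scores on each field added up as the score of the whole document ("type": "most_fields").

    Here we can take q.getBoost() to be 3 on the title field and 2 on the body field, and t.getBoost() to be 3 on the title field and 2 on the body field.
    Plugging these into the sumOfSquaredWeights formula above does not give the queryNorm value shown in Elasticsearch's explain output.

    Based on this concrete example, the formula actually used for queryNorm within a given field is:

    $$\mathrm{sumOfSquaredWeights} = \left(\frac{1}{\mathrm{fieldBoost}}\right)^2 \cdot \sum_{t \in q} \; \sum_{\mathrm{field} \in \mathrm{searchFields}} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$$

    Here fieldBoost is the boost of the field whose queryNorm is being computed, and inside the double sum idf(t) and t.getBoost() are taken for each search field in turn.

    It took me a long time to work out that sumOfSquaredWeights is computed with this formula, so please do not repost this without permission.
    This formula shows clearly how the field boost and the other boost information are incorporated into queryNorm.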

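    Now that the formulas are clear, let's work through a concrete example.

    Worked example

    First, configure the index to have only one shard, which makes the scores easier to follow. With multiple shards, a hash of the doc id decides which shard each doc goes to, so we cannot easily tell which documents each shard contains and therefore cannot readily compute tf, idf and norm.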

    PUT /test
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }
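
To see why a single shard keeps the arithmetic predictable, here is a small sketch of how idf changes when the same documents are split across shards. The document counts are hypothetical and the helper simply restates the idf formula from earlier in this article.

```python
import math

def idf(num_docs, doc_freq):
    """idf(t) = 1 + ln(numDocs / (docFreq + 1)), computed from shard-local counts."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical corpus: 4 documents, 2 of which contain the term.
single_shard = idf(num_docs=4, doc_freq=2)

# The same corpus split unevenly over two shards.
shard_a = idf(num_docs=3, doc_freq=1)  # 3 docs, 1 contains the term
shard_b = idf(num_docs=1, doc_freq=1)  # 1 doc, and it contains the term

print(single_shard, shard_a, shard_b)
```

The same term gets three different idf values depending on where the documents land, so reproducing a score by hand requires knowing the contents of each shard.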

    Next, a simple mapping:

    PUT /test/_mapping/news
    {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english"
        },
        "body": {
          "type": "string",
          "analyzer": "english"
        },
        "version": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }

    Then index two articles:

    PUT /test/news/1
    {
      "title": "apple released iphone",
      "body": "last day, apple company has released their latest product iphone 6, which is the biggest ihpone in histroy"
    }
    
    PUT /test/news/2
    {
      "title": "microsoft suied apple",
      "body": "microsoft told that apple has used many of their patents, apple need to pay for these patents for 12 billion"
    }

    Now for the search:

    GET /test/news/_search?explain
    {
      "query": {
        "multi_match": {
          "query": "apple iphone",
          "fields": ["title^8", "body^3"],
          "type": "most_fields"
        }
      }
    }

    The JSON response is long, so it is not all shown here. The parts relevant to our calculation are:
    1. First, the score of apple in the title of document 1:

    {
                                     "value": 0.14224225,
                                     "description": "score(doc=0,freq=1.0), product of:",
                                     "details": [
                                        {
                                           "value": 0.4784993,
                                           "description": "queryWeight, product of:",
                                           "details": [
                                              {
                                                 "value": 0.5945349,
                                                 "description": "idf(docFreq=2, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.80482966,
                                                 "description": "queryNorm"
                                              }
                                           ]
                                        },
                                        {
                                           "value": 0.29726744,
                                           "description": "fieldWeight in 0, product of:",
                                           "details": [
                                              {
                                                 "value": 1,
                                                 "description": "tf(freq=1.0), with freq of:",
                                                 "details": [
                                                    {
                                                       "value": 1,
                                                       "description": "termFreq=1.0"
                                                    }
                                                 ]
                                              },
                                              {
                                                 "value": 0.5945349,
                                                 "description": "idf(docFreq=2, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.5,
                                                 "description": "fieldNorm(doc=0)"
                                              }
                                           ]
                                        }
                                     ]
                                  }

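    We can compute these values ourselves. For apple, tf = 1 and idf = 1 + Math.log(maxDocs / (docFreq + 1)) = 1 + Math.log(2 / (2 + 1)) = 0.5945349.
    field norm = 1 / √3 = 0.5773502691896258, but because Elasticsearch stores this norm value in a single byte, precision is lost and it becomes 0.5.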
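
The precision loss can be reproduced with a sketch of a one-byte float encoding. The code below is a Python port, written from memory, of the scheme in Lucene's `SmallFloat.floatToByte315` (3 mantissa bits, 5 exponent bits). Treat the exact bit layout as an assumption rather than a reference implementation; the Python function names are mine.

```python
import math
import struct

FUDGE = (63 - 15) << 3  # exponent offset used by the assumed "315" layout

def float_to_byte315(value):
    """Squeeze a float32 into one byte by dropping most of the mantissa."""
    bits = struct.unpack(">i", struct.pack(">f", value))[0]
    small = bits >> 21
    if small <= FUDGE:
        return 0 if bits <= 0 else 1   # zero/negative, or underflow to smallest value
    if small >= FUDGE + 0x100:
        return 255                     # overflow to largest value
    return small - FUDGE

def byte315_to_float(encoded):
    """Expand the byte back to a float; the dropped mantissa bits come back as zeros."""
    if encoded == 0:
        return 0.0
    bits = ((encoded & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]

def stored_norm(num_terms):
    """Field-length norm after a round trip through the one-byte encoding."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))

print(stored_norm(3))   # title field with 3 terms
print(stored_norm(14))  # a hypothetical 14-term field
```

Under this layout every value from 0.5 up to just below 0.625 decodes to 0.5, which is why 0.577 is reported as 0.5 in the explain output.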

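    Now for the key step, computing queryNorm.
    Definitions:
    the idf of apple in title is idf1: idf1 = 0.5945349
    the idf of apple in body is idf2: idf2 = 0.5945349
    apple appears in both documents, in both the title and the body field, so idf1 = idf2 = 1 + log(2/3)
    the idf of iphone in title is idf3: idf3 = 1 = 1 + log(2/2)
    the idf of iphone in body is idf4: idf4 = 1
    iphone appears in only one document, in both the title and the body field, so idf3 = idf4 = 1 + log(2/2)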

    Then compute sumOfSquaredWeights.
    The sumOfSquaredWeights of this query in title:
    1/8 * 1/8 * (idf1 * idf1 * 8 * 8 + idf2 * idf2 * 3 * 3 + idf3 * idf3 * 8 * 8 + idf4 * idf4 * 3 * 3) = 1.543803711784605
    queryNorm = 1/Math.sqrt(1.543803711784605) = 0.8048296354648813

    As you can see, the queryNorm of this query in title is exactly the value shown in Elasticsearch's explain output.

    The sumOfSquaredWeights of this query in body:
    1/3 * 1/3 * (idf1 * idf1 * 8 * 8 + idf2 * idf2 * 3 * 3 + idf3 * idf3 * 8 * 8 + idf4 * idf4 * 3 * 3) ≈ 10.9781597
    queryNorm = 1/Math.sqrt(10.9781597) ≈ 0.3018111, which agrees with the 0.30181113 shown for body in the explain output.
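
The two calculations above can be checked with a short script. It implements the per-field formula derived in this article (not an official Elasticsearch API); the function name and the dictionary layout are my own choices.

```python
import math

def per_field_query_norm(field, boosts, idfs):
    """queryNorm for one field.

    boosts: {field: boost}
    idfs:   {term: {field: idf}}
    """
    # Sum (idf * field boost)^2 over every query term and every search field.
    total = sum((idf * boosts[f]) ** 2
                for per_field in idfs.values()
                for f, idf in per_field.items())
    # Scale by (1 / boost of the field being scored)^2.
    sum_of_squared_weights = total / boosts[field] ** 2
    return 1.0 / math.sqrt(sum_of_squared_weights)

idf_apple = 1 + math.log(2 / 3)   # docFreq=2, maxDocs=2
idf_iphone = 1 + math.log(2 / 2)  # docFreq=1, maxDocs=2
boosts = {"title": 8, "body": 3}
idfs = {
    "apple": {"title": idf_apple, "body": idf_apple},
    "iphone": {"title": idf_iphone, "body": idf_iphone},
}

title_norm = per_field_query_norm("title", boosts, idfs)
body_norm = per_field_query_norm("body", boosts, idfs)
print(title_norm, body_norm, title_norm / body_norm)
```

Because both fields share the same inner sum, the two norms differ only by the ratio of the field boosts.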

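    queryNorm comparison

    Once field boosts are given, you can observe that the ratio between the queryNorm values of different fields equals the ratio between their field boosts.

    In this example,
    the field boost ratio is 8/3
    the queryNorm ratio is 0.80482966/0.30181113 = 8/3

    Cool!

    The full explain output for this query is given below; interested readers can work through the numbers themselves: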

    {
       "took": 1,
       "timed_out": false,
       "_shards": {
          "total": 1,
          "successful": 1,
          "failed": 0
       },
       "hits": {
          "total": 2,
          "max_score": 0.6467803,
          "hits": [
             {
                "_shard": 0,
                "_node": "hwVl0ucyS_6Ps9-xQ2Ihbw",
                "_index": "test",
                "_type": "news",
                "_id": "1",
                "_score": 0.6467803,
                "_source": {
                   "title": "apple released iphone",
                   "body": "last day, apple company has released their latest product iphone 6, which is the biggest ihpone in histroy"
                },
                "_explanation": {
                   "value": 0.6467803,
                   "description": "sum of:",
                   "details": [
                      {
                         "value": 0.5446571,
                         "description": "sum of:",
                         "details": [
                            {
                               "value": 0.14224225,
                               "description": "weight(title:appl in 0) [PerFieldSimilarity], result of:",
                               "details": [
                                  {
                                     "value": 0.14224225,
                                     "description": "score(doc=0,freq=1.0), product of:",
                                     "details": [
                                        {
                                           "value": 0.4784993,
                                           "description": "queryWeight, product of:",
                                           "details": [
                                              {
                                                 "value": 0.5945349,
                                                 "description": "idf(docFreq=2, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.80482966,
                                                 "description": "queryNorm"
                                              }
                                           ]
                                        },
                                        {
                                           "value": 0.29726744,
                                           "description": "fieldWeight in 0, product of:",
                                           "details": [
                                              {
                                                 "value": 1,
                                                 "description": "tf(freq=1.0), with freq of:",
                                                 "details": [
                                                    {
                                                       "value": 1,
                                                       "description": "termFreq=1.0"
                                                    }
                                                 ]
                                              },
                                              {
                                                 "value": 0.5945349,
                                                 "description": "idf(docFreq=2, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.5,
                                                 "description": "fieldNorm(doc=0)"
                                              }
                                           ]
                                        }
                                     ]
                                  }
                               ]
                            },
                            {
                               "value": 0.40241483,
                               "description": "weight(title:iphon in 0) [PerFieldSimilarity], result of:",
                               "details": [
                                  {
                                     "value": 0.40241483,
                                     "description": "score(doc=0,freq=1.0), product of:",
                                     "details": [
                                        {
                                           "value": 0.80482966,
                                           "description": "queryWeight, product of:",
                                           "details": [
                                              {
                                                 "value": 1,
                                                 "description": "idf(docFreq=1, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.80482966,
                                                 "description": "queryNorm"
                                              }
                                           ]
                                        },
                                        {
                                           "value": 0.5,
                                           "description": "fieldWeight in 0, product of:",
                                           "details": [
                                              {
                                                 "value": 1,
                                                 "description": "tf(freq=1.0), with freq of:",
                                                 "details": [
                                                    {
                                                       "value": 1,
                                                       "description": "termFreq=1.0"
                                                    }
                                                 ]
                                              },
                                              {
                                                 "value": 1,
                                                 "description": "idf(docFreq=1, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.5,
                                                 "description": "fieldNorm(doc=0)"
                                              }
                                           ]
                                        }
                                     ]
                                  }
                               ]
                            }
                         ]
                      },
                      {
                         "value": 0.10212321,
                         "description": "sum of:",
                         "details": [
                            {
                               "value": 0.026670424,
                               "description": "weight(body:appl in 0) [PerFieldSimilarity], result of:",
                               "details": [
                                  {
                                     "value": 0.026670424,
                                     "description": "score(doc=0,freq=1.0), product of:",
                                     "details": [
                                        {
                                           "value": 0.17943723,
                                           "description": "queryWeight, product of:",
                                           "details": [
                                              {
                                                 "value": 0.5945349,
                                                 "description": "idf(docFreq=2, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.30181113,
                                                 "description": "queryNorm"
                                              }
                                           ]
                                        },
                                        {
                                           "value": 0.14863372,
                                           "description": "fieldWeight in 0, product of:",
                                           "details": [
                                              {
                                                 "value": 1,
                                                 "description": "tf(freq=1.0), with freq of:",
                                                 "details": [
                                                    {
                                                       "value": 1,
                                                       "description": "termFreq=1.0"
                                                    }
                                                 ]
                                              },
                                              {
                                                 "value": 0.5945349,
                                                 "description": "idf(docFreq=2, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.25,
                                                 "description": "fieldNorm(doc=0)"
                                              }
                                           ]
                                        }
                                     ]
                                  }
                               ]
                            },
                            {
                               "value": 0.07545278,
                               "description": "weight(body:iphon in 0) [PerFieldSimilarity], result of:",
                               "details": [
                                  {
                                     "value": 0.07545278,
                                     "description": "score(doc=0,freq=1.0), product of:",
                                     "details": [
                                        {
                                           "value": 0.30181113,
                                           "description": "queryWeight, product of:",
                                           "details": [
                                              {
                                                 "value": 1,
                                                 "description": "idf(docFreq=1, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.30181113,
                                                 "description": "queryNorm"
                                              }
                                           ]
                                        },
                                        {
                                           "value": 0.25,
                                           "description": "fieldWeight in 0, product of:",
                                           "details": [
                                              {
                                                 "value": 1,
                                                 "description": "tf(freq=1.0), with freq of:",
                                                 "details": [
                                                    {
                                                       "value": 1,
                                                       "description": "termFreq=1.0"
                                                    }
                                                 ]
                                              },
                                              {
                                                 "value": 1,
                                                 "description": "idf(docFreq=1, maxDocs=2)"
                                              },
                                              {
                                                 "value": 0.25,
                                                 "description": "fieldNorm(doc=0)"
                                              }
                                           ]
                                        }
                                     ]
                                  }
                               ]
                            }
                         ]
                      }
                   ]
                }
             },
             {
                "_shard": 0,
                "_node": "hwVl0ucyS_6Ps9-xQ2Ihbw",
                "_index": "test",
                "_type": "news",
                "_id": "2",
                "_score": 0.08997996,
                "_source": {
                   "title": "microsoft suied apple",
                   "body": "microsoft told that apple has used many of their patents, apple need to pay for these patents for 12 billion"
                },
                "_explanation": {
                   "value": 0.08997996,
                   "description": "sum of:",
                   "details": [
                      {
                         "value": 0.07112113,
                         "description": "product of:",
                         "details": [
                            {
                               "value": 0.14224225,
                               "description": "sum of:",
                               "details": [
                                  {
                                     "value": 0.14224225,
                                     "description": "weight(title:appl in 0) [PerFieldSimilarity], result of:",
                                     "details": [
                                        {
                                           "value": 0.14224225,
                                           "description": "score(doc=0,freq=1.0), product of:",
                                           "details": [
                                              {
                                                 "value": 0.4784993,
                                                 "description": "queryWeight, product of:",
                                                 "details": [
                                                    {
                                                       "value": 0.5945349,
                                                       "description": "idf(docFreq=2, maxDocs=2)"
                                                    },
                                                    {
                                                       "value": 0.80482966,
                                                       "description": "queryNorm"
                                                    }
                                                 ]
                                              },
                                              {
                                                 "value": 0.29726744,
                                                 "description": "fieldWeight in 0, product of:",
                                                 "details": [
                                                    {
                                                       "value": 1,
                                                       "description": "tf(freq=1.0), with freq of:",
                                                       "details": [
                                                          {
                                                             "value": 1,
                                                             "description": "termFreq=1.0"
                                                          }
                                                       ]
                                                    },
                                                    {
                                                       "value": 0.5945349,
                                                       "description": "idf(docFreq=2, maxDocs=2)"
                                                    },
                                                    {
                                                       "value": 0.5,
                                                       "description": "fieldNorm(doc=0)"
                                                    }
                                                 ]
                                              }
                                           ]
                                        }
                                     ]
                                  }
                               ]
                            },
                            {
                               "value": 0.5,
                               "description": "coord(1/2)"
                            }
                         ]
                      },
                      {
                         "value": 0.018858837,
                         "description": "product of:",
                         "details": [
                            {
                               "value": 0.037717674,
                               "description": "sum of:",
                               "details": [
                                  {
                                     "value": 0.037717674,
                                     "description": "weight(body:appl in 0) [PerFieldSimilarity], result of:",
                                     "details": [
                                        {
                                           "value": 0.037717674,
                                           "description": "score(doc=0,freq=2.0), product of:",
                                           "details": [
                                              {
                                                 "value": 0.17943723,
                                                 "description": "queryWeight, product of:",
                                                 "details": [
                                                    {
                                                       "value": 0.5945349,
                                                       "description": "idf(docFreq=2, maxDocs=2)"
                                                    },
                                                    {
                                                       "value": 0.30181113,
                                                       "description": "queryNorm"
                                                    }
                                                 ]
                                              },
                                              {
                                                 "value": 0.21019982,
                                                 "description": "fieldWeight in 0, product of:",
                                                 "details": [
                                                    {
                                                       "value": 1.4142135,
                                                       "description": "tf(freq=2.0), with freq of:",
                                                       "details": [
                                                          {
                                                             "value": 2,
                                                             "description": "termFreq=2.0"
                                                          }
                                                       ]
                                                    },
                                                    {
                                                       "value": 0.5945349,
                                                       "description": "idf(docFreq=2, maxDocs=2)"
                                                    },
                                                    {
                                                       "value": 0.25,
                                                       "description": "fieldNorm(doc=0)"
                                                    }
                                                 ]
                                              }
                                           ]
                                        }
                                     ]
                                  }
                               ]
                            },
                            {
                               "value": 0.5,
                               "description": "coord(1/2)"
                            }
                         ]
                      }
                   ]
                }
             }
          ]
       }
    }
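
As a final check, the document scores in the explain output can be rebuilt from its leaf values. The sketch below is my own arrangement of the numbers, not Elasticsearch code: each term score is queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm), and document 2 additionally gets coord(1/2) on each field because only one of the two query terms matches.

```python
import math

def term_score(idf, query_norm, tf, field_norm):
    """Score of one term in one field: queryWeight * fieldWeight."""
    query_weight = idf * query_norm
    field_weight = tf * idf * field_norm
    return query_weight * field_weight

# Leaf values copied from the explain output above.
IDF_APPLE, IDF_IPHONE = 0.5945349, 1.0
QN_TITLE, QN_BODY = 0.80482966, 0.30181113
NORM_TITLE, NORM_BODY = 0.5, 0.25

# Document 1 matches both terms in both fields.
doc1 = (term_score(IDF_APPLE, QN_TITLE, 1.0, NORM_TITLE)
        + term_score(IDF_IPHONE, QN_TITLE, 1.0, NORM_TITLE)
        + term_score(IDF_APPLE, QN_BODY, 1.0, NORM_BODY)
        + term_score(IDF_IPHONE, QN_BODY, 1.0, NORM_BODY))

# Document 2 matches only apple (twice in body), so coord = 1/2 per field.
COORD = 0.5
doc2 = (COORD * term_score(IDF_APPLE, QN_TITLE, 1.0, NORM_TITLE)
        + COORD * term_score(IDF_APPLE, QN_BODY, math.sqrt(2.0), NORM_BODY))

print(doc1, doc2)
```

Both sums agree with the reported _score values (0.6467803 and 0.08997996) to within float rounding.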
  • Original article: https://www.cnblogs.com/wgwyanfs/p/7222579.html