Boost and queryNorm in Elasticsearch score calculation

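This post was not planned. While giving a talk on ES at work, a question came up: when boost is used, how can you tell from the score what effect the boost had?
The query results make it obvious that boost does what it should, but after a long look at the explain output we still could not figure out where the boost had gone.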

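The question took some time to answer, but it was worth it. Since I had never used Lucene directly, it never occurred to me to check the documentation on the Lucene site. In the Elastic documentation I found this passage: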

In fact, reading the explain output is a little more complex than that. You won’t see the boost value or t.getBoost() mentioned in the explanation at all. Instead, the boost is rolled into the queryNorm that is applied to a particular term. Although we said that the queryNorm is the same for every term, you will see that the queryNorm for a boosted term is higher than the queryNorm for an unboosted term.

In short: finding the boost in the explain output is not straightforward, because it is folded into the calculation of queryNorm.
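
To look at these values yourself, the search request body accepts an `explain` flag, which makes each hit carry a breakdown of its score. The sketch below only builds and prints such a request body; the field name `name` and the query text are borrowed from the example later in this post, and sending the request to a cluster is left out.

```python
import json

# An ordinary search body; "explain": True is the only addition.
# The field "name" and the boost value are taken from the example in this post.
body = {
    "explain": True,
    "query": {
        "match": {
            "name": {"query": "便宜了", "boost": 4}
        }
    },
}

# Print the JSON that would be sent as the search request body.
print(json.dumps(body, ensure_ascii=False, indent=2))
```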

How is queryNorm calculated?

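First, queryNorm is easy to spot in the part of the explain output where idf is computed. How queryNorm itself is computed is described in the Lucene documentation listed as reference 1 at the bottom.
The formula is: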
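
As given in the Lucene `TFIDFSimilarity` documentation (reference 1), the query normalization factor is the inverse square root of the sum of squared weights:

```latex
\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}}
```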

This shows that we need one more formula:
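
The same Lucene documentation (reference 1) defines the sum of squared weights for a Boolean query as follows, where `q.getBoost()` is the boost of the query as a whole and `t.getBoost()` is the boost of an individual term:

```latex
\mathrm{sumOfSquaredWeights} =
  q.\mathrm{getBoost}()^{2} \cdot
  \sum_{t \in q} \bigl( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \bigr)^{2}
```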

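With these two formulas we can compute queryNorm, and queryNorm incorporates t.getBoost(), which is the boost we care about. Clearly the boost is not applied as a simple multiplier, so the score does not visibly come out multiplied by 10 or 20. That is why, during the talk, we searched for a long time without finding an integer multiple.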

How is queryNorm calculated in Elasticsearch?

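Even with the formulas in hand, plugging our query parameters into them gives values that are slightly off; some details are not captured by the formulas. From experiments (not from reading the source) I have a rough idea of how the calculation works, so here is a concrete example of how ES computes queryNorm.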

First, construct a query. The idf calculation is not discussed here; the query is designed so that every idf is 1.
```
{
    "size":30,
    "query":{
        "bool": {
          "should": [
            {
              "match": {
                "name": {
                  "query": "便宜了",
                  "boost": 1
                }
              }
            },
            {
              "match": {
                "server": {
                  "query": "电信",
                  "boost": 1
                }
              }
            }
          ]
        }
    }
}
```
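This is a bool query. (Lucene (and thus Elasticsearch) uses the Boolean model to find matching documents. Many of our queries are in fact treated as bool queries; ES simply provides friendlier query forms on top. A terms query, for example, is a kind of bool should query, or more intuitively, an OR condition.) Both boosts are 1, which is also the default. Assuming the analyzer splits the text into one term per character, the query searches for five terms in total, so according to the formula queryNorm is: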
```
1 / Math.sqrt(5)  // ≈ 0.4472136
```
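The same calculation can be sketched as a small Python function that follows the Lucene formula for queryNorm. This is only an illustration: the helper name `query_norm` and the `(idf, boost)` pair representation are assumptions made for this example, and the query-level boost is assumed to be 1.

```python
import math


def query_norm(terms):
    """Return 1 / sqrt(sum of squared weights).

    `terms` is an iterable of (idf, boost) pairs, one per query term.
    The query-level boost is assumed to be 1 (hypothetical simplification).
    """
    sum_of_squared_weights = sum((idf * boost) ** 2 for idf, boost in terms)
    return 1.0 / math.sqrt(sum_of_squared_weights)


# Five single-character terms ("便宜了" + "电信"), each with idf = 1 and boost = 1.
unboosted = [(1.0, 1.0)] * 5
print(round(query_norm(unboosted), 7))  # 0.4472136
```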
This matches our expectation. What happens if we change the boost?
```
{
    "size":30,
    "query":{
        "bool": {
          "should": [
            {
              "match": {
                "name": {
                  "query": "便宜了",
                  "boost": 4
                }
              }
            },
            {
              "match": {
                "server": {
                  "query": "电信",
                  "boost": 1
                }
              }
            }
          ]
        }
    }
}
```
According to the formula, t.getBoost() should multiply each of the three terms of “便宜了” by 4, so the calculation should be:
```
1 / Math.sqrt(Math.pow(1 * 4, 2) * 3 + 2 * 1)  // ≈ 0.1414213
```
But that is not what actually happens. When the clause with the larger boost matches, ES instead scales the smaller one down to 0.25, that is, one quarter:
```
1 / Math.sqrt(3 + 2 * Math.pow(1 * 0.25, 2))  // ≈ 0.5656854
```
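Conversely, when the clause with the larger boost does not match, the effect of the larger boost is amplified and the first calculation, 0.1414213, is used. So for two documents that match differently:
- name matches, whether or not server matches: 0.5656854 is used
- name does not match, server matches: 0.1414213 is used
Dividing the two numbers gives almost exactly 4. If the difference in term counts were larger, the ratio might not be exactly 4 and some deviation could appear.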
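
The two observed values and their ratio can be reproduced with a few lines of Python. This is a sketch of the arithmetic above only, not of what the ES or Lucene source does; the constant names are made up for this example and every idf is assumed to be 1.

```python
import math

BOOST = 4         # boost on the "name" clause
NAME_TERMS = 3    # "便宜了" -> 3 single-character terms
SERVER_TERMS = 2  # "电信" -> 2 single-character terms

# Formula applied literally: the boosted terms are weighted by BOOST.
norm_plain = 1 / math.sqrt(NAME_TERMS * BOOST ** 2 + SERVER_TERMS * 1 ** 2)

# Observed variant: the unboosted terms are scaled down by 1 / BOOST instead.
norm_scaled = 1 / math.sqrt(NAME_TERMS * 1 ** 2 + SERVER_TERMS * (1 / BOOST) ** 2)

# Print both values and the ratio between them.
print(norm_plain, norm_scaled, norm_scaled / norm_plain)
```

For the term counts used here the ratio comes out at 4 (up to floating-point rounding), which is the boost that was set on the name clause.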

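Note: the analysis above is based on experiments. The ES and Lucene source code is not necessarily implemented this way; after all, the formula can be rearranged in many ways to arrive at a 4x difference.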

References:
1. http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
2. https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html#field-norm

Original article: https://www.cnblogs.com/didda/p/5283753.html
