  • Ignoring TF-IDF scoring in Elasticsearch: using constant_score

    Ignoring TF/IDF

    Sometimes we just don’t care about TF/IDF. All we want to know is that a certain word appears in a field. Perhaps we are searching for a vacation home and we want to find houses that have as many of these features as possible:

    • WiFi
    • Garden
    • Pool

    The vacation home documents look something like this:

    { "description": "A delightful four-bedroomed house with ... " }

    We could use a simple match query:

    GET /_search
    {
      "query": {
        "match": {
          "description": "wifi garden pool"
        }
      }
    }

    However, this isn’t really full-text search. In this case, TF/IDF just gets in the way. We don’t care whether wifi is a common term, or how often it appears in the document. All we care about is that it does appear. In fact, we just want to rank houses by the number of features they have—the more, the better. If a feature is present, it should score 1, and if it isn’t, 0.
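
    The ranking we are after can be sketched in a few lines of Python. This is only an illustration of the desired behaviour, not of how Elasticsearch scores anything: the feature_score helper, the crude tokenizer, and the sample descriptions are all made up for this example.

```python
import re

def feature_score(description, features):
    """Return how many of the wanted features appear in the description."""
    # Crude tokenizer for illustration only: lowercase, split on non-alphanumerics.
    tokens = set(re.findall(r"[a-z0-9]+", description.lower()))
    # Each feature contributes 1 if present and 0 if absent; repeats do not count.
    return sum(1 for feature in features if feature.lower() in tokens)

# Hypothetical sample documents.
homes = {
    "home_a": "Cottage with WiFi, a pool, and yet another pool",
    "home_b": "Flat with wifi",
    "home_c": "Cabin with a garden, a pool and WiFi",
}
wanted = ["wifi", "garden", "pool"]

# Rank homes by the number of wanted features they mention, most first.
ranked = sorted(homes, key=lambda name: feature_score(homes[name], wanted), reverse=True)
```

    Note that home_a mentions the pool twice but still gets only one point for it, which is exactly the behaviour TF/IDF would not give us.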

    constant_score Query

    Enter the constant_score query. This query can wrap either a query or a filter, and assigns a score of 1 to any documents that match, regardless of TF/IDF:

    GET /_search
    {
      "query": {
        "bool": {
          "should": [
            { "constant_score": {
              "query": { "match": { "description": "wifi" }}
            }},
            { "constant_score": {
              "query": { "match": { "description": "garden" }}
            }},
            { "constant_score": {
              "query": { "match": { "description": "pool" }}
            }}
          ]
        }
      }
    }
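
    When the list of features comes from user input, the request body above could be generated rather than written by hand. A minimal sketch, assuming a hypothetical build_feature_query helper that only builds the JSON body and does not talk to a cluster:

```python
import json

def build_feature_query(field, features):
    """Build a bool/should body with one constant_score clause per feature."""
    should = [
        # Each clause wraps a match query, mirroring the request shown above.
        {"constant_score": {"query": {"match": {field: feature}}}}
        for feature in features
    ]
    return {"query": {"bool": {"should": should}}}

body = build_feature_query("description", ["wifi", "garden", "pool"])
print(json.dumps(body, indent=2))
```

    The "query" key inside constant_score follows the syntax used in this article; as far as I know, more recent Elasticsearch versions expect the wrapped clause under "filter" instead, so the key may need adjusting for your version.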

    Perhaps not all features are equally important—some have more value to the user than others. If the most important feature is the pool, we could boost that clause to make it count for more:

    GET /_search
    {
      "query": {
        "bool": {
          "should": [
            { "constant_score": {
              "query": { "match": { "description": "wifi" }}
            }},
            { "constant_score": {
              "query": { "match": { "description": "garden" }}
            }},
            { "constant_score": {
              "boost":   2,
              "query": { "match": { "description": "pool" }}
            }}
          ]
        }
      }
    }

    A matching pool clause would add a score of 2, while the other clauses would add a score of only 1 each.

    Note

    The final score for each result is not simply the sum of the scores of all matching clauses. The coordination factor and query normalization factor are still taken into account.
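
    To see why the note matters, here is a rough sketch of the arithmetic. It assumes Lucene's classic TF/IDF similarity as I understand it, where the coordination factor is the fraction of clauses that matched and the query normalization factor is 1 / sqrt(sum of squared clause boosts). Those formulas are my assumption, not something stated in the source, and the function names are hypothetical.

```python
import math

def raw_score(boosts, matched):
    """Sum of the boosts of the matching constant_score clauses."""
    return sum(boosts[name] for name in matched)

def classic_score(boosts, matched):
    """Raw sum scaled by the assumed coord and queryNorm factors."""
    if not matched:
        return 0.0
    # Assumed queryNorm: 1 / sqrt(sum of squared clause boosts).
    query_norm = 1.0 / math.sqrt(sum(b * b for b in boosts.values()))
    # Assumed coord: matching clauses / total clauses.
    coord = len(matched) / len(boosts)
    return coord * query_norm * raw_score(boosts, matched)

# Clause boosts from the query above: pool counts double.
boosts = {"wifi": 1.0, "garden": 1.0, "pool": 2.0}

pool_only = classic_score(boosts, ["pool"])
wifi_and_garden = classic_score(boosts, ["wifi", "garden"])
```

    Under these assumptions, a home with only a pool and a home with WiFi and a garden both have a raw sum of 2, yet the second ranks higher because the coordination factor rewards matching more clauses.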

    We could improve our vacation home documents by adding a not_analyzed features field:

    { "features": [ "wifi", "pool", "garden" ] }

    What is the benefit of rewriting the documents this way? Does it save index space?
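
    One plausible benefit, which the source does not spell out, is that a not_analyzed field keeps each feature as a single exact value, so every feature can be matched with a term filter instead of a full-text match query. Whether this saves index space is not something the source says. A minimal sketch of such a request body, assuming a hypothetical build_features_query helper and the "filter" form of constant_score mentioned earlier:

```python
import json

def build_features_query(wanted, boosts=None):
    """Build a bool/should body of constant_score term filters on 'features'."""
    boosts = boosts or {}
    should = []
    for feature in wanted:
        # Exact-value lookup against the not_analyzed features field.
        clause = {"filter": {"term": {"features": feature}}}
        if feature in boosts:
            # Optional per-feature boost, as in the boosted pool clause.
            clause["boost"] = boosts[feature]
        should.append({"constant_score": clause})
    return {"query": {"bool": {"should": should}}}

body = build_features_query(["wifi", "garden", "pool"], boosts={"pool": 2})
print(json.dumps(body, indent=2))
```

    Because the field is not analyzed, the values passed in must match the indexed values exactly, including case.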

    Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/ignoring-tfidf.html#ignoring-tfidf

  • Original article: https://www.cnblogs.com/bonelee/p/6475950.html