  • ES similarity algorithm settings (continued)

    Tuning BM25

    One of the nice features of BM25 is that, unlike TF/IDF, it has two parameters that allow it to be tuned:

    k1
    This parameter controls how quickly an increase in term frequency results in term-frequency saturation. The default value is 1.2. Lower values result in quicker saturation, and higher values in slower saturation.
    b
    This parameter controls how much effect field-length normalization should have. A value of 0.0 disables normalization completely, and a value of 1.0 normalizes fully. The default is 0.75.
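
    To make the effect of k1 and b concrete, here is a minimal sketch of the textbook BM25 term-frequency component. The function name bm25_tf and its arguments are hypothetical, and Lucene's actual implementation may differ in details such as how field lengths are encoded, so treat the numbers as illustrative only.

```python
def bm25_tf(freq, field_len, avg_field_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency component (an assumption, not Lucene's exact code).

    freq          -- how often the term occurs in the field
    field_len     -- length of this field, in terms
    avg_field_len -- average length of the field across the index
    """
    # b blends between no length normalization (0.0) and full normalization (1.0).
    norm = 1.0 - b + b * (field_len / avg_field_len)
    # The value approaches the ceiling k1 + 1 as freq grows; k1 sets how fast.
    return freq * (k1 + 1.0) / (freq + k1 * norm)


if __name__ == "__main__":
    # Average-length field: compare how quickly each k1 approaches its ceiling.
    for k1 in (0.5, 1.2, 2.0):
        ceiling = k1 + 1.0
        fractions = [round(bm25_tf(f, 100, 100, k1=k1) / ceiling, 3) for f in (1, 2, 5, 10, 50)]
        print("k1 =", k1, "fraction of ceiling:", fractions)

    # Same term frequency, short versus long field, for several values of b.
    for b in (0.0, 0.75, 1.0):
        short = bm25_tf(3, 10, 100, b=b)
        long_ = bm25_tf(3, 1000, 100, b=b)
        print("b =", b, "short:", round(short, 3), "long:", round(long_, 3))
```

    With a lower k1 the value reaches a larger fraction of its ceiling after only a few occurrences, which is the quicker saturation described above. With b set to 0.0 the short and long fields score the same.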

    The practicalities of tuning BM25 are another matter. The default values for k1 and b should be suitable for most document collections, but the optimal values really depend on the collection. Finding good values for your collection is a matter of adjusting, checking, and adjusting again.

    The similarity algorithm can be set on a per-field basis. It’s just a matter of specifying the chosen algorithm in the field’s mapping:

    PUT /my_index
    {
      "mappings": {
        "doc": {
          "properties": {
            "title": {
              "type":       "string",
              "similarity": "BM25" 
            },
            "body": {
              "type":       "string",
              "similarity": "default" 
            }
          }
        }
      }
    }

    The title field uses BM25 similarity.

    The body field uses the default similarity (see Lucene’s Practical Scoring Function).

    Currently, it is not possible to change the similarity mapping for an existing field. You would need to reindex your data in order to do that.
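
    One way to approach this is to create a new index whose mapping carries the desired similarity and then copy the documents across. The sketch below only builds the two request bodies and prints them. The index name my_index_v2 and the helper with_similarity are hypothetical, and the second body assumes a cluster version that provides the _reindex API; on older clusters the copy step would need another mechanism.

```python
import json


def with_similarity(mappings, field, similarity):
    """Return a copy of `mappings` with `field` switched to `similarity`."""
    # Round-tripping through JSON is a cheap deep copy for JSON-like data.
    new_mappings = json.loads(json.dumps(mappings))
    new_mappings["doc"]["properties"][field]["similarity"] = similarity
    return new_mappings


# Mapping of the existing index, as in the example above.
old_mappings = {
    "doc": {
        "properties": {
            "title": {"type": "string", "similarity": "BM25"},
            "body": {"type": "string", "similarity": "default"},
        }
    }
}

# Body for: PUT /my_index_v2  (hypothetical name for the new index)
create_body = {"mappings": with_similarity(old_mappings, "body", "BM25")}

# Body for: POST /_reindex  (assumes the reindex API is available)
reindex_body = {
    "source": {"index": "my_index"},
    "dest": {"index": "my_index_v2"},
}

print(json.dumps(create_body, indent=2))
print(json.dumps(reindex_body, indent=2))
```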

    Configuring BM25

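    Configuring a similarity is much like configuring an analyzer. Custom similarities can be specified when creating an index. For instance, the following request defines my_bm25, a custom similarity based on the built-in BM25 with b set to 0 to disable field-length normalization. The title field uses my_bm25, and the body field uses the built-in BM25: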

    PUT /my_index
    {
      "settings": {
        "similarity": {
          "my_bm25": { 
            "type": "BM25",
            "b":    0 
          }
        }
      },
      "mappings": {
        "doc": {
          "properties": {
            "title": {
              "type":       "string",
              "similarity": "my_bm25" 
            },
            "body": {
              "type":       "string",
              "similarity": "BM25" 
            }
          }
        }
      }
    }
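
    When trying several parameter combinations, it can help to generate the request body rather than edit it by hand. The following is a minimal sketch: the helpers bm25_similarity and index_body are hypothetical, the non-negative check on k1 is an assumption, and the range check on b follows the 0.0 to 1.0 range described earlier. Unlike the request above, it writes k1 explicitly, using the default of 1.2.

```python
import json

# Built-in similarity names used in the examples above.
BUILT_IN = {"BM25", "default"}


def bm25_similarity(k1=1.2, b=0.75):
    """Return a settings entry for a custom BM25 similarity."""
    if k1 < 0:
        raise ValueError("k1 is assumed to be non-negative")
    if not 0.0 <= b <= 1.0:
        raise ValueError("b should lie between 0.0 and 1.0")
    return {"type": "BM25", "k1": k1, "b": b}


def index_body(similarities, field_similarity):
    """Build a create-index body that maps each field to a similarity name."""
    for field, name in field_similarity.items():
        if name not in similarities and name not in BUILT_IN:
            raise KeyError(f"field {field!r} uses undefined similarity {name!r}")
    properties = {
        field: {"type": "string", "similarity": name}
        for field, name in field_similarity.items()
    }
    return {
        "settings": {"similarity": similarities},
        "mappings": {"doc": {"properties": properties}},
    }


# Reproduces the request above: title without length normalization, body on built-in BM25.
body = index_body(
    {"my_bm25": bm25_similarity(b=0.0)},
    {"title": "my_bm25", "body": "BM25"},
)
print(json.dumps(body, indent=2))
```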

    Reference: https://www.elastic.co/guide/en/elasticsearch/guide/current/changing-similarities.html
  • Original article: https://www.cnblogs.com/bonelee/p/6472828.html