  • Some pitfalls when searching with MatchPhrase in Elasticsearch

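    • Analyzed (tokenized) fields are usually searched with a match query, and phrases with a match_phrase query. However, match_phrase does not search an analyzed field without tokenizing the input, which is what business users often mean by an "exact match". The example below shows why, so take note if you use Elasticsearch.
    • An index contains the following document: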
        {
            "id": "1",
            "fulltext": "亚马逊卓越有限公司诉讼某某公司"
        }
      

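      The fulltext field is analyzed with the ik analyzer when it is indexed. Running the ik analyzer on this text produces the following tokens: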

        "tokens": [
            {
              "token": "亚马逊",
              "start_offset": 0,
              "end_offset": 3,
              "type": "CN_WORD",
              "position": 0
            },
            {
              "token": "亚",
              "start_offset": 0,
              "end_offset": 1,
              "type": "CN_WORD",
              "position": 1
            },
            {
              "token": "马",
              "start_offset": 1,
              "end_offset": 2,
              "type": "CN_CHAR",
              "position": 2
            },
            {
              "token": "逊",
              "start_offset": 2,
              "end_offset": 3,
              "type": "CN_WORD",
              "position": 3
            },
            {
              "token": "卓越",
              "start_offset": 3,
              "end_offset": 5,
              "type": "CN_WORD",
              "position": 4
            },
            {
              "token": "卓",
              "start_offset": 3,
              "end_offset": 4,
              "type": "CN_WORD",
              "position": 5
            },
            {
              "token": "越有",
              "start_offset": 4,
              "end_offset": 6,
              "type": "CN_WORD",
              "position": 6
            },
            {
              "token": "有限公司",
              "start_offset": 5,
              "end_offset": 9,
              "type": "CN_WORD",
              "position": 7
            },
            {
              "token": "有限",
              "start_offset": 5,
              "end_offset": 7,
              "type": "CN_WORD",
              "position": 8
            },
            {
              "token": "公司",
              "start_offset": 7,
              "end_offset": 9,
              "type": "CN_WORD",
              "position": 9
            },
            {
              "token": "诉讼",
              "start_offset": 9,
              "end_offset": 11,
              "type": "CN_WORD",
              "position": 10
            },
            {
              "token": "讼",
              "start_offset": 10,
              "end_offset": 11,
              "type": "CN_WORD",
              "position": 11
            },
            {
              "token": "某某",
              "start_offset": 11,
              "end_offset": 13,
              "type": "CN_WORD",
              "position": 12
            },
            {
              "token": "某公司",
              "start_offset": 12,
              "end_offset": 15,
              "type": "CN_WORD",
              "position": 13
            },
            {
              "token": "公司",
              "start_offset": 13,
              "end_offset": 15,
              "type": "CN_WORD",
              "position": 14
            }
          ]
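
    Token listings like the one above can be obtained from Elasticsearch's _analyze API. The sketch below only builds the request path and JSON body; sending it is left to whatever HTTP client is in use. The helper name build_analyze_request is hypothetical, and the analyzer name ik_max_word is an assumption, since the article only says "ik analyzer".

```python
import json


def build_analyze_request(text, analyzer="ik_max_word"):
    """Return (path, body) for an Elasticsearch _analyze call.

    The analyzer name is an assumption; use whichever analyzer the
    fulltext field is actually mapped with.
    """
    body = {"analyzer": analyzer, "text": text}
    # ensure_ascii=False keeps the Chinese text readable in the body.
    return "/_analyze", json.dumps(body, ensure_ascii=False)


path, body = build_analyze_request("亚马逊卓越有限公司诉讼某某公司")
print("POST", path)
print(body)
```

    Running the same request against each search term is the quickest way to see which tokens and positions a phrase query will actually look for.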
      

    Given the tokens above, a match_phrase query for "亚马逊卓越" returns no results,
    because analyzing "亚马逊卓越" produces:

        {
          "tokens": [
            {
              "token": "亚马逊",
              "start_offset": 0,
              "end_offset": 3,
              "type": "CN_WORD",
              "position": 0
            },
            {
              "token": "亚",
              "start_offset": 0,
              "end_offset": 1,
              "type": "CN_WORD",
              "position": 1
            },
            {
              "token": "马",
              "start_offset": 1,
              "end_offset": 2,
              "type": "CN_CHAR",
              "position": 2
            },
            {
              "token": "逊",
              "start_offset": 2,
              "end_offset": 3,
              "type": "CN_WORD",
              "position": 3
            },
            {
              "token": "卓越",
              "start_offset": 3,
              "end_offset": 5,
              "type": "CN_WORD",
              "position": 4
            },
            {
              "token": "卓",
              "start_offset": 3,
              "end_offset": 4,
              "type": "CN_WORD",
              "position": 5
            },
            {
              "token": "越",
              "start_offset": 4,
              "end_offset": 5,
              "type": "CN_CHAR",
              "position": 6
            }
          ]
        }
    

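    Comparing this with the stored content: the indexed text contains the token "越有", whereas the query contains "越" rather than "越有". The phrase match therefore fails, and nothing is returned.
    Using the same document with a different search term, "亚马逊卓越有", does return it. Analyzing "亚马逊卓越有" produces: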

         {
          "tokens": [
            {
              "token": "亚马逊",
              "start_offset": 0,
              "end_offset": 3,
              "type": "CN_WORD",
              "position": 0
            },
            {
              "token": "亚",
              "start_offset": 0,
              "end_offset": 1,
              "type": "CN_WORD",
              "position": 1
            },
            {
              "token": "马",
              "start_offset": 1,
              "end_offset": 2,
              "type": "CN_CHAR",
              "position": 2
            },
            {
              "token": "逊",
              "start_offset": 2,
              "end_offset": 3,
              "type": "CN_WORD",
              "position": 3
            },
            {
              "token": "卓越",
              "start_offset": 3,
              "end_offset": 5,
              "type": "CN_WORD",
              "position": 4
            },
            {
              "token": "卓",
              "start_offset": 3,
              "end_offset": 4,
              "type": "CN_WORD",
              "position": 5
            },
            {
              "token": "越有",
              "start_offset": 4,
              "end_offset": 6,
              "type": "CN_WORD",
              "position": 6
            }
          ]
        }
    

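    Notice that the token "越有" now appears. The query tokens are now identical, in both text and position, to the first seven tokens of the indexed text, so match_phrase finds the document.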
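
    The two outcomes so far can be reproduced with a small position check. This is a simplified model of a phrase match with no slop, not Elasticsearch's actual implementation: every query token must occur in the document at the same relative position it has in the query. The token lists are copied from the listings above; the function name is hypothetical.

```python
# (token, position) pairs copied from the analysis of the stored text.
DOC_TOKENS = [
    ("亚马逊", 0), ("亚", 1), ("马", 2), ("逊", 3), ("卓越", 4),
    ("卓", 5), ("越有", 6), ("有限公司", 7), ("有限", 8), ("公司", 9),
    ("诉讼", 10), ("讼", 11), ("某某", 12), ("某公司", 13), ("公司", 14),
]


def exact_phrase_match(doc_tokens, query_tokens):
    """Simplified slop=0 check: all query tokens must appear in the
    document, shifted by one common offset."""
    doc_set = set(doc_tokens)
    first_token, first_pos = query_tokens[0]
    # Try every document occurrence of the first query token as the anchor.
    for token, doc_pos in doc_tokens:
        if token != first_token:
            continue
        shift = doc_pos - first_pos
        if all((t, p + shift) in doc_set for t, p in query_tokens):
            return True
    return False


# Tokens for "亚马逊卓越": the last token is "越".
QUERY_A = [("亚马逊", 0), ("亚", 1), ("马", 2), ("逊", 3),
           ("卓越", 4), ("卓", 5), ("越", 6)]
# Tokens for "亚马逊卓越有": the last token is "越有".
QUERY_B = QUERY_A[:-1] + [("越有", 6)]

print(exact_phrase_match(DOC_TOKENS, QUERY_A))  # False: "越" was never indexed
print(exact_phrase_match(DOC_TOKENS, QUERY_B))  # True
```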

    A more extreme example: searching for "越有限公司" also returns the document, surprisingly. Analyzing "越有限公司" produces:

        {
          "tokens": [
            {
              "token": "越有",
              "start_offset": 0,
              "end_offset": 2,
              "type": "CN_WORD",
              "position": 0
            },
            {
              "token": "有限公司",
              "start_offset": 1,
              "end_offset": 5,
              "type": "CN_WORD",
              "position": 1
            },
            {
              "token": "有限",
              "start_offset": 1,
              "end_offset": 3,
              "type": "CN_WORD",
              "position": 2
            },
            {
              "token": "公司",
              "start_offset": 3,
              "end_offset": 5,
              "type": "CN_WORD",
              "position": 3
            }
          ]
        }
    

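    This sequence again lines up exactly with the indexed tokens (from "越有" onward), so the document matches. Note the token "有限公司" here. If the search term is changed to "越有限", nothing is found, because "越有限" is analyzed as: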

        {
          "tokens": [
            {
              "token": "越有",
              "start_offset": 0,
              "end_offset": 2,
              "type": "CN_WORD",
              "position": 0
            },
            {
              "token": "有限",
              "start_offset": 1,
              "end_offset": 3,
              "type": "CN_WORD",
              "position": 1
            }
          ]
        }
    

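    Both "越有" and "有限" are present in the indexed text, but the token "有限公司" sits between them: in the index "越有" is at position 6 and "有限" at position 8, while in the query they are adjacent (positions 0 and 1). The positions do not line up, so nothing matches. Setting slop=1 on the match_phrase query lets "越有限" match. The takeaway: slop is the tolerated difference in token positions, and match_phrase does not match by concatenating the words into a raw string. The query is still analyzed, and the resulting tokens must match the indexed tokens by position.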
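
    The effect of slop can be sketched the same way. The model below is a simplification, not Lucene's actual sloppy phrase scorer: for each query token it takes the offset (document position minus query position), and the phrase matches when some choice of occurrences keeps the spread of those offsets within slop. All function names are hypothetical; the query body at the end uses the standard match_phrase form with the field name fulltext from the example document.

```python
from itertools import product

# (token, position) pairs copied from the analysis of the stored text.
DOC_TOKENS = [
    ("亚马逊", 0), ("亚", 1), ("马", 2), ("逊", 3), ("卓越", 4),
    ("卓", 5), ("越有", 6), ("有限公司", 7), ("有限", 8), ("公司", 9),
    ("诉讼", 10), ("讼", 11), ("某某", 12), ("某公司", 13), ("公司", 14),
]


def min_position_spread(doc_tokens, query_tokens):
    """Smallest spread of (doc position - query position) over all ways of
    picking one document occurrence per query token; None if a token is
    missing from the document."""
    positions = {}
    for token, pos in doc_tokens:
        positions.setdefault(token, []).append(pos)
    offsets = []
    for token, query_pos in query_tokens:
        if token not in positions:
            return None
        offsets.append([p - query_pos for p in positions[token]])
    return min(max(combo) - min(combo) for combo in product(*offsets))


def phrase_matches(doc_tokens, query_tokens, slop=0):
    spread = min_position_spread(doc_tokens, query_tokens)
    return spread is not None and spread <= slop


def match_phrase_body(field, text, slop=0):
    """Request body for a match_phrase query with an explicit slop."""
    return {"query": {"match_phrase": {field: {"query": text, "slop": slop}}}}


YUE_YOUXIAN_GONGSI = [("越有", 0), ("有限公司", 1), ("有限", 2), ("公司", 3)]
YUE_YOUXIAN = [("越有", 0), ("有限", 1)]

print(phrase_matches(DOC_TOKENS, YUE_YOUXIAN_GONGSI))   # True
print(phrase_matches(DOC_TOKENS, YUE_YOUXIAN))          # False
print(phrase_matches(DOC_TOKENS, YUE_YOUXIAN, slop=1))  # True
print(match_phrase_body("fulltext", "越有限", slop=1))
```

    In this model "越有限" has a spread of 1 (offsets 6 and 7), which is why slop=0 rejects it and slop=1 accepts it.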

  • Original article: https://www.cnblogs.com/eviltuzki/p/8183191.html