
How ES match, match_phrase, term, and wildcard queries work

For example: suppose we need the matching behavior of SQL's LIKE "%xxxx%".

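wildcard query

The closest equivalent in ES is the wildcard query. It does not analyze (tokenize) the query string; instead it walks the terms in the inverted index and tests them against the pattern one by one. With a leading wildcard such as *xxxx* nothing narrows that scan, so performance can be very poor on a large index. Use it with caution.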
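To make the cost concrete, here is a minimal Python sketch of what a pattern with a leading wildcard implies: every term has to be tested against the pattern. The toy `inverted_index` and the helper `wildcard_search` are hypothetical illustrations, not how ES or Lucene is implemented internally.

```python
from fnmatch import fnmatchcase

# Hypothetical toy inverted index: term -> set of document ids.
inverted_index = {
    "青岛": {1, 2},
    "上合": {1},
    "蓝": {1, 3},
    "青岛啤酒": {4},
}

def wildcard_search(index, pattern):
    """Return the matching doc ids and the number of terms that were checked."""
    matched_docs = set()
    checked = 0
    # A leading wildcard leaves no prefix to narrow the scan, so every term is visited.
    for term, doc_ids in index.items():
        checked += 1
        if fnmatchcase(term, pattern):
            matched_docs |= doc_ids
    return matched_docs, checked

docs, checked = wildcard_search(inverted_index, "*岛*")
print(sorted(docs), checked)
```

The number of checked terms grows with the size of the term dictionary, which is why the query gets slow on large indexes.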

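match full-text query

The least accurate option is the match full-text query. With the default OR operator, a document is recalled as soon as any single term from the analyzed query appears in the inverted index, so it easily returns documents that have nothing to do with what you searched for.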
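The recall behavior described above can be sketched in a few lines of Python. The toy `inverted_index` and the helper `match_any` are hypothetical, and the query terms are assumed to be what an analyzer would produce for "上合蓝".

```python
# Hypothetical toy inverted index: term -> set of document ids.
inverted_index = {
    "青岛": {1, 2},
    "上合": {1},
    "蓝": {1, 3},
}

def match_any(index, query_terms):
    """Union of the posting lists: a document is recalled if it contains ANY query term."""
    recalled = set()
    for term in query_terms:
        recalled |= index.get(term, set())
    return recalled

# Assumed analysis of the query "上合蓝": two terms.
print(sorted(match_any(inverted_index, ["上合", "蓝"])))
```

Document 3 is recalled even though it only contains "蓝", which is exactly the kind of loosely related hit the paragraph warns about.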

term query

If your search string does not need to be analyzed and is just a single term, a term query is the simplest choice.
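As a rough illustration, a term query behaves like an exact lookup with no analysis step. The sketch below assumes a field whose indexed terms are "青岛", "上合", and "蓝"; `inverted_index` and `term_lookup` are hypothetical names.

```python
# Hypothetical terms stored for one analyzed field after indexing "青岛上合蓝".
inverted_index = {
    "青岛": {1},
    "上合": {1},
    "蓝": {1},
}

def term_lookup(index, term):
    """Exact lookup: the input is used as-is, with no analysis step."""
    return index.get(term, set())

print(term_lookup(inverted_index, "上合"))        # an indexed term
print(term_lookup(inverted_index, "青岛上合蓝"))   # the whole string was never indexed as one term
```

This is why a term query only fits when the search string is itself one of the indexed terms.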

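match_phrase query

A good compromise between performance and accuracy is the match_phrase query.

match_phrase analyzes the query and requires that every resulting term appears in the document, and that the terms are adjacent and in the same order as in the query. Let's look at an example.

We use the ik_smart Chinese analyzer to analyze "青岛上合蓝":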

[
    'index' => 'article',
    'body' => [
        'analyzer' => 'ik_smart',
        'text' => '青岛上合蓝',
    ]
]

The result:

{
    "tokens": [{
        "token": "青岛",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 0
    }, {
        "token": "上合",
        "start_offset": 2,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 1
    }, {
        "token": "蓝",
        "start_offset": 4,
        "end_offset": 5,
        "type": "CN_WORD",
        "position": 2
    }]
}

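As you can see, each term carries a position field that records where the term sits in the text. This directly determines whether match_phrase can recall the document.

Next we search. The query string is "上合蓝", which is analyzed as follows: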

{
    "tokens": [{
        "token": "上合",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 0
    }, {
        "token": "蓝",
        "start_offset": 2,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 1
    }]
}

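The positions of "上合" and "蓝" are adjacent, and the two terms are consecutive and in the same order as "上合" and "蓝" in the earlier "青岛上合蓝". A match_phrase search for "上合蓝" therefore recalls the document "青岛上合蓝".

In contrast, if the query is "青岛蓝", the term "上合" is missing between "青岛" and "蓝". The two query terms are adjacent, but in the document they are not, so the document is not recalled: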

{
    "tokens": [{
        "token": "青岛",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 0
    }, {
        "token": "蓝",
        "start_offset": 2,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 1
    }]
}

So match_phrase does solve our scenario.
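The position check can be sketched in Python as follows. This is a simplified model of phrase matching, not the actual Lucene implementation; the helpers `positions_by_term` and `phrase_match` are hypothetical, and the token lists reuse only the `token` and `position` fields from the analysis results above.

```python
def positions_by_term(tokens):
    """Map each term to the list of positions where it occurs."""
    positions = {}
    for token in tokens:
        positions.setdefault(token["token"], []).append(token["position"])
    return positions

def phrase_match(doc_tokens, query_tokens):
    """True if the query terms occur in the document with the same relative positions."""
    if not query_tokens:
        return False
    doc_positions = positions_by_term(doc_tokens)
    first = query_tokens[0]
    for start in doc_positions.get(first["token"], []):
        # Every query term must sit at the same offset from the first term as in the query.
        if all(
            start + (q["position"] - first["position"]) in doc_positions.get(q["token"], [])
            for q in query_tokens
        ):
            return True
    return False

doc = [
    {"token": "青岛", "position": 0},
    {"token": "上合", "position": 1},
    {"token": "蓝", "position": 2},
]
query_hit = [{"token": "上合", "position": 0}, {"token": "蓝", "position": 1}]
query_miss = [{"token": "青岛", "position": 0}, {"token": "蓝", "position": 1}]

print(phrase_match(doc, query_hit))
print(phrase_match(doc, query_miss))
```

The first query matches because "上合" and "蓝" are one position apart in both the query and the document; the second does not because "青岛" and "蓝" are two positions apart in the document.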

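Because match_phrase depends on analysis, a poor analyzer (for example, one with an incomplete dictionary) can produce different terms for the query than for the document, and if the terms differ there can be no match.

Note, however, that match_phrase does not work properly with the ik_max_word analyzer, because the terms that ik_max_word produces overlap. Here is an example.

First, analyze with ik_max_word: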

[
    'index' => 'article',
    'body' => [
        'analyzer' => 'ik_max_word',
        'text' => '青岛上合蓝',
    ]
]

This gives:

{
    "tokens": [{
        "token": "青岛",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 0
    }, {
        "token": "岛上",
        "start_offset": 1,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 1
    }, {
        "token": "岛",
        "start_offset": 1,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 2
    }, {
        "token": "上合",
        "start_offset": 2,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 3
    }, {
        "token": "蓝",
        "start_offset": 4,
        "end_offset": 5,
        "type": "CN_WORD",
        "position": 4
    }]
}

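From "岛上" and "岛" you can see that its terms overlap, which is completely different from ik_smart. The goal of ik_max_word is to produce as many term combinations as possible, so it is generally used in full-text search to improve recall.

Next we search with the following query: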

[
    'index' => 'article',
    'body' => [
        'analyzer' => 'ik_max_word',
        'text' => '青岛',
    ]
]

The analysis result:

{
    "tokens": [{
        "token": "青岛",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 0
    }, {
        "token": "岛",
        "start_offset": 1,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 1
    }]
}

In the document, "岛上" sits between "青岛" and "岛", whereas in the query the two terms are adjacent, so match_phrase does not match.
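The mismatch can be reproduced with a small Python sketch that compares term sequences ordered by position. It assumes one token per position, as in the ik_max_word output shown above; `terms_in_position_order` and `contains_consecutive` are hypothetical helpers, and the single-term query at the end is a made-up contrast case rather than real analyzer output.

```python
def terms_in_position_order(tokens):
    """List the terms sorted by their position field (assumes one token per position)."""
    return [t["token"] for t in sorted(tokens, key=lambda t: t["position"])]

def contains_consecutive(doc_terms, query_terms):
    """True if query_terms appears in doc_terms as a contiguous, ordered run."""
    n = len(query_terms)
    return any(doc_terms[i:i + n] == query_terms for i in range(len(doc_terms) - n + 1))

# ik_max_word tokens for the document "青岛上合蓝", as shown above.
doc_tokens = [
    {"token": "青岛", "position": 0},
    {"token": "岛上", "position": 1},
    {"token": "岛", "position": 2},
    {"token": "上合", "position": 3},
    {"token": "蓝", "position": 4},
]
# ik_max_word tokens for the query "青岛", as shown above.
query_tokens = [
    {"token": "青岛", "position": 0},
    {"token": "岛", "position": 1},
]

doc_terms = terms_in_position_order(doc_tokens)
query_terms = terms_in_position_order(query_tokens)

print(contains_consecutive(doc_terms, query_terms))
# Contrast case (assumption): a query that yields only the single term "青岛".
print(contains_consecutive(doc_terms, ["青岛"]))
```

The same text analyzed in a different context yields different positions for the overlapping terms, which is what breaks the adjacency requirement.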

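Finally, a conclusion:

If you use match_phrase, pay attention to two things: 1) an inaccurate analyzer hurts recall; 2) of the two IK analyzers, only ik_smart can be used.

The same reasoning applies to the ES default analyzer and other analyzers.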

Original article: https://yuerblog.cc/2018/09/13/ik-with-match_phrase


Source: https://www.cnblogs.com/cbugs/p/10788608.html
