Adding Chinese Word Segmentation to Elasticsearch and Comparing Analyzers

http://keenwon.com/1404.html

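Elasticsearch ships with many built-in analyzers, such as standard, english, and chinese. The standard analyzer naively splits text token by token (for Chinese, character by character), so it applies broadly but is imprecise. The english analyzer is smarter about English: it handles singular and plural forms and letter case, and it filters stopwords such as "the". The chinese analyzer performs poorly, as demonstrated later. This post covers three things: installing the ik Chinese analyzer, comparing the output of different analyzers, and arriving at a reasonably good configuration. I have also written two earlier posts about Elasticsearch, one on installing, running, and basic configuration and one on backup and restore, in case you need them.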

Installing the ik Chinese analyzer

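Elasticsearch's built-in Chinese analysis is poor, so we need to install ik. First download the project from GitHub and unzip it (the archive of the master branch unpacks into a directory with a -master suffix):
    cd /tmp
    wget https://github.com/medcl/elasticsearch-analysis-ik/archive/master.zip
    unzip master.zip
    cd elasticsearch-analysis-ik-master/
Then run mvn package to build the jar, elasticsearch-analysis-ik-1.4.0.jar.
    mvn package
Copy the jar into Elasticsearch's plugins/analysis-ik directory, then copy the extracted ik directory (configuration, dictionaries, and so on) into Elasticsearch's config directory. Next, edit the configuration file elasticsearch.yml and append one line:
    index.analysis.analyzer.ik.type : "ik"
Restart with service elasticsearch restart, and you are done.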

If the mvn build above does not work for you, you can take the prebuilt jar and configuration files directly from the elasticsearch-rtf project (which is what I did).

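[Update, evening of 2014-12-14, a Sunday: I installed the ik analyzer on my VPS following the same steps, but kept getting MapperParsingException[Analyzer [ik] not found for field [cn]]. That evening I went to the office and found that the Elasticsearch version on my office virtual machine was 1.3.2, while the VPS had 1.3.4. Guessing it was a version problem, I reinstalled the VPS with the latest version, 1.4.1, installed ik again, and it worked.]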

Preparation: creating the index and loading test data

First, some preparation for the analyzer comparison that follows. My Elasticsearch runs on a virtual machine at 192.168.159.159:9200, and I send HTTP requests directly with the Postman extension for Chrome. Step one is to create the index1 index:
    PUT http://192.168.159.159:9200/index1
    {
      "settings": {
        "refresh_interval": "5s",
        "number_of_shards": 1,      // one primary shard
        "number_of_replicas": 0     // no replicas; they can be added later
      },
      "mappings": {
        "_default_": {
          "_all": { "enabled": false }   // disable the _all field, since we only search title
        },
        "resource": {
          "dynamic": false,              // disable dynamic mapping
          "properties": {
            "title": {
              "type": "string",
              "index": "analyzed",
              "fields": {
                "cn": {
                  "type": "string",
                  "analyzer": "ik"
                },
                "en": {
                  "type": "string",
                  "analyzer": "english"
                }
              }
            }
          }
        }
      }
    }
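For readers who prefer a script to Postman, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not part of the original setup: the helper names build_index_body and create_index are hypothetical, and the host is the address used in this post. The // comments are dropped because json.dumps emits strict JSON.

```python
import json
import urllib.request

# Assumption: same host and port as in this post; adjust for your own node.
ES_URL = "http://192.168.159.159:9200"


def build_index_body():
    """Return the settings and mappings of index1 as a plain dict."""
    return {
        "settings": {
            "refresh_interval": "5s",
            "number_of_shards": 1,     # one primary shard
            "number_of_replicas": 0,   # no replicas
        },
        "mappings": {
            "_default_": {"_all": {"enabled": False}},
            "resource": {
                "dynamic": False,      # disable dynamic mapping
                "properties": {
                    "title": {
                        "type": "string",
                        "index": "analyzed",
                        "fields": {
                            "cn": {"type": "string", "analyzer": "ik"},
                            "en": {"type": "string", "analyzer": "english"},
                        },
                    }
                },
            },
        },
    }


def create_index(name="index1", base_url=ES_URL):
    """PUT the index definition; needs a reachable Elasticsearch 1.x node."""
    data = json.dumps(build_index_body()).encode("utf-8")
    req = urllib.request.Request(f"{base_url}/{name}", data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling create_index() sends the request; build_index_body() can be inspected or printed on its own without a running node.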
For convenience, index1 has a single shard and no replicas. It contains one type, resource, with a single field, title, which is all we need. title itself uses the standard analyzer, title.cn uses the ik analyzer, and title.en uses the built-in english analyzer. Next, add the test data in one batch with the bulk API:
    POST http://192.168.159.159:9200/_bulk
    { "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
    { "title": "周星驰最新电影" }
    { "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
    { "title": "周星驰最好看的新电影" }
    { "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
    { "title": "周星驰最新电影，最好，新电影" }
    { "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
    { "title": "最最最最好的新新新新电影" }
    { "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
    { "title": "I'm not happy about the foxes" }
Note that every line of the bulk body, including the last one, must end with a newline; otherwise the request fails.
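The newline rule is easy to get wrong when the body is assembled by hand, so here is a minimal Python sketch that generates it. The function name build_bulk_body is hypothetical; the index name, type name, and titles are the ones used in this post.

```python
import json

TITLES = [
    "周星驰最新电影",
    "周星驰最好看的新电影",
    "周星驰最新电影，最好，新电影",
    "最最最最好的新新新新电影",
    "I'm not happy about the foxes",
]


def build_bulk_body(titles, index="index1", doc_type="resource"):
    """Build a newline-delimited bulk body: one action line, then one source line."""
    lines = []
    for doc_id, title in enumerate(titles, start=1):
        action = {"create": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps({"title": title}, ensure_ascii=False))
    # The final line must be newline-terminated too.
    return "\n".join(lines) + "\n"


body = build_bulk_body(TITLES)
print(body, end="")
```

The resulting string, encoded as UTF-8, is what would be sent as the body of the POST to /_bulk.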

Comparisons

1. Comparing the ik, chinese, and standard analyzers
    POST http://192.168.159.159:9200/index1/_analyze?analyzer=ik
    联想召回笔记本电源线
Result with ik:
    {
      "tokens": [
        {
          "token": "联想",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "召回",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "笔记本",
          "start_offset": 4,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "电源线",
          "start_offset": 7,
          "end_offset": 10,
          "type": "CN_WORD",
          "position": 4
        }
      ]
    }
Result with the built-in chinese and standard analyzers (both give the same output):
    {
      "tokens": [
        {
          "token": "联",
          "start_offset": 0,
          "end_offset": 1,
          "type": "<IDEOGRAPHIC>",
          "position": 1
        },
        {
          "token": "想",
          "start_offset": 1,
          "end_offset": 2,
          "type": "<IDEOGRAPHIC>",
          "position": 2
        },
        {
          "token": "召",
          "start_offset": 2,
          "end_offset": 3,
          "type": "<IDEOGRAPHIC>",
          "position": 3
        },
        {
          "token": "回",
          "start_offset": 3,
          "end_offset": 4,
          "type": "<IDEOGRAPHIC>",
          "position": 4
        },
        {
          "token": "笔",
          "start_offset": 4,
          "end_offset": 5,
          "type": "<IDEOGRAPHIC>",
          "position": 5
        },
        {
          "token": "记",
          "start_offset": 5,
          "end_offset": 6,
          "type": "<IDEOGRAPHIC>",
          "position": 6
        },
        {
          "token": "本",
          "start_offset": 6,
          "end_offset": 7,
          "type": "<IDEOGRAPHIC>",
          "position": 7
        },
        {
          "token": "电",
          "start_offset": 7,
          "end_offset": 8,
          "type": "<IDEOGRAPHIC>",
          "position": 8
        },
        {
          "token": "源",
          "start_offset": 8,
          "end_offset": 9,
          "type": "<IDEOGRAPHIC>",
          "position": 9
        },
        {
          "token": "线",
          "start_offset": 9,
          "end_offset": 10,
          "type": "<IDEOGRAPHIC>",
          "position": 10
        }
      ]
    }
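The conclusion hardly needs stating: for Chinese, the built-in analyzers are very weak, since they simply split the text into single characters.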
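The two _analyze responses are long, so when comparing analyzers it helps to reduce each one to its list of token strings. Below is a minimal Python sketch of that; the helper names token_texts and offsets_consistent are hypothetical, and the sample data is a trimmed copy of the ik response shown above (the type and position fields are omitted).

```python
import json

TEXT = "联想召回笔记本电源线"

# Trimmed copy of the ik response from this post.
IK_RESPONSE = json.loads("""
{"tokens": [
  {"token": "联想",   "start_offset": 0, "end_offset": 2},
  {"token": "召回",   "start_offset": 2, "end_offset": 4},
  {"token": "笔记本", "start_offset": 4, "end_offset": 7},
  {"token": "电源线", "start_offset": 7, "end_offset": 10}
]}
""")


def token_texts(response):
    """Return just the token strings of an _analyze response."""
    return [t["token"] for t in response["tokens"]]


def offsets_consistent(text, response):
    """Check that every token equals the slice of the input its offsets point at."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"]
        for t in response["tokens"]
    )


ik_tokens = token_texts(IK_RESPONSE)
print(ik_tokens)                              # ['联想', '召回', '笔记本', '电源线']
print(offsets_consistent(TEXT, IK_RESPONSE))  # True
# The standard and chinese analyzers emit one token per character,
# which for this input is the same list as:
print(list(TEXT))
```

Seen this way the difference is four word tokens versus ten single-character tokens for the same ten-character input.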

2. Searching for the keywords "最新" (latest) and "fox"

Test method:
    POST http://192.168.159.159:9200/index1/resource/_search
    {
      "query": {
        "multi_match": {
          "type": "most_fields",
          "query": "最新",
          "fields": [ "title", "title.cn", "title.en" ]
        }
      }
    }
We change the query and fields values to compare the results.
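Since only query and fields change between runs, the request body can be produced by a small helper. This is a minimal Python sketch; the function name build_multi_match is hypothetical, while the body layout is the one from the request above.

```python
import json


def build_multi_match(query, fields):
    """Build the search body used in this post for a given query and field list."""
    return {
        "query": {
            "multi_match": {
                "type": "most_fields",
                "query": query,
                "fields": list(fields),
            }
        }
    }


# The four variations compared in this post.
cases = [
    ("最新", ["title.cn"]),
    ("最新", ["title", "title.en"]),
    ("fox", ["title", "title.cn"]),
    ("fox", ["title.en"]),
]
for query, fields in cases:
    print(json.dumps(build_multi_match(query, fields), ensure_ascii=False))
```

Each printed line is a body that can be POSTed to /index1/resource/_search.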

1) Searching for "最新" with fields restricted to title.cn (only the hits part is shown):
    "hits": [
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "1",
        "_score": 1.0537746,
        "_source": {
          "title": "周星驰最新电影"
        }
      },
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "3",
        "_score": 0.9057159,
        "_source": {
          "title": "周星驰最新电影，最好，新电影"
        }
      },
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "4",
        "_score": 0.5319481,
        "_source": {
          "title": "最最最最好的新新新新电影"
        }
      },
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "2",
        "_score": 0.33246756,
        "_source": {
          "title": "周星驰最好看的新电影"
        }
      }
    ]
Searching for "最新" again, this time with fields restricted to title and title.en (only the hits part is shown):
    "hits": [
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "4",
        "_score": 1,
        "_source": {
          "title": "最最最最好的新新新新电影"
        }
      },
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "1",
        "_score": 0.75,
        "_source": {
          "title": "周星驰最新电影"
        }
      },
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "3",
        "_score": 0.70710677,
        "_source": {
          "title": "周星驰最新电影，最好，新电影"
        }
      },
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "2",
        "_score": 0.625,
        "_source": {
          "title": "周星驰最好看的新电影"
        }
      }
    ]
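Conclusion: without the ik analyzer, "最新" is treated as two independent characters, so search precision is low. That is why document 4, which never contains the word "最新" but repeats both characters many times, ranks first here.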
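The effect can be imitated offline with a toy model. The sketch below is an illustration under simplifying assumptions, not how Elasticsearch scores documents: char_tokens is a hypothetical stand-in for per-character tokenization of Chinese text, and the whole-word check is a plain substring test rather than real ik segmentation.

```python
TITLES = {
    1: "周星驰最新电影",
    2: "周星驰最好看的新电影",
    3: "周星驰最新电影，最好，新电影",
    4: "最最最最好的新新新新电影",
    5: "I'm not happy about the foxes",
}


def char_tokens(text):
    """Toy per-character tokenizer: one token per CJK ideograph."""
    return {ch for ch in text if "\u4e00" <= ch <= "\u9fff"}


def match_by_chars(query, titles):
    """Ids of titles sharing at least one character token with the query."""
    q = char_tokens(query)
    return sorted(i for i, t in titles.items() if q & char_tokens(t))


def match_by_word(query, titles):
    """Ids of titles containing the query as one contiguous word."""
    return sorted(i for i, t in titles.items() if query in t)


print(match_by_chars("最新", TITLES))  # [1, 2, 3, 4]
print(match_by_word("最新", TITLES))   # [1, 3]
```

With single-character terms every Chinese title is a candidate, whereas only documents 1 and 3 contain the actual word, and those are the two that ik ranks at the top in the title.cn results above.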

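2) Searching for "fox" with fields restricted to title and title.cn returns nothing: to those two analyzers, fox and foxes are different terms. Searching for "fox" again with fields restricted to title.en gives: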
    "hits": [
      {
        "_index": "index1",
        "_type": "resource",
        "_id": "5",
        "_score": 0.9581454,
        "_source": {
          "title": "I'm not happy about the foxes"
        }
      }
    ]
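Conclusion: the Chinese (ik) and standard analyzers do no processing of English words, such as reducing plurals to their singular form, so recall is low.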

My best configuration

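In fact, the index created at the beginning is already the best configuration: adding the cn and en sub-fields under title gives better results for Chinese, English, and whatever other languages turn up. As described earlier, title uses the standard analyzer, title.cn uses the ik analyzer, and title.en uses the built-in english analyzer, and every search covers all three fields at once.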
Original address: https://www.cnblogs.com/valor-xh/p/6143399.html
