一、Analyzers
1、Purpose: ① word segmentation (splitting text into tokens)
② normalization (improves recall, i.e. the proportion of relevant results a search can actually find; see the sketch below)
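For example, a minimal sketch of normalization using the built-in standard analyzer (the sample text is illustrative). Because the analyzer lowercases tokens, a search for quick also matches documents containing Quick:
GET _analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown Fox"
}
This returns the normalized tokens the, quick, brown, fox.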
2、Analyzer components
① character filter: preprocessing before tokenization (strips useless characters and HTML tags, and applies conversions such as & => and, 《Elasticsearch》 => Elasticsearch)
A、HTML Strip Character Filter: html_strip
escaped_tags: HTML tags that should be kept rather than stripped
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
Test the analyzer:
GET my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "liuyucheng <a><b>edu</b></a>"
}
B、Mapping Character Filter: type mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0", "١ => 1", "٢ => 2", "٣ => 3", "٤ => 4",
            "٥ => 5", "٦ => 6", "٧ => 7", "٨ => 8", "٩ => 9"
          ]
        }
      }
    }
  }
}
Test the analyzer:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}
C、Pattern Replace Character Filter: regex replacement, type pattern_replace
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}
Test the analyzer:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}
② tokenizer: splits the character stream into individual tokens (see the sketch below)
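For example, a minimal sketch comparing two built-in tokenizers on the same text (the sample text is illustrative):
GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Brown-Foxes don't jump."
}
GET _analyze
{
  "tokenizer": "standard",
  "text": "Brown-Foxes don't jump."
}
whitespace splits only on spaces, keeping Brown-Foxes and jump. intact, while standard also splits on the hyphen and drops the trailing punctuation.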
③ token filter: tense normalization, case conversion, synonym expansion, removal of stop/filler words, etc.
e.g. has => have, him => he, apples => apple; the/oh/a => dropped
A、Case: the lowercase token filter
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "THE Quick FoX JUMPs"
}
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": ["lowercase"],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": "THE QUICK BROWN FOX"
}
B、Stop words: stopwords
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Teacher Ma is in the restroom"
}
C、Tokenizer: standard
GET /my_index/_analyze
{
  "text": "江山如此多娇,小姐姐哪里可以撩",
  "analyzer": "standard"
}
D、Custom analyzer: set type to custom to tell Elasticsearch that we are defining a custom analyzer. Compare this with how a built-in analyzer is configured: there, type is set to the name of the built-in analyzer, such as standard or simple.
PUT /test_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "test_char_filter": {
          "type": "mapping",
          "mappings": ["& => and", "| => or"]
        }
      },
      "filter": {
        "test_stopwords": {
          "type": "stop",
          "stopwords": ["is", "in", "at", "the", "a", "for"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "test_char_filter"],
          "tokenizer": "standard",
          "filter": ["lowercase", "test_stopwords"]
        }
      }
    }
  }
}
GET /test_analysis/_analyze
{
  "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
  "analyzer": "my_analyzer"
}
E、Specify the analyzer when creating a mapping
PUT /test_analysis/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
二、Chinese Analyzers
(1) Chinese analyzers:
① IK analysis: the ES installation path must not contain Chinese characters or spaces
1) Download: https://github.com/medcl/elasticsearch-analysis-ik
2) Create the plugin folder: cd your-es-root/plugins/ && mkdir ik
3) Unzip the plugin into your-es-root/plugins/ik
4) Restart ES
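As an alternative to steps 2–4, the plugin can be installed with the elasticsearch-plugin tool bundled with ES (a sketch; the version in the URL is an assumption and must match your ES version exactly):
# run from your-es-root; replace 7.9.3 with your exact ES version
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.9.3/elasticsearch-analysis-ik-7.9.3.zip
Restart ES afterwards so the plugin is picked up.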
② Two analyzers (compared in the sketch after this list)
1) ik_max_word: fine-grained segmentation
2) ik_smart: coarse-grained segmentation
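A quick comparison of the two granularities (assumes the IK plugin is installed; the sample sentence is illustrative):
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}
ik_max_word emits every plausible sub-word (e.g. 中华人民共和国, 中华人民, 中华, 人民, 共和国, 国歌, ...), while ik_smart returns only the coarsest split (中华人民共和国, 国歌).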
③ IK file layout
1) IKAnalyzer.cfg.xml: the IK analysis configuration file
2) Main dictionary: main.dic
3) English stop words: stopword.dic; stop words are never written into the inverted index
4) Special dictionaries:
- quantifier.dic: measure words and units
- suffix.dic: suffixes
- surname.dic: Chinese surnames
- preposition.dic: function words and particles
5) Custom dictionaries: e.g. current slang such as 857, emmm..., 渣女, 舔屏, 996
6) Hot updates, either by:
- modifying the IK analyzer source code, or
- using IK's native hot-update support: deploy a web server that exposes an HTTP endpoint and signals dictionary changes through the Last-Modified and ETag HTTP response headers (see the config sketch after this list)
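A sketch of IKAnalyzer.cfg.xml wiring up both a local custom dictionary (item 5 above) and a remote hot-update endpoint (item 6); the file paths and URL are placeholder assumptions:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- local custom dictionary, path relative to the config directory (placeholder) -->
    <entry key="ext_dict">custom/my_words.dic</entry>
    <!-- local custom stop-word file (placeholder) -->
    <entry key="ext_stopwords">custom/my_stopwords.dic</entry>
    <!-- remote dictionary for hot updates; the server must return Last-Modified or ETag headers (placeholder URL) -->
    <entry key="remote_ext_dict">http://example.com/hot_words.dic</entry>
</properties>
IK polls the remote URL periodically (roughly once a minute) and reloads the dictionary whenever Last-Modified or ETag changes; each line of the response body is one word.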