analyzer
Analyzers are used in two situations:
1. Index time analysis: when a document is created or updated, its text fields are analyzed.
2. Search time analysis: when a query is executed, the query string is analyzed.
指定查询时使用哪个分词器的方式有:
- 查询时通过analyzer指定分词器
GET test_index/_search { "query": { "match": { "name": { "query": "lin", "analyzer": "standard" } } } }
- 创建index mapping时指定search_analyzer
PUT test_index { "mappings": { "doc": { "properties": { "title":{ "type": "text", "analyzer": "whitespace", "search_analyzer": "standard" } } } } }
The analyzer used at index time is specified per field via the analyzer parameter in the index mapping.
# If no analyzer is specified, the default standard analyzer is used
PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "whitespace"   # specify the analyzer; ES ships with several built-in analyzers
        }
      }
    }
  }
}
Note:
- Decide explicitly whether each field needs to be analyzed. For fields that do not, set the type to keyword, which saves space and improves write performance (a sketch follows below).
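As a minimal sketch of that note (the status field and its use are hypothetical, not from the original text), a field mapped as keyword is stored as a single un-analyzed term, suitable for exact matching, sorting, and aggregations:
PUT test_index
{
  "mappings": {
    "doc": {
      "properties": {
        "status": {            # hypothetical field holding exact values such as "published"
          "type": "keyword"    # not analyzed; indexed as a single term
        }
      }
    }
  }
}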
_analyze API
# View how the standard analyzer tokenizes the given text
GET _analyze
{
  "analyzer": "standard",
  "text": "this is a test"
}
{ "tokens": [ { "token": "this", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0 }, { "token": "is", "start_offset": 5, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 }, { "token": "a", "start_offset": 8, "end_offset": 9, "type": "<ALPHANUM>", "position": 2 }, { "token": "test", "start_offset": 10, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 } ] }
Setting the analyzer
PUT test
{
  "settings": {
    "analysis": {                   # custom analysis settings
      "analyzer": {                 # keyword
        "my_analyzer": {            # custom analyzer
          "type": "standard",       # analyzer type: standard
          "stopwords": "_english_"  # a parameter of the standard analyzer; the default stopwords is _none_
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "standard",   # the my_text field uses the standard analyzer
          "fields": {
            "english": {            # the my_text.english field uses the custom my_analyzer defined above
              "type": "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
}

POST test/_analyze
{
  "field": "my_text",               # my_text uses the standard analyzer
  "text": ["The test message."]
}
--------------> [the, test, message]

POST test/_analyze
{
  "field": "my_text.english",       # my_text.english uses the my_analyzer analyzer
  "text": ["The test message."]
}
------------> [test, message]
ES ships with many built-in analyzers, for example:
- standard, which consists of
  - tokenizer: Standard Tokenizer
  - token filters: Standard Token Filter, Lower Case Token Filter, Stop Token Filter
Test it with the _analyze API:
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
The result is:
{
  "tokens": [
    { "token": "the",    "start_offset": 0,  "end_offset": 3,  "type": "<ALPHANUM>", "position": 0 },
    { "token": "2",      "start_offset": 4,  "end_offset": 5,  "type": "<NUM>",      "position": 1 },
    { "token": "quick",  "start_offset": 6,  "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "brown",  "start_offset": 12, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "foxes",  "start_offset": 18, "end_offset": 23, "type": "<ALPHANUM>", "position": 4 },
    { "token": "jumped", "start_offset": 24, "end_offset": 30, "type": "<ALPHANUM>", "position": 5 },
    { "token": "over",   "start_offset": 31, "end_offset": 35, "type": "<ALPHANUM>", "position": 6 },
    { "token": "the",    "start_offset": 36, "end_offset": 39, "type": "<ALPHANUM>", "position": 7 },
    { "token": "lazy",   "start_offset": 40, "end_offset": 44, "type": "<ALPHANUM>", "position": 8 },
    { "token": "dog's",  "start_offset": 45, "end_offset": 50, "type": "<ALPHANUM>", "position": 9 },
    { "token": "bone",   "start_offset": 51, "end_offset": 55, "type": "<ALPHANUM>", "position": 10 }
  ]
}
- whitespace: splits on whitespace
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
--> [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
- simple: lowercases the text and splits on non-letter characters
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---> [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
- stop: removes stop words; the default stopwords set is _english_
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
--> [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
Optional parameters (see the sketch below):
# stopwords
# stopwords_path
- keyword: performs no tokenization
POST _analyze
{
  "analyzer": "keyword",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
yields "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", i.e. the whole sentence as a single token
==================================================================================
Third-party analyzer plugins --- Chinese analysis (the ik analyzer)
ES ships with many analyzers, but they are not well suited to Chinese: the standard analyzer, for example, splits a Chinese sentence into individual characters. In that case you can use a third-party analyzer plugin such as ik or pinyin. The example here uses ik.
1. First install the plugin, then restart ES:
# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
# /etc/init.d/elasticsearch restart
2. Usage example:
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "你好吗?我有一句话要对你说呀。"
}
Analysis result:
{
  "tokens": [
    { "token": "你好",   "start_offset": 0,  "end_offset": 2,  "type": "CN_WORD",   "position": 0 },
    { "token": "好吗",   "start_offset": 1,  "end_offset": 3,  "type": "CN_WORD",   "position": 1 },
    { "token": "我",     "start_offset": 4,  "end_offset": 5,  "type": "CN_CHAR",   "position": 2 },
    { "token": "有",     "start_offset": 5,  "end_offset": 6,  "type": "CN_CHAR",   "position": 3 },
    { "token": "一句话", "start_offset": 6,  "end_offset": 9,  "type": "CN_WORD",   "position": 4 },
    { "token": "一句",   "start_offset": 6,  "end_offset": 8,  "type": "CN_WORD",   "position": 5 },
    { "token": "一",     "start_offset": 6,  "end_offset": 7,  "type": "TYPE_CNUM", "position": 6 },
    { "token": "句话",   "start_offset": 7,  "end_offset": 9,  "type": "CN_WORD",   "position": 7 },
    { "token": "句",     "start_offset": 7,  "end_offset": 8,  "type": "COUNT",     "position": 8 },
    { "token": "话",     "start_offset": 8,  "end_offset": 9,  "type": "CN_CHAR",   "position": 9 },
    { "token": "要对",   "start_offset": 9,  "end_offset": 11, "type": "CN_WORD",   "position": 10 },
    { "token": "你",     "start_offset": 11, "end_offset": 12, "type": "CN_CHAR",   "position": 11 },
    { "token": "说呀",   "start_offset": 12, "end_offset": 14, "type": "CN_WORD",   "position": 12 }
  ]
}
Reference: https://github.com/medcl/elasticsearch-analysis-ik
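Assuming the ik plugin above is installed, a hedged sketch of using it in an index mapping might look like the following (the index and field names are made up; ik_smart is the plugin's coarser-grained companion analyzer):
PUT ik_index
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {                       # hypothetical field holding Chinese text
          "type": "text",
          "analyzer": "ik_max_word",       # fine-grained analysis at index time
          "search_analyzer": "ik_smart"    # coarser-grained analysis at search time
        }
      }
    }
  }
}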
You can also assemble your own analyzer (a custom analyzer) from the built-in character filters, tokenizers, and token filters.
- custom: a custom analyzer is composed of
  - zero or more character filters
  - exactly one tokenizer
  - zero or more token filters
PUT t_index { "settings": { "analysis": { "analyzer": { "my_analyzer":{ "type":"custom", "tokenizer":"standard", "char_filter":["html_strip"], "filter":["lowercase"] } } } } } POST t_index/_analyze { "analyzer": "my_analyzer", "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's <b> bone.</b>"] } 得到:[the,2,quick,brown,foxes,jumped,over,the,lazy,dog's,bone]
Custom analyzers
A custom analyzer is defined in the index settings, as shown below:
PUT test_index
{
  "settings": {
    "analysis": {          # analysis settings, can be customized
      "char_filter": {},   # char_filter keyword
      "tokenizer": {},     # tokenizer keyword
      "filter": {},        # filter keyword
      "analyzer": {}       # analyzer keyword
    }
  }
}
character filter: processes the raw text before the tokenizer, for example adding, removing, or replacing characters.
This affects the position and offset information produced by the subsequent tokenizer.
html_strip: removes HTML tags and decodes HTML entities
(1) Parameter: escaped_tags, tags that should not be stripped
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": ["<p>I'm so <b>happy</b>!</p>"]
}
yields:
"token": """
I'm so happy!
"""

# Configuration example
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {                  # keyword
        "my_analyzer": {             # custom analyzer
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {               # keyword
        "my_char_filter": {          # custom char_filter
          "type": "html_strip",
          "escaped_tags": ["b"]      # array of HTML tags that should not be stripped from the text
        }
      }
    }
  }
}

POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["<p>I'm so <b>happy</b>!</p>"]
}
yields:
"token": """
I'm so <b>happy</b>!
"""
mapping: replaces characters or strings according to a set of key => value mappings; exactly one of the following parameters must be provided
(1) mappings: a list of mappings, each in the form key => value
(2) mappings_path: a path to a file of key => value mappings, either absolute or relative to the config directory
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {                # keyword
        "my_analyzer": {           # custom analyzer
          "tokenizer": "standard",
          "char_filter": "my_char_filter"
        }
      },
      "char_filter": {             # keyword
        "my_char_filter": {        # custom char_filter
          "type": "mapping",
          "mappings": [            # the mappings to apply
            ":)=>happy",
            ":(=>sad"
          ]
        }
      }
    }
  }
}

POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["i am so :)"]
}
yields [i, am, so, happy]
pattern_replace (see the configuration sketch after the parameter list)
(1) pattern: a regular expression
(2) replacement: the replacement string; may use $1..$9 to reference capture groups
(3) flags: regular expression flags
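A hedged configuration sketch of these parameters (the index name, analyzer name, regex, and sample text below are made up for illustration): the pattern joins digit groups separated by hyphens with underscores so they survive the standard tokenizer as one term.
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {                     # hypothetical analyzer name
          "tokenizer": "standard",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {                  # hypothetical char_filter name
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",       # a digit group followed by a hyphen and another digit
          "replacement": "$1_"               # keep the digits, replace the hyphen with an underscore
        }
      }
    }
  }
}

POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["My credit card is 123-456-789"]
}
This should produce roughly [ My, credit, card, is, 123_456_789 ].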
tokenizer: splits the raw text into terms according to certain rules
standard ------- parameter: max_token_length, the maximum token length, default 255
PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
yields [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]
# "jumped" has length 6, so it is split at position 5
letter: splits into terms on non-letter characters
POST _analyze
{
  "tokenizer": "letter",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
yields [ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
lowercase: same as the letter tokenizer, but also converts the terms to lowercase
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
yields [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
whitespace: splits into terms on whitespace characters ---- parameter: max_token_length (a configuration sketch follows)
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
yields [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
keyword: a no-op tokenizer that outputs the input text unchanged as a single term ----- parameter: buffer_size, the number of characters read into the term buffer in a single pass, default 256
POST _analyze
{
  "tokenizer": "keyword",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."]
}
yields "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.", the complete text as one term
token filter: adds, removes, or modifies the terms produced by the tokenizer ---- lowercase: converts the terms to lowercase
POST _analyze
{
  "filter": ["lowercase"],
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"]
}
---> "token": "the 2 quick brown-foxes jumped over the lazy dog's bone"

PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": "lowercase"
        }
      }
    }
  }
}

POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone"]
}
stop: removes stop words from the token stream.
Parameters:
# stopwords: the stop words to use, default _english_
# stopwords_path
# ignore_case: if set to true, stop words are matched case-insensitively, default false
# remove_trailing

PUT t_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": "my_filter"
        }
      },
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": ["and", "or", "not"]
        }
      }
    }
  }
}

POST t_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["lucky and happy not sad"]
}
--------------> [lucky, happy, sad]