A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
For example, the whitespace tokenizer splits text whenever it encounters whitespace; it would break "I am zyn" into [I, am, zyn].
The tokenizer is also responsible for recording the order, or position, of each term (used for phrase and word proximity queries), and the start and end character offsets of the original word each term represents (used for highlighting matched text). Elasticsearch ships with a number of built-in tokenizers, which can be used to build custom analyzers.
Tokenizer reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.6/analysis.html
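As a quick illustration of building a custom analyzer from a built-in tokenizer, something like the following should work. This is only a minimal sketch: the index name my_custom_index, the analyzer name my_analyzer, and the lowercase filter are placeholders I picked, not anything required by ES.

PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

POST my_custom_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Hello the world."
}

The custom analyzer reuses the built-in standard tokenizer and just lowercases the tokens, so it should return hello, the, and world.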
The standard tokenizer (standard) splits text on word boundaries, which for English text essentially means whitespace and punctuation:

POST _analyze
{
  "tokenizer": "standard",
  "text": "Hello the world."
}

Result:

{
  "tokens" : [
    { "token" : "Hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "the", "start_offset" : 6, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "world", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 2 }
  ]
}
However, the tokenizers that ship with ES are all geared toward English; for Chinese you need to install a dedicated tokenizer.
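To see the problem, try running the standard tokenizer over a Chinese sentence; it should split the text into single characters (我 / 是 / 中 / 国 / 人) rather than actual words, which is not useful for Chinese search:

POST _analyze
{
  "tokenizer": "standard",
  "text": "我是中国人"
}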
IK tokenizer: https://github.com/medcl/elasticsearch-analysis-ik/releases
Check the ES version first so you can install the matching version of the IK tokenizer:
[vagrant@10 ~]$ curl http://192.168.56.10:9200/
{
  "name" : "3cafb1a4b1b3",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "0cNA2l38RFK6LMHislSvNg",
  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date" : "2019-10-28T20:40:44.881551Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
The v7.4.2 release is no longer visible on the GitHub releases page, but pasting the download URL in directly still downloads it:
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
After downloading, unzip it into the plugins directory inside the ES container (or into the host directory mapped to it) and it is ready to use.
For a free, legitimate download of Xshell and Xftp, see: https://www.cnblogs.com/qingshan-tang/p/12855807.html
My download was way too slow, though, so I gave up and went back to installing from the command line!
Connect to the VM with vagrant ssh, switch to root with su root (entering the root password, vagrant, when prompted), and install wget:
vagrant ssh
su root
vagrant
yum install wget
Once wget is installed, go to the ES plugins directory and download IK:

[root@10 /]# cd /mydata/elasticsearch/plugins/
[root@10 plugins]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip

Install the unzip command, create an ik directory, unzip the archive into it, and then delete the zip:

yum install unzip
mkdir ik
unzip elasticsearch-analysis-ik-7.4.2.zip -d ik
rm elasticsearch-analysis-ik-7.4.2.zip

Make the ik folder readable, writable, and executable:
[root@10 plugins]# chmod -R 777 ik/
Check whether the IK tokenizer was installed successfully.
Enter the ES container:
docker exec -it 3caf /bin/bash
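Inside the container, one way to check is to list the installed plugins with the bundled elasticsearch-plugin tool (the path below assumes the official Docker image layout, where ES lives under /usr/share/elasticsearch):

cd /usr/share/elasticsearch/bin
./elasticsearch-plugin list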
If ik shows up in the list, the IK tokenizer was installed successfully; then exit the container and restart ES.
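For example (3caf being the same container ID prefix used in the docker exec command above):

exit
docker restart 3caf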
Try it out.
ik_smart: smart (coarse-grained) segmentation

POST _analyze
{
  "tokenizer": "ik_smart",
  "text": "我是中国人"
}

Result:

{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 }
  ]
}
ik_max_word: finest-grained segmentation (generates the maximum number of word combinations)

POST _analyze
{
  "tokenizer": "ik_max_word",
  "text": "我是中国人"
}

Result:

{
  "tokens" : [
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "是", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 },
    { "token" : "中国", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 3 },
    { "token" : "国人", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 4 }
  ]
}