Installing the IK Chinese Analysis Plugin for Elasticsearch (Part 4)

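I. Introduction to IK

IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IK Analyzer has gone through four major versions. It began as a Chinese segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar analysis algorithms. Starting with version 3.0, IK became a general-purpose segmentation component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene. The 2012 version added a simple segmentation disambiguation algorithm, marking IK's evolution from purely dictionary-based segmentation toward simulated semantic segmentation.

Features of IK Analyzer 2012:

Uses a distinctive "forward iterative finest-granularity segmentation algorithm" and supports two segmentation modes: fine-grained and smart.
In a test on an ordinary PC (Core2 i7 3.4 GHz dual-core, 4 GB RAM, 64-bit Windows 7, 64-bit Sun JDK 1.6_29), IK 2012 reached a processing speed of 1.6 million characters per second (3000 KB/s).
The smart segmentation mode of the 2012 version supports simple disambiguation and merges numerals with their measure words in the output.
Uses a multi-sub-processor analysis mode that handles English letters, digits, and Chinese words, and is compatible with Korean and Japanese characters.
Optimized dictionary storage with a smaller memory footprint. User-defined dictionary extensions are supported. Notably, in the 2012 version the dictionary supports terms that mix Chinese, English, and digits.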

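II. Setting Up the Build Environment

The IK analyzer downloaded from GitHub is a source package, so it has to be compiled with Maven.

1. Download Maven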

# wget http://mirrors.hust.edu.cn/apache/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz

2. Extract it to /usr/local/, the location that MAVEN_HOME points to in the next step

# tar zxf apache-maven-3.3.9-bin.tar.gz -C /usr/local/

3. Configure the environment variables

# vi /etc/profile
    export MAVEN_HOME=/usr/local/apache-maven-3.3.9
    export PATH=$PATH:$MAVEN_HOME/bin
# source /etc/profile
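
Before building, it can help to confirm that the directory exported as MAVEN_HOME really contains the Maven launcher. The helper below is only a minimal sketch; check_maven_home is a hypothetical name introduced for illustration and is not part of Maven.

```shell
# Hypothetical helper (not part of Maven): check that a directory looks
# like an unpacked Maven distribution, i.e. it holds an executable bin/mvn.
check_maven_home() {
    if [ -x "$1/bin/mvn" ]; then
        echo "ok: $1/bin/mvn"
    else
        echo "not found: $1/bin/mvn" >&2
        return 1
    fi
}

# Usage after `source /etc/profile`:
#   check_maven_home "$MAVEN_HOME" && mvn -v
```

If the check fails, the extraction directory and the MAVEN_HOME value most likely do not match.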

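III. Installing the IK Analysis Plugin

1. Download

Download the IK version that matches your Elasticsearch version from GitHub: https://github.com/medcl/elasticsearch-analysis-ik. Alternatively, fetch the analyzer source with git clone https://github.com/medcl/elasticsearch-analysis-ik.

2. Unpack and compile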

# unzip elasticsearch-analysis-ik-master.zip
# cd elasticsearch-analysis-ik-master/
# mvn clean package
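
The version number in the name of the generated archive depends on the source that was checked out, so it may differ from the 1.9.3 used in the next step. As a rough sketch, the hypothetical helper find_release_zip below (a name made up for this example) prints whichever release archive the build produced under target/releases.

```shell
# Hypothetical helper: print the first release archive that
# `mvn clean package` left under <project>/target/releases.
find_release_zip() {
    for f in "$1"/target/releases/elasticsearch-analysis-ik-*.zip; do
        # When nothing matches, the pattern stays literal and -f is false.
        if [ -f "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    echo "no release zip under $1/target/releases" >&2
    return 1
}

# Usage from inside the source directory:
#   find_release_zip .
```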

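3. Copy the compiled IK plugin into the Elasticsearch plugin directory (here $elasticsearch stands for the Elasticsearch installation directory)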

# mkdir $elasticsearch/plugins/ik
# cp target/releases/elasticsearch-analysis-ik-1.9.3.zip $elasticsearch/plugins/ik/
# cd $elasticsearch/plugins/ik/
# unzip elasticsearch-analysis-ik-1.9.3.zip
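
A quick sanity check of the unpacked plugin directory might look like the sketch below. It assumes the Elasticsearch 2.x plugin layout, in which the plugin directory contains a plugin-descriptor.properties file next to the jar files; check_ik_plugin_dir is a hypothetical helper name, not an Elasticsearch command.

```shell
# Hypothetical helper; assumes the Elasticsearch 2.x plugin layout
# (plugin-descriptor.properties plus one or more jars in the plugin dir).
check_ik_plugin_dir() {
    if [ ! -f "$1/plugin-descriptor.properties" ]; then
        echo "missing: $1/plugin-descriptor.properties" >&2
        return 1
    fi
    if ! ls "$1"/*.jar >/dev/null 2>&1; then
        echo "missing: no .jar files in $1" >&2
        return 1
    fi
    echo "ok: $1"
}

# Usage:
#   check_ik_plugin_dir "$elasticsearch/plugins/ik"
# Once the node has been restarted, the plugin should also be listed by:
#   curl -s 'http://localhost:9200/_cat/plugins?v'
```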

4. Restart Elasticsearch so that the IK plugin takes effect

# /etc/init.d/elasticsearch restart

IV. Testing IK Segmentation

1. Create an index named "index"

# curl -XPUT http://localhost:9200/index

2. Create a mapping for "index"

# curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
    "fulltext": {
        "_all": {
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'
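
With the mapping in place, the content field can be exercised by indexing a document and running a match query with highlighting. The sketch below only builds the request body; build_search_body is a hypothetical helper, and it assumes the keyword contains no double quotes or backslashes, because it does no JSON escaping.

```shell
# Hypothetical helper: print a search body that runs a match query on the
# "content" field and asks for highlighted fragments of that field.
# Assumption: the keyword needs no JSON escaping (no quotes or backslashes).
build_search_body() {
    printf '{"query":{"match":{"content":"%s"}},"highlight":{"fields":{"content":{}}}}\n' "$1"
}

# Usage against a running node (index and type names as in the mapping):
#   curl -XPOST http://localhost:9200/index/fulltext/1 -d '{"content":"中华人民共和国国歌"}'
#   build_search_body 国歌 | curl -XPOST http://localhost:9200/index/fulltext/_search -d @-
```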

3. Test (10.10.10.26 is the address of the Elasticsearch node used here; substitute your own host)

# curl 'http://10.10.10.26:9200/index/_analyze?analyzer=ik&pretty=true' -d '{"text":"中华人民共和国国歌"}'

The output is as follows:

{
  "tokens" : [ {
    "token" : "中华人民共和国",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "中华人民",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "中华",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "华人",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "人民共和国",
    "start_offset" : 2,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "人民",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "共和国",
    "start_offset" : 4,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 6
  }, {
    "token" : "共和",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 7
  }, {
    "token" : "国",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 8
  }, {
    "token" : "国歌",
    "start_offset" : 7,
    "end_offset" : 9,
    "type" : "CN_WORD",
    "position" : 9
  } ]
}
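
To skim only the token strings from a response like the one above, a small text filter is often enough. This is a rough sketch that relies on grep and sed rather than a real JSON parser, so it assumes token values contain no escaped double quotes; list_tokens is a hypothetical name.

```shell
# Hypothetical helper: read an _analyze response on stdin and print one
# token per line. Plain text matching, not a JSON parser.
list_tokens() {
    grep -o '"token" *: *"[^"]*"' | sed 's/^"token" *: *"//; s/"$//'
}

# Usage against a running node, here with the ik_smart analyzer so the
# coarser output can be compared with the fine-grained result above:
#   curl -s 'http://localhost:9200/index/_analyze?analyzer=ik_smart' \
#        -d '{"text":"中华人民共和国国歌"}' | list_tokens
```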

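GitHub repository of elasticsearch-analysis-ik: https://github.com/medcl/elasticsearch-analysis-ik

Author: Orgliny
Source: https://www.cnblogs.com/Orgliny
This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must be given in a prominent position on the page.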

Original article: https://www.cnblogs.com/Orgliny/p/5520292.html