  • Elasticsearch (v2.4.6): installing the IK Chinese analysis plugin


    1. References

    The ik plugin's GitHub documentation

    Switching the Maven repository to the domestic Aliyun mirror
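    The mirror switch mentioned above is a one-time Maven configuration. A minimal sketch of it, placed in ~/.m2/settings.xml (the id and name values are arbitrary labels chosen here):

    ```xml
    <settings>
      <mirrors>
        <!-- route requests for the central repository through the Aliyun mirror -->
        <mirror>
          <id>aliyunmaven</id>
          <name>Aliyun public mirror</name>
          <url>https://maven.aliyun.com/repository/public</url>
          <mirrorOf>central</mirrorOf>
        </mirror>
      </mirrors>
    </settings>
    ```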

    2. Build and install analysis-ik

    2.1 Download the source

    git clone --depth 1 --branch v1.10.6 https://github.com/medcl/elasticsearch-analysis-ik.git

    ES 2.4.6 pairs with ik v1.10.6, so only that release tag is cloned (hence --depth 1 --branch v1.10.6).

    2.2 Build

    (1) Download and install Maven

    # download the binary distribution
    wget https://mirror.olnevhost.net/pub/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
    
    # create the install directory and unpack
    mkdir /usr/local/maven
    
    tar -zxvf apache-maven-3.6.3-bin.tar.gz --directory /usr/local/maven
    
    # set environment variables (append these lines to /etc/profile so they persist)
    export JAVA_HOME=/home/java/jdk1.8.0_131
    export MAVEN_HOME=/usr/local/maven/apache-maven-3.6.3
    export PATH=$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin
    
    # reload the profile
    source /etc/profile
    
    # check the version
    mvn --version
    
    

    (2) Build ik

    # build
    cd elasticsearch-analysis-ik/
    
    mvn package
    
    # copy the packaged plugin into the plugins directory
    
    mkdir -p /home/elastic/elasticsearch-2.4.6/plugins/ik/
    
    cd target/releases/
    
    cp elasticsearch-analysis-ik-1.10.6.zip /home/elastic/elasticsearch-2.4.6/plugins/ik/
    
    # unzip inside the plugin's own directory (each plugin lives in its own subdirectory under plugins/)
    cd /home/elastic/elasticsearch-2.4.6/plugins/ik/
    
    unzip elasticsearch-analysis-ik-1.10.6.zip
    
    

    2.3 Restart the Elasticsearch service
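    After the restart, loading of the plugin can be verified with the _cat/plugins API (assuming the default port 9200, as in the tests below):

    ```
    # Request
    GET http://127.0.0.1:9200/_cat/plugins
    ```

    The node's plugin list should include an analysis-ik entry; if it does not, check the Elasticsearch startup log for plugin loading errors.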

    3. Test IK tokenization

    3.1 Built-in Chinese tokenization (default analyzer)

    # Request
    
    GET http://127.0.0.1:9200/_analyze
    {
    	"text": "正是江南好风景"
    }
    
    # Response
    {
      "tokens": [
        {
          "token": "正",
          "start_offset": 0,
          "end_offset": 1,
          "type": "<IDEOGRAPHIC>",
          "position": 0
        },
        {
          "token": "是",
          "start_offset": 1,
          "end_offset": 2,
          "type": "<IDEOGRAPHIC>",
          "position": 1
        },
        {
          "token": "江",
          "start_offset": 2,
          "end_offset": 3,
          "type": "<IDEOGRAPHIC>",
          "position": 2
        },
        {
          "token": "南",
          "start_offset": 3,
          "end_offset": 4,
          "type": "<IDEOGRAPHIC>",
          "position": 3
        },
        {
          "token": "好",
          "start_offset": 4,
          "end_offset": 5,
          "type": "<IDEOGRAPHIC>",
          "position": 4
        },
        {
          "token": "风",
          "start_offset": 5,
          "end_offset": 6,
          "type": "<IDEOGRAPHIC>",
          "position": 5
        },
        {
          "token": "景",
          "start_offset": 6,
          "end_offset": 7,
          "type": "<IDEOGRAPHIC>",
          "position": 6
        }
      ]
    }
    
    

    3.2 ik's ik_max_word analyzer

    # Request
    
    GET http://127.0.0.1:9200/_analyze
    {
    	"analyzer": "ik_max_word",
    	"text": "正是江南好风景"
    }
    
    
    # Response
    {
      "tokens": [
        {
          "token": "正是",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "江南",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "江",
          "start_offset": 2,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "南",
          "start_offset": 3,
          "end_offset": 4,
          "type": "CN_CHAR",
          "position": 3
        },
        {
          "token": "好",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 4
        },
        {
          "token": "风景",
          "start_offset": 5,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 5
        },
        {
          "token": "景",
          "start_offset": 6,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 6
        }
      ]
    }
    
    

    3.3 ik's ik_smart analyzer

    # Request
    
    GET http://127.0.0.1:9200/_analyze
    {
    	"analyzer": "ik_smart",
    	"text": "正是江南好风景"
    }
    
    # Response
    {
      "tokens": [
        {
          "token": "正是",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "江南",
          "start_offset": 2,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "好",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 2
        },
        {
          "token": "风景",
          "start_offset": 5,
          "end_offset": 7,
          "type": "CN_WORD",
          "position": 3
        }
      ]
    }
    
    

    3.4 Comparing the results

    (1) The default analyzer splits Chinese text into individual characters, which is unsuitable for most use cases.

    (2) ik_max_word performs the finest-grained segmentation, while ik_smart does the opposite, producing the coarsest-grained segmentation.
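    A common way to combine the two is to index with ik_max_word (for recall) and analyze queries with ik_smart (for precision). A sketch in ES 2.x mapping syntax; the index name, type name, and field name here are placeholders:

    ```
    PUT http://127.0.0.1:9200/news
    {
      "mappings": {
        "article": {
          "properties": {
            "content": {
              "type": "string",
              "analyzer": "ik_max_word",
              "search_analyzer": "ik_smart"
            }
          }
        }
      }
    }
    ```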

  • Original article: https://www.cnblogs.com/thewindyz/p/14052829.html