  • Installing the IK analyzer in Elasticsearch

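    I. Overview


    1. Elasticsearch's default analyzer handles Chinese poorly: it splits the text into individual characters. The IK analyzer supports Chinese much better and provides two main modes: ik_smart (coarsest-grained segmentation) and ik_max_word (finest-grained segmentation).
    2. Environment
    Operating system: CentOS
    Elasticsearch version: 6.0.0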
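
    To make the "one token per character" behaviour concrete, the sketch below imitates it in plain Java. It does not call Elasticsearch: PerCharacterSplit is a hypothetical name, and the real default analyzer runs inside Elasticsearch and is only approximated here.

    ```java
    import java.util.List;
    import java.util.stream.Collectors;

    /** Hypothetical illustration only; not part of Elasticsearch or the IK plugin. */
    final class PerCharacterSplit {

        /** Splits the text into one string per Unicode code point. */
        static List<String> split(String text) {
            return text.codePoints()
                    .mapToObj(cp -> new String(Character.toChars(cp)))
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            // Prints one single-character token per character of the phrase.
            System.out.println(split("中华人民共和国国歌"));
        }
    }
    ```

    For the nine-character phrase used in section IV this yields nine single-character tokens, whereas ik_smart returns only two tokens for the same phrase.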

    II. Install the plugin


    1. Plugin repository: https://github.com/medcl/elasticsearch-analysis-ik
    2. Run the following command from the Elasticsearch root directory:

    ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.0.0/elasticsearch-analysis-ik-6.0.0.zip

    Once the command finishes, a new analysis-ik directory appears under both the plugins and config folders of the Elasticsearch root directory (esroot).
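
    The install step can be double-checked with a small Java sketch. IkPluginCheck and its method names are hypothetical; the URL pattern is taken from the command above, and the assumption that the same pattern holds for other versions should be verified against the plugin's releases page. The plugin version is expected to match the Elasticsearch version.

    ```java
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    /** Hypothetical helper; not part of Elasticsearch or the IK plugin. */
    final class IkPluginCheck {

        /** Builds the release URL for a version, following the pattern of the install command. */
        static String downloadUrl(String version) {
            return "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v"
                    + version + "/elasticsearch-analysis-ik-" + version + ".zip";
        }

        /** True when both plugins/analysis-ik and config/analysis-ik exist under esRoot. */
        static boolean isInstalled(Path esRoot) {
            return Files.isDirectory(esRoot.resolve("plugins").resolve("analysis-ik"))
                    && Files.isDirectory(esRoot.resolve("config").resolve("analysis-ik"));
        }

        public static void main(String[] args) {
            // Pass the Elasticsearch root directory as the first argument (defaults to ".").
            Path esRoot = Paths.get(args.length > 0 ? args[0] : ".");
            System.out.println(downloadUrl("6.0.0"));
            System.out.println("analysis-ik installed: " + isInstalled(esRoot));
        }
    }
    ```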

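    III. Restart Elasticsearch


    1. Find the Elasticsearch process

    ps -ef | grep elastic

    2. Stop the process
    In this example, the output of the command above shows that the Elasticsearch process ID is 12776.
    Run: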

    kill 12776
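
    The process lookup in steps 1 and 2 can also be sketched in Java with the standard ProcessHandle API (Java 9 or later). EsProcessFinder is a hypothetical name, and the substring match is as loose as `grep elastic`, so it may report unrelated processes. The sketch only lists candidate PIDs and leaves termination to the operator.

    ```java
    import java.util.List;
    import java.util.stream.Collectors;

    /** Hypothetical helper; mirrors `ps -ef | grep elastic` with the JDK's ProcessHandle API. */
    final class EsProcessFinder {

        /** Same loose match as `grep elastic`: any command line containing the substring. */
        static boolean looksLikeEs(String commandLine) {
            return commandLine.contains("elastic");
        }

        /** PIDs of all visible processes whose command line matches. */
        static List<Long> findPids() {
            return ProcessHandle.allProcesses()
                    .filter(p -> p.info().commandLine()
                            .map(EsProcessFinder::looksLikeEs)
                            .orElse(false))
                    .map(ProcessHandle::pid)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            // Print candidates only; nothing is terminated here.
            findPids().forEach(pid -> System.out.println("candidate PID: " + pid));
        }
    }
    ```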

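    3. Start Elasticsearch in the background

    ./bin/elasticsearch -d

    Note: restarting Elasticsearch triggers shard reallocation and recovery, so take care in production environments.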

    IV. Testing


    1. Tokenize with ik_max_word

    GET _analyze 
    { 
       "analyzer":"ik_max_word",
       "text":"中华人民共和国国歌"
    }

    Tokenization result:

    {
       "tokens": [
         {
           "token": "中华人民共和国",
           "start_offset": 0,
           "end_offset": 7,
           "type": "CN_WORD",
           "position": 0
         },
         {
           "token": "中华人民",
           "start_offset": 0,
           "end_offset": 4,
           "type": "CN_WORD",
           "position": 1
         },
         {
           "token": "中华",
           "start_offset": 0,
           "end_offset": 2,
           "type": "CN_WORD",
           "position": 2
         },
         {
           "token": "华人",
           "start_offset": 1,
           "end_offset": 3,
           "type": "CN_WORD",
           "position": 3
         },
         {
           "token": "人民共和国",
           "start_offset": 2,
           "end_offset": 7,
           "type": "CN_WORD",
           "position": 4
         },
         {
           "token": "人民",
           "start_offset": 2,
           "end_offset": 4,
           "type": "CN_WORD",
           "position": 5
         },
         {
           "token": "共和国",
           "start_offset": 4,
           "end_offset": 7,
           "type": "CN_WORD",
           "position": 6
         },
         {
           "token": "共和",
           "start_offset": 4,
           "end_offset": 6,
           "type": "CN_WORD",
           "position": 7
         },
         {
           "token": "国",
           "start_offset": 6,
           "end_offset": 7,
           "type": "CN_CHAR",
           "position": 8
         },
         {
           "token": "国歌",
           "start_offset": 7,
           "end_offset": 9,
           "type": "CN_WORD",
           "position": 9
         }
       ]
    }
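
    The start_offset and end_offset values in the response are character positions in the original text (start inclusive, end exclusive), so each token can be recovered with a plain substring call. The sketch below is a hypothetical helper (TokenOffsets is not an Elasticsearch class), and it assumes the offsets count Java char units, which holds for the characters used here.

    ```java
    /** Hypothetical helper showing how _analyze offsets map back to the input text. */
    final class TokenOffsets {

        /** Returns the part of text covered by [startOffset, endOffset). */
        static String slice(String text, int startOffset, int endOffset) {
            return text.substring(startOffset, endOffset);
        }

        public static void main(String[] args) {
            String text = "中华人民共和国国歌";
            // (start_offset, end_offset) pairs copied from the ik_max_word response.
            int[][] offsets = {
                {0, 7}, {0, 4}, {0, 2}, {1, 3}, {2, 7},
                {2, 4}, {4, 7}, {4, 6}, {6, 7}, {7, 9}
            };
            for (int[] o : offsets) {
                System.out.println(o[0] + "-" + o[1] + " -> " + slice(text, o[0], o[1]));
            }
        }
    }
    ```

    Overlapping ranges such as 0-7, 0-4 and 0-2 show why ik_max_word is described as fine-grained: the same stretch of text is emitted at several granularities.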

    2. Tokenize with ik_smart

    GET _analyze 
    { 
       "analyzer":"ik_smart",
       "text":"中华人民共和国国歌"
    }

    Tokenization result:

    {
       "tokens": [
         {
           "token": "中华人民共和国",
           "start_offset": 0,
           "end_offset": 7,
           "type": "CN_WORD",
           "position": 0
         },
         {
           "token": "国歌",
           "start_offset": 7,
           "end_offset": 9,
           "type": "CN_WORD",
           "position": 1
         }
       ]
    }
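
    The two _analyze requests in this section can also be sent over HTTP from Java. The following is a minimal sketch using the JDK's java.net.http client (Java 11 or later). AnalyzeOverHttp is a hypothetical class, the base URL (for example http://localhost:9200) is an assumption that the article does not state, and the JSON escaping covers only quotes and backslashes.

    ```java
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Hypothetical helper that posts an _analyze request to an Elasticsearch node. */
    final class AnalyzeOverHttp {

        /** Minimal JSON string escaping: backslashes and double quotes only. */
        static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }

        /** Builds the same request body as the console examples. */
        static String body(String analyzer, String text) {
            return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
        }

        /** Sends the request and returns the raw JSON response; baseUrl is supplied by the caller. */
        static String analyze(String baseUrl, String analyzer, String text) throws Exception {
            HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body(analyzer, text)))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        }
    }
    ```

    With a node running, a call such as AnalyzeOverHttp.analyze("http://localhost:9200", "ik_smart", "中华人民共和国国歌") should return JSON of the same shape as the results shown in this section.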

    V. Testing tokenization through the Java API

    1. Call the ik_max_word analyzer

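    import java.util.List;
    import org.elasticsearch.action.admin.indices.analyze.AnalyzeRequest;
    import org.elasticsearch.action.admin.indices.analyze.AnalyzeResponse;
    import org.elasticsearch.client.transport.TransportClient;
    import org.junit.Test; // JUnit 4 is assumed here

    @Test
    public void analyzer_ik_max_word() throws Exception {
        String text = "提前祝大家春节快乐!";

        // EsClient is assumed to be the author's own helper that returns a connected TransportClient.
        TransportClient client = EsClient.get();
        AnalyzeRequest request = new AnalyzeRequest().analyzer("ik_max_word").text(text);
        List<AnalyzeResponse.AnalyzeToken> tokens =
                client.admin().indices().analyze(request).actionGet().getTokens();
        System.out.println(tokens.size()); // 6
        for (AnalyzeResponse.AnalyzeToken token : tokens) {
            System.out.println(token.getTerm() + " ");
        }
    }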

    Result:

    6
    提前 
    祝 
    大家 
    春节快乐 
    春节 
    快乐

    2. Call the ik_smart analyzer

    @Test
    public void analyzer_ik_smart() throws Exception {
        String text = "提前祝大家春节快乐!";

        // EsClient is assumed to be the author's own helper that returns a connected TransportClient.
        TransportClient client = EsClient.get();
        AnalyzeRequest request = new AnalyzeRequest().analyzer("ik_smart").text(text);
        List<AnalyzeResponse.AnalyzeToken> tokens =
                client.admin().indices().analyze(request).actionGet().getTokens();
        System.out.println(tokens.size()); // 4
        for (AnalyzeResponse.AnalyzeToken token : tokens) {
            System.out.println(token.getTerm() + " ");
        }
    }

    Result:

    4
    提前 
    祝 
    大家 
    春节快乐
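
    The difference between the two modes can be made explicit by comparing the two result lists. The sketch below is a hypothetical helper (TokenDiff is not part of any library) and works on plain strings, so it needs no Elasticsearch connection; the sample lists in main are copied from the results above.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    /** Hypothetical helper that compares the output of two analyzers. */
    final class TokenDiff {

        /** Tokens present in fine but missing from coarse, in their original order. */
        static List<String> extraTokens(List<String> fine, List<String> coarse) {
            List<String> extra = new ArrayList<>(fine);
            extra.removeAll(coarse);
            return extra;
        }

        public static void main(String[] args) {
            List<String> maxWord = Arrays.asList("提前", "祝", "大家", "春节快乐", "春节", "快乐");
            List<String> smart = Arrays.asList("提前", "祝", "大家", "春节快乐");
            // Prints the tokens that only ik_max_word produced.
            System.out.println(extraTokens(maxWord, smart));
        }
    }
    ```

    For this sentence, ik_max_word additionally emits the shorter words contained in 春节快乐, while ik_smart keeps only the longest match.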
  • Original article: https://www.cnblogs.com/janes/p/8393634.html