  • Using the IK analyzer in an Elasticsearch cluster

    Installing the IK analysis plugin

    ES cluster environment

    • Three Ubuntu 14.04.2 LTS virtual machines running under VMware
    • JDK 1.8.0_66
    • Elasticsearch 2.3.1
    • elasticsearch-jdbc-2.3.1.0
    • IK analyzer 1.9.1
    • Cluster name: my-application
      The nodes are assigned as follows:
      VM | IP | node-x
      ----|----|----
      search1 | 192.168.235.133 | node-1
      search2 | 192.168.235.134 | node-2
      search3 | 192.168.235.135 | node-3

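    Downloading and building the IK analyzer

    Download the IK analyzer zip package from GitHub:
    https://github.com/myitroad/elasticsearch-analysis-ik
    Unzip it and import it into IntelliJ IDEA as a Maven project.
    Building the jar
    In the IntelliJ IDEA terminal, run the following Maven commands: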

    mvn clean
    mvn compile
    mvn package
    

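    This produces the following file under F:\workspace_idea\elasticsearch-analysis-ik-master\target\releases:
    elasticsearch-analysis-ik-1.9.1.zip
    Uploading the IK analyzer
    Upload the zip package to one of the Elasticsearch nodes (any one will do, e.g. node-1) and unzip it into the directory:
    /home/es/cluster/elasticsearch-2.3.1/plugins/ik
    The resulting layout of the ik folder is: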

    ik
    │   ├── commons-codec-1.9.jar
    │   ├── commons-logging-1.2.jar
    │   ├── config
    │   │   └── ik
    │   │       ├── custom
    │   │       │   ├── ext_stopword.dic
    │   │       │   ├── mydict.dic
    │   │       │   ├── single_word.dic
    │   │       │   ├── single_word_full.dic
    │   │       │   ├── single_word_low_freq.dic
    │   │       │   └── sougou.dic
    │   │       ├── IKAnalyzer.cfg.xml
    │   │       ├── main.dic
    │   │       ├── preposition.dic
    │   │       ├── quantifier.dic
    │   │       ├── stopword.dic
    │   │       ├── suffix.dic
    │   │       └── surname.dic
    │   ├── elasticsearch-analysis-ik-1.9.1.jar
    │   ├── httpclient-4.4.1.jar
    │   ├── httpcore-4.4.1.jar
    │   └── plugin-descriptor.properties
    

    Configuring dictionaries (IK ships with a Sogou dictionary)
    Edit: $ES_HOME/plugins/ik/config/ik/IKAnalyzer.cfg.xml
    Add the following entry:

    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic;custom/sougou.dic</entry>
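    The entry above lists dictionary files by path relative to the config/ik directory. The bundled .dic files are plain UTF-8 text with one word per line, so a custom word can be added by appending a line to custom/mydict.dic. The helper below is only a sketch: the function name add_word and the sample word are hypothetical and are not part of IK.

```shell
#!/bin/sh
# add_word DICT_FILE WORD  (hypothetical helper, not part of IK)
# Appends WORD on its own line unless an identical line already exists.
add_word() {
    dict=$1
    word=$2
    # -x: match the whole line, -F: treat the word as a literal string
    if grep -qxF -- "$word" "$dict" 2>/dev/null; then
        return 0
    fi
    # If the file is non-empty and lacks a trailing newline, add one first
    # so the new word does not get glued onto the last existing word.
    if [ -s "$dict" ] && [ -n "$(tail -c 1 "$dict")" ]; then
        printf '\n' >> "$dict"
    fi
    printf '%s\n' "$word" >> "$dict"
}

# Example call (path taken from the directory tree above; the word is made up):
# add_word "$ES_HOME/plugins/ik/config/ik/custom/mydict.dic" '弹性搜索'
```

    Calling it twice with the same word leaves the file unchanged the second time, which makes the step safe to repeat.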
    

    Restart node-1.

    Testing the IK analyzer

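    Passing Chinese text directly to the _analyze command may produce garbled characters, so URL-encode the Chinese input first.
    %E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA is the URL-encoded form of “我是中国人” ("I am Chinese").
    Testing the analyzer with the raw string “我是中国人” may return garbled output.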
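    The encoded string above is simply the UTF-8 bytes of the text, each written as % followed by two hex digits. A minimal sketch that produces it with standard tools follows; the function name urlencode is hypothetical, and for simplicity it encodes every byte, including ASCII characters that would not strictly need it.

```shell
#!/bin/sh
# urlencode TEXT  (hypothetical helper)
# Percent-encodes every byte of TEXT.
urlencode() {
    printf '%s' "$1" |
        od -An -tx1 |          # dump the bytes as two-digit hex values
        tr -d ' \n' |          # join them into one continuous hex string
        sed 's/../%&/g' |      # put a % in front of every byte
        tr 'a-f' 'A-F'         # upper-case the hex digits
}

urlencode '我是中国人'
```

    The call on the last line prints the same encoded string that is used in the requests below. curl can also do the encoding itself via -G together with --data-urlencode 'text=我是中国人'.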
    Using IK's ik_max_word analyzer (finest-grained segmentation)

    es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_max_word&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'
    

    The tokenization result:

    {
      "tokens" : [ {
        "token" : "我是",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "CN_WORD",
        "position" : 0
      }, {
        "token" : "我",
        "start_offset" : 0,
        "end_offset" : 1,
        "type" : "CN_WORD",
        "position" : 1
      }, {
        "token" : "是中国人",
        "start_offset" : 1,
        "end_offset" : 5,
        "type" : "CN_WORD",
        "position" : 2
      }, {
        "token" : "中国人",
        "start_offset" : 2,
        "end_offset" : 5,
        "type" : "CN_WORD",
        "position" : 3
      }, {
        "token" : "中国",
        "start_offset" : 2,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 4
      }, {
        "token" : "国人",
        "start_offset" : 3,
        "end_offset" : 5,
        "type" : "CN_WORD",
        "position" : 5
      }, {
        "token" : "人",
        "start_offset" : 4,
        "end_offset" : 5,
        "type" : "CN_WORD",
        "position" : 6
      } ]
    }
    

    Using IK's ik_smart analyzer (coarsest-grained segmentation)

    es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_smart&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'
    

    The response:

    {
      "tokens" : [ {
        "token" : "我是",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "CN_WORD",
        "position" : 0
      }, {
        "token" : "中国人",
        "start_offset" : 2,
        "end_offset" : 5,
        "type" : "CN_WORD",
        "position" : 1
      } ]
    }
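    To compare the two analyzers it is often enough to look at the token strings alone. The sketch below pulls the "token" values out of an _analyze response read from standard input. The function name extract_tokens is hypothetical, it relies on grep -o (available in GNU and BSD grep), and it assumes tokens contain no escaped double quotes.

```shell
#!/bin/sh
# extract_tokens  (hypothetical helper)
# Reads an _analyze JSON response on stdin and prints one token per line.
extract_tokens() {
    # Keep only the "token" : "..." pairs; the "tokens" array key is not
    # matched because the pattern requires a closing quote right after token.
    grep -o '"token" *: *"[^"]*"' |
        sed 's/^"token" *: *"//; s/"$//'   # strip the key and the quotes
}

# Example call against the cluster used in this post:
# curl -s 'localhost:9200/myindex/_analyze?analyzer=ik_smart&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA' | extract_tokens
```

    For the ik_smart response shown above this would list 我是 and 中国人, one per line.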
    

    Importing MySQL data with the IK analyzer

    Creating the myindex index
    On node-1, run:

    curl -XPUT 'localhost:9200/myindex?pretty'
    

    Write the MySQL-to-ES import script mysql-es-all.sh (it can be stored anywhere):

    #!/bin/sh
    bin=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/bin
    lib=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/lib
    echo '
    {
        "type" : "jdbc",
        "jdbc" : {
            "locale" : "zh_CN",
            "statefile" : "statefile.json",
            "timezone" : "GMT+8",
            "autocommit" : true,
            "elasticsearch" : {
                "cluster" : "my-application",
                "host" : "192.168.235.133",
                "port" : "9300"
            },
            "index" : "myindex",
            "type" : "mytype",
            "url" : "jdbc:mysql://10.110.1.47:3306/ispider_data",
            "user" : "root",
            "password" : "xxx",
            "sql" : "select uuid as _id,title,content,release_time from JCY_VOICE_NEWS_INFO",
            "metrics" : {
                "enabled" : true,
                "interval" : "5m"
            },
            "index_settings" : {
                "index" : {
                    "number_of_shards" : 2,
                    "number_of_replicas" : 2
                }
            },
            "type_mapping": {
                "mytype" : {
                    "properties" : {
                        "title" : {
                            "type" : "string",
                            "store": "no",
                            "term_vector": "with_positions_offsets",
                            "analyzer": "ik_max_word",
                            "search_analyzer": "ik_max_word",
                            "include_in_all": "true"
                        },
                        "content" : {
                            "type" : "string",
                            "store": "no",
                            "term_vector": "with_positions_offsets",
                            "analyzer": "ik_max_word",
                            "search_analyzer": "ik_max_word",
                            "include_in_all": "true"
                        },
                        "release_time":{
                            "type":"date",
                            "store":"no",
                            "format":"YYYY-MM-dd HH:mm:ss",
                            "index":"not_analyzed",
                            "include_in_all":"true"
                        }
                    }
                }
            }
        }
    }
    ' | java \
        -cp "${lib}/*" \
        -Dlog4j.configurationFile=${bin}/log4j2.xml \
        org.xbib.tools.Runner \
        org.xbib.tools.JDBCImporter
    

    Make the script executable and run it

    es@search1:~/cluster/elasticsearch-2.3.1$ chmod +x mysql-es-all.sh
    es@search1:~/cluster/elasticsearch-2.3.1$ ./mysql-es-all.sh
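
    Once the import has finished, one way to check that the data arrived and that the IK-analyzed fields are searchable is to run a match query against myindex. The sketch below only builds the request body. The function name match_query and the search term are hypothetical, the field name title comes from the mapping above, and no JSON escaping is done, so the text must not contain double quotes or backslashes.

```shell
#!/bin/sh
# match_query FIELD TEXT  (hypothetical helper)
# Prints a match-query body for the _search API.
match_query() {
    printf '{"query":{"match":{"%s":"%s"}}}\n' "$1" "$2"
}

# Example calls against the cluster used in this post:
# curl -XGET 'localhost:9200/myindex/_count?pretty'
# curl -XPOST 'localhost:9200/myindex/mytype/_search?pretty' -d "$(match_query title '中国')"
```

    The _count call reports how many documents were indexed, and the _search call returns the documents whose title matches the given text after analysis.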
    

  • Original article: https://www.cnblogs.com/myitroad/p/5434379.html