  • Integrating Chinese analyzers into Elasticsearch

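    IK is a lightweight, dictionary-based Chinese word segmentation toolkit that can be integrated through Elasticsearch's plugin mechanism.
    I. Integration steps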

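    1. Create an ik directory under the plugins directory of the Elasticsearch installation.

    2. Download the IK plugin release that matches your Elasticsearch version from GitHub: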

    https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v6.8.12
    

    3. Unzip the plugin archive into the ik directory and restart Elasticsearch; the startup log shows that the ik plugin has been loaded:

    [2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService     ] [4EvvJl1] loaded plugin [analysis-ik]
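
    To confirm the plugin from a saved log file rather than by eye, a small script can scan for such lines. This is a minimal sketch that assumes the log format of the line shown above; the function name loaded_plugins is illustrative and not part of Elasticsearch.

```python
import re

# Matches lines such as:
# [2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService     ] [4EvvJl1] loaded plugin [analysis-ik]
# The pattern is derived from that single sample line, so other log layouts may need changes.
LOADED_PLUGIN = re.compile(
    r"\[o\.e\.p\.PluginsService\s*\]\s*\[[^\]]*\]\s*loaded plugin \[([^\]]+)\]"
)


def loaded_plugins(log_lines):
    """Return the plugin names reported as loaded, in log order."""
    names = []
    for line in log_lines:
        match = LOADED_PLUGIN.search(line)
        if match:
            names.append(match.group(1))
    return names


if __name__ == "__main__":
    sample = [
        "[2022-01-11T15:22:54,341][INFO ][o.e.p.PluginsService     ] "
        "[4EvvJl1] loaded plugin [analysis-ik]",
    ]
    print("analysis-ik" in loaded_plugins(sample))
```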
    

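    II. Trying the IK analyzers

    IK provides two analyzers, ik_smart and ik_max_word.

    The ik_max_word analyzer segments the text as exhaustively as possible, so its tokens are fine-grained: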

    POST _analyze
    {
      "analyzer": "ik_max_word",
      "text":"这次出差我们住的是闫团如家快捷酒店"
    }
    
    
    {
      "tokens" : [
        {
          "token" : "这次",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "出差",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "CN_WORD",
          "position" : 1
        },
        {
          "token" : "我们",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "住",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "CN_CHAR",
          "position" : 3
        },
        {
          "token" : "的",
          "start_offset" : 7,
          "end_offset" : 8,
          "type" : "CN_CHAR",
          "position" : 4
        },
        {
          "token" : "是",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "CN_CHAR",
          "position" : 5
        },
        {
          "token" : "闫",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "CN_CHAR",
          "position" : 6
        },
        {
          "token" : "团",
          "start_offset" : 10,
          "end_offset" : 11,
          "type" : "CN_CHAR",
          "position" : 7
        },
        {
          "token" : "如家",
          "start_offset" : 11,
          "end_offset" : 13,
          "type" : "CN_WORD",
          "position" : 8
        },
        {
          "token" : "快捷酒店",
          "start_offset" : 13,
          "end_offset" : 17,
          "type" : "CN_WORD",
          "position" : 9
        }
      ]
    }
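
    The same request can be issued outside the Kibana console. The sketch below uses only the Python standard library; the base URL http://localhost:9200 and the helper names are assumptions, and a secured cluster would need authentication that is not shown here.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (without sending) a POST request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Extract just the token strings from an _analyze response body."""
    return [entry["token"] for entry in response["tokens"]]


def analyze(base_url, analyzer, text):
    """Send the request and return the token strings (needs a running node)."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as reply:
        return token_texts(json.load(reply))


if __name__ == "__main__":
    # Assumed local node address; adjust to your environment.
    print(analyze("http://localhost:9200", "ik_max_word",
                  "这次出差我们住的是闫团如家快捷酒店"))
```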
    
    
    
    

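    ik_smart is comparatively coarse-grained, although for this particular sentence it returns the same tokens as ik_max_word: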

    POST _analyze
    {
      "analyzer": "ik_smart",
      "text":"这次出差我们住的是闫团如家快捷酒店"
    }
    
    {
      "tokens" : [
        {
          "token" : "这次",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "出差",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "CN_WORD",
          "position" : 1
        },
        {
          "token" : "我们",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "住",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "CN_CHAR",
          "position" : 3
        },
        {
          "token" : "的",
          "start_offset" : 7,
          "end_offset" : 8,
          "type" : "CN_CHAR",
          "position" : 4
        },
        {
          "token" : "是",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "CN_CHAR",
          "position" : 5
        },
        {
          "token" : "闫",
          "start_offset" : 9,
          "end_offset" : 10,
          "type" : "CN_CHAR",
          "position" : 6
        },
        {
          "token" : "团",
          "start_offset" : 10,
          "end_offset" : 11,
          "type" : "CN_CHAR",
          "position" : 7
        },
        {
          "token" : "如家",
          "start_offset" : 11,
          "end_offset" : 13,
          "type" : "CN_WORD",
          "position" : 8
        },
        {
          "token" : "快捷酒店",
          "start_offset" : 13,
          "end_offset" : 17,
          "type" : "CN_WORD",
          "position" : 9
        }
      ]
    }
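
    When two analyzers return long token lists, comparing them by eye is error-prone. The following sketch compares two _analyze responses by token text and offsets; the helper names are illustrative. Fed the two responses above, it reports no differences.

```python
def spans(tokens):
    """Reduce a token list to a set of (text, start_offset, end_offset)."""
    return {(t["token"], t["start_offset"], t["end_offset"]) for t in tokens}


def only_in_first(first, second):
    """Tokens present in the first response but not in the second."""
    return sorted(spans(first) - spans(second), key=lambda s: (s[1], s[2]))


def tok(text, start, end):
    """Build a minimal token entry; the type and position fields are omitted."""
    return {"token": text, "start_offset": start, "end_offset": end}


if __name__ == "__main__":
    # Token texts and offsets copied from the two responses shown above.
    ik_max_word = [tok("这次", 0, 2), tok("出差", 2, 4), tok("我们", 4, 6),
                   tok("住", 6, 7), tok("的", 7, 8), tok("是", 8, 9),
                   tok("闫", 9, 10), tok("团", 10, 11), tok("如家", 11, 13),
                   tok("快捷酒店", 13, 17)]
    ik_smart = list(ik_max_word)  # identical output for this sentence
    print(only_in_first(ik_max_word, ik_smart))
    print(only_in_first(ik_smart, ik_max_word))
```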
    
    

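    III. Extending the IK dictionary

    闫团 is a fairly small place that IK's dictionary does not contain, so it is split into two single characters. We can add it to IK's dictionary.

    Create a my.dic file in the config directory under the IK installation directory and put 闫团 in it; then edit IKAnalyzer.cfg.xml to register the new dictionary file: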

    <properties>
    	<comment>IK Analyzer extension configuration</comment>
    	<!-- Configure your own extension dictionaries here -->
    	<entry key="ext_dict">my.dic</entry>
    	<!-- Configure your own extension stopword dictionaries here -->
    	<entry key="ext_stopwords"></entry>
    	<!-- Configure remote extension dictionaries here -->
    	<!-- <entry key="remote_ext_dict">words_location</entry> -->
    	<!-- Configure remote extension stopword dictionaries here -->
    	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
    </properties>
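
    Before restarting, it can help to check that the edited file really lists the new dictionary. The sketch below parses the configuration with the standard library and returns the entries for a given key. It assumes that several dictionary paths may be separated by semicolons, which is my understanding of the IK documentation; the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def configured_dictionaries(cfg_xml, key="ext_dict"):
    """Return the dictionary paths listed under the given entry key."""
    root = ET.fromstring(cfg_xml)
    for entry in root.iter("entry"):
        if entry.get("key") == key:
            raw = entry.text or ""
            # Assumption: multiple paths are separated by ';'.
            return [part.strip() for part in raw.split(";") if part.strip()]
    return []


if __name__ == "__main__":
    sample = """
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <entry key="ext_dict">my.dic</entry>
        <entry key="ext_stopwords"></entry>
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
    </properties>
    """
    print(configured_dictionaries(sample))                   # ext_dict entries
    print(configured_dictionaries(sample, "ext_stopwords"))  # empty entry
```

    Entries that are commented out, such as remote_ext_dict above, are ignored because the parser skips XML comments.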
    

    Restart Elasticsearch and run the request again; the place name is now returned as a single token:

    POST _analyze
    {
      "analyzer": "ik_smart",
      "text":"这次出差我们住的是闫团如家快捷酒店"
    }
    
    {
      "tokens" : [
        {
          "token" : "这次",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "CN_WORD",
          "position" : 0
        },
        {
          "token" : "出差",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "CN_WORD",
          "position" : 1
        },
        {
          "token" : "我们",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "CN_WORD",
          "position" : 2
        },
        {
          "token" : "住",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "CN_CHAR",
          "position" : 3
        },
        {
          "token" : "的",
          "start_offset" : 7,
          "end_offset" : 8,
          "type" : "CN_CHAR",
          "position" : 4
        },
        {
          "token" : "是",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "CN_CHAR",
          "position" : 5
        },
        {
          "token" : "闫团",
          "start_offset" : 9,
          "end_offset" : 11,
          "type" : "CN_WORD",
          "position" : 6
        },
        {
          "token" : "如家",
          "start_offset" : 11,
          "end_offset" : 13,
          "type" : "CN_WORD",
          "position" : 7
        },
        {
          "token" : "快捷酒店",
          "start_offset" : 13,
          "end_offset" : 17,
          "type" : "CN_WORD",
          "position" : 8
        }
      ]
    }
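
    Why one extra dictionary entry changes the result can be illustrated with forward maximum matching, one of the simplest dictionary-based segmentation schemes. This is not IK's actual algorithm, which is more elaborate; it is a toy model, and the small word set below is an assumption chosen to mirror the example sentence.

```python
def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word,
    falling back to a single character when nothing matches."""
    longest = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    sentence = "这次出差我们住的是闫团如家快捷酒店"
    # Toy dictionary, standing in for the analyzer's built-in word list.
    base = {"这次", "出差", "我们", "如家", "快捷酒店"}
    print(forward_max_match(sentence, base))             # 闫 and 团 stay separate
    print(forward_max_match(sentence, base | {"闫团"}))  # 闫团 becomes one token
```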
    
    

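    IV. Trying the HanLP analyzer and a custom dictionary

    HanLP is a Java toolkit made up of a series of models and algorithms. Starting from Chinese word segmentation, it covers common NLP tasks such as part-of-speech tagging, named entity recognition, syntactic parsing, and text classification. It offers a rich API and is widely used with search platforms such as Lucene, Solr, and ES. For segmentation it supports shortest-path, N-shortest-path, and CRF segmentation, among other algorithms.

    Download the HanLP plugin package from the address below. Elasticsearch only installs plugins built for its exact version, so pick the release that matches your node (this one targets 7.9.2, whereas the IK release above targets 6.8.12):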

    https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.9.2/elasticsearch-analysis-hanlp-7.9.2.zip
    

    Install the HanLP plugin package:

    bin\elasticsearch-plugin install file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
    -> Installing file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
    -> Downloading file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip
    [=================================================] 100%
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @     WARNING: plugin requires additional permissions     @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    * java.io.FilePermission plugins/analysis-hanlp/data/-#plus read,write,delete
    * java.io.FilePermission plugins/analysis-hanlp/hanlp.cache#plus read,write,delete
    * java.lang.RuntimePermission getClassLoader
    * java.lang.RuntimePermission setContextClassLoader
    * java.net.SocketPermission * connect,resolve
    * java.util.PropertyPermission * read,write
    See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
    for descriptions of what these permissions allow and the associated risks.
    
    Continue with installation? [y/N]y
    -> Installed analysis-hanlp
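
    A plugin built for a different Elasticsearch version is rejected at install time, so a quick pre-check can save a round trip. The sketch below reads the version from the archive name. It assumes the release follows the naming pattern of the file above (name, hyphen, version, .zip); the function names are illustrative.

```python
import re


def plugin_version_from_zip(name):
    """Extract a dotted version such as 7.9.2 from a plugin archive name or URL."""
    match = re.search(r"-(\d+(?:\.\d+)+)\.zip$", name)
    return match.group(1) if match else None


def matches_elasticsearch(zip_name, es_version):
    """True when the archive's version equals the node's version exactly."""
    return plugin_version_from_zip(zip_name) == es_version


if __name__ == "__main__":
    archive = "file:///c:/elasticsearch-analysis-hanlp-7.9.2.zip"
    print(plugin_version_from_zip(archive))
    print(matches_elasticsearch(archive, "7.9.2"))
    print(matches_elasticsearch(archive, "6.8.12"))
```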
    
    

    Analyze the text with the hanlp_standard analyzer:

    POST _analyze
    {
      "analyzer": "hanlp_standard",
      "text":"这次出差我们住的是闫团如家快捷酒店"
    }
    
    {
      "tokens" : [
        {
          "token" : "这次",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "r",
          "position" : 0
        },
        {
          "token" : "出差",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "vi",
          "position" : 1
        },
        {
          "token" : "我们",
          "start_offset" : 4,
          "end_offset" : 6,
          "type" : "rr",
          "position" : 2
        },
        {
          "token" : "住",
          "start_offset" : 6,
          "end_offset" : 7,
          "type" : "vi",
          "position" : 3
        },
        {
          "token" : "的",
          "start_offset" : 7,
          "end_offset" : 8,
          "type" : "ude1",
          "position" : 4
        },
        {
          "token" : "是",
          "start_offset" : 8,
          "end_offset" : 9,
          "type" : "vshi",
          "position" : 5
        },
        {
          "token" : "闫团",
          "start_offset" : 9,
          "end_offset" : 11,
          "type" : "nr",
          "position" : 6
        },
        {
          "token" : "如家",
          "start_offset" : 11,
          "end_offset" : 13,
          "type" : "r",
          "position" : 7
        },
        {
          "token" : "快捷酒店",
          "start_offset" : 13,
          "end_offset" : 17,
          "type" : "ntch",
          "position" : 8
        }
      ]
    }
    
    

    We can see that HanLP keeps 闫团 together as a single token out of the box.
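
    Unlike IK, the type field in the HanLP response carries a part-of-speech tag. The sketch below attaches readable labels to the tags that occur in this article. The labels reflect my reading of HanLP's tag set and cover only these tags, so treat the table as an assumption rather than a full reference.

```python
# Labels for the tags that appear in this article's responses only.
POS_LABELS = {
    "n": "noun",
    "nr": "person name",
    "r": "pronoun",
    "rr": "personal pronoun",
    "vi": "intransitive verb",
    "vshi": "the verb 是",
    "ude1": "the particle 的",
    "mq": "numeral-classifier compound",
    "a": "adjective",
    "ntch": "hotel or guesthouse",
}


def describe(tokens):
    """Pair each token with a readable label, falling back to the raw tag."""
    return [(t["token"], POS_LABELS.get(t["type"], t["type"])) for t in tokens]


if __name__ == "__main__":
    # Three entries copied from the hanlp_standard response above.
    sample = [
        {"token": "闫团", "type": "nr"},
        {"token": "如家", "type": "r"},
        {"token": "快捷酒店", "type": "ntch"},
    ]
    for token, label in describe(sample):
        print(token, label)
```

    Read this way, HanLP keeps 闫团 together because it tags it as a person name (nr), not because it recognizes the place.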

    Running the following test shows that HanLP does not treat 小地方 as a single token:

    POST _analyze
    {
      "analyzer": "hanlp_standard",
      "text":"闫团是一个小地方"
    }
    
    {
      "tokens" : [
        {
          "token" : "闫团",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "nr",
          "position" : 0
        },
        {
          "token" : "是",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "vshi",
          "position" : 1
        },
        {
          "token" : "一个",
          "start_offset" : 3,
          "end_offset" : 5,
          "type" : "mq",
          "position" : 2
        },
        {
          "token" : "小",
          "start_offset" : 5,
          "end_offset" : 6,
          "type" : "a",
          "position" : 3
        },
        {
          "token" : "地方",
          "start_offset" : 6,
          "end_offset" : 8,
          "type" : "n",
          "position" : 4
        }
      ]
    }
    
    

    To customize the segmentation, create my.dic under ${ES_HOME}/plugins/analysis-hanlp/data/dictionary/custom and add 小地方 to it.

    Then copy hanlp.properties from the plugin package to ${ES_HOME}/config/analysis-hanlp/hanlp.properties and append the new dictionary to CustomDictionaryPath:

    CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; ModernChineseSupplementaryWord.txt; ChinesePlaceName.txt ns; PersonalName.txt; OrganizationName.txt; ShanghaiPlaceName.txt ns;data/dictionary/person/nrf.txt nrf;data/dictionary/custom/my.dic;
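
    The value packs several conventions into one line. As I understand the comments in HanLP's properties file, entries are separated by semicolons, an entry that starts with a space lives in the same directory as the previous one, and a trailing word after the file name is the default part-of-speech tag for that dictionary. The sketch below expands the value under those assumptions; it is not HanLP's own parser.

```python
import posixpath


def parse_custom_dictionary_path(value):
    """Expand a CustomDictionaryPath value into (path, default_tag) pairs."""
    entries = []
    current_dir = ""
    for raw in value.split(";"):
        if not raw.strip():
            continue  # skip the empty piece after a trailing ';'
        same_dir = raw.startswith(" ")  # leading space: reuse previous directory
        parts = raw.split()
        name = parts[0]
        tag = parts[1] if len(parts) > 1 else None
        if same_dir:
            path = posixpath.join(current_dir, name)
        else:
            path = name
            current_dir = posixpath.dirname(name)
        entries.append((path, tag))
    return entries


if __name__ == "__main__":
    value = ("data/dictionary/custom/CustomDictionary.txt; "
             "ChinesePlaceName.txt ns;"
             "data/dictionary/person/nrf.txt nrf;"
             "data/dictionary/custom/my.dic;")
    for path, tag in parse_custom_dictionary_path(value):
        print(path, tag)
```

    Because my.dic is written with its full relative path and no leading space, it does not depend on which directory the preceding entry used.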
    
    

    Restart Elasticsearch and run the test again (this request uses the hanlp analyzer):

    POST _analyze
    {
      "analyzer": "hanlp",
      "text":"闫团是一个小地方"
    }
    
    {
      "tokens" : [
        {
          "token" : "闫团",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "nr",
          "position" : 0
        },
        {
          "token" : "是",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "vshi",
          "position" : 1
        },
        {
          "token" : "一个",
          "start_offset" : 3,
          "end_offset" : 5,
          "type" : "mq",
          "position" : 2
        },
        {
          "token" : "小地方",
          "start_offset" : 5,
          "end_offset" : 8,
          "type" : "n",
          "position" : 3
        }
      ]
    }
    
    
  • Original article: https://www.cnblogs.com/wufengtinghai/p/15790472.html