  • Using the extension dictionary and extension stop-word dictionary of the elasticsearch ik analysis plugin

    This post is reproduced from https://blog.csdn.net/caideb/article/details/81632154

    The layout on cnblogs is much easier to read, so I am sharing it here as a separate post.

    -----------------------------------------------------------------------------------------------

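    Words in the extension dictionary are emitted as tokens, and words in the extension stop-word dictionary are filtered out of the result.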

    1. Without the extension dictionary or the stop-word dictionary

    1) ik analyzer

    [root@localhost custom]# curl -i -X GET -H 'Content-type:application/json' -d '{"analyzer":"ik","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
    HTTP/1.1 200 OK
    Content-Type: application/json; charset=UTF-8
    Content-Length: 725
    {
      "tokens" : [ {
        "token" : "自古",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "CN_WORD",
        "position" : 0
      }, {
        "token" : "刀",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "CN_WORD",
        "position" : 1
      }, {
        "token" : "扇",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 2
      }, {
        "token" : "过",
        "start_offset" : 4,
        "end_offset" : 5,
        "type" : "CN_CHAR",
        "position" : 3
      }, {
        "token" : "背",
        "start_offset" : 5,
        "end_offset" : 6,
        "type" : "CN_WORD",
        "position" : 4
      }, {
        "token" : "刺",
        "start_offset" : 6,
        "end_offset" : 7,
        "type" : "CN_CHAR",
        "position" : 5
      } ]
    }

    2) ik_smart analyzer

    [root@localhost custom]# curl -i -X GET -H 'Content-type:application/json' -d '{"analyzer":"ik_smart","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
    HTTP/1.1 200 OK
    Content-Type: application/json; charset=UTF-8
    Content-Length: 725
    {
      "tokens" : [ {
        "token" : "自古",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "CN_WORD",
        "position" : 0
      }, {
        "token" : "刀",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "CN_WORD",
        "position" : 1
      }, {
        "token" : "扇",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 2
      }, {
        "token" : "过",
        "start_offset" : 4,
        "end_offset" : 5,
        "type" : "CN_CHAR",
        "position" : 3
      }, {
        "token" : "背",
        "start_offset" : 5,
        "end_offset" : 6,
        "type" : "CN_WORD",
        "position" : 4
      }, {
        "token" : "刺",
        "start_offset" : 6,
        "end_offset" : 7,
        "type" : "CN_CHAR",
        "position" : 5
      } ]
    }

    3) ik_max_word analyzer

    [root@localhost custom]# curl -i -X GET -H 'Content-type:application/json' -d '{"analyzer":"ik_max_word","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
    HTTP/1.1 200 OK
    Content-Type: application/json; charset=UTF-8
    Content-Length: 725
    {
      "tokens" : [ {
        "token" : "自古",
        "start_offset" : 0,
        "end_offset" : 2,
        "type" : "CN_WORD",
        "position" : 0
      }, {
        "token" : "刀",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "CN_WORD",
        "position" : 1
      }, {
        "token" : "扇",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 2
      }, {
        "token" : "过",
        "start_offset" : 4,
        "end_offset" : 5,
        "type" : "CN_CHAR",
        "position" : 3
      }, {
        "token" : "背",
        "start_offset" : 5,
        "end_offset" : 6,
        "type" : "CN_WORD",
        "position" : 4
      }, {
        "token" : "刺",
        "start_offset" : 6,
        "end_offset" : 7,
        "type" : "CN_CHAR",
        "position" : 5
      } ]
    }

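    2. With custom dictionaries added

    Extension dictionary: supplies additional words that the analyzer should recognize as tokens.

    Stop-word dictionary: used for filtering; any word or string listed in it is dropped from the output.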

    test.dic

    刀扇
    背刺

    teststop.dic

    自古
    过
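For reference, the two dictionary files above can be created from the shell roughly as sketched below. IK dictionaries are plain UTF-8 text with one entry per line. The `DICT_DIR` variable and its `./custom` default are assumptions made for this sketch (the `custom` name is borrowed from the shell prompt in the examples); point it at the matching directory under your own plugin's config directory.

```shell
# Hypothetical location of the custom dictionaries (an assumption for this sketch).
DICT_DIR="${DICT_DIR:-./custom}"
mkdir -p "$DICT_DIR"

# Extension dictionary: one entry per line, UTF-8.
printf '%s\n' '刀扇' '背刺' > "$DICT_DIR/test.dic"

# Extension stop-word dictionary: same format.
printf '%s\n' '自古' '过' > "$DICT_DIR/teststop.dic"
```

Writing the files with `printf` keeps them free of a byte-order mark, which some editors add silently and which can stop the first entry from matching.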

    /analysis-ik/config/IKAnalyzer.cfg.xml
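The post gives only the path of this file. As a minimal sketch, a configuration that registers the two dictionaries above would look something like the following. `ext_dict` and `ext_stopwords` are the keys IK reads for local dictionaries. The `custom/` paths are an assumption inferred from the `custom` directory in the shell prompt, and are taken relative to the directory that holds IKAnalyzer.cfg.xml.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Extension dictionary (path is an assumption for this sketch) -->
    <entry key="ext_dict">custom/test.dic</entry>
    <!-- Extension stop-word dictionary (path is an assumption for this sketch) -->
    <entry key="ext_stopwords">custom/teststop.dic</entry>
</properties>
```

Local dictionaries are typically loaded when the plugin starts, so expect to restart Elasticsearch before changes to these files show up in `_analyze` results.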

    1) ik analyzer

    [root@localhost config]# curl -i -X GET -H 'Content-type:application/json' -d '{"analyzer":"ik","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
    HTTP/1.1 200 OK
    Content-Type: application/json; charset=UTF-8
    Content-Length: 728
    {
      "tokens" : [ {
        "token" : "刀扇",
        "start_offset" : 2,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 0
      }, {
        "token" : "刀",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "CN_WORD",
        "position" : 1
      }, {
        "token" : "扇",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 2
      }, {
        "token" : "背刺",
        "start_offset" : 5,
        "end_offset" : 7,
        "type" : "CN_WORD",
        "position" : 3
      }, {
        "token" : "背",
        "start_offset" : 5,
        "end_offset" : 6,
        "type" : "CN_WORD",
        "position" : 4
      }, {
        "token" : "刺",
        "start_offset" : 6,
        "end_offset" : 7,
        "type" : "CN_CHAR",
        "position" : 5
      } ]
    }

    2) ik_smart analyzer

    [root@localhost config]# curl -i -X GET -H 'Content-type:application/json' -d '{"analyzer":"ik_smart","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
    HTTP/1.1 200 OK
    Content-Type: application/json; charset=UTF-8
    Content-Length: 260
    {
      "tokens" : [ {
        "token" : "刀扇",
        "start_offset" : 2,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 0
      }, {
        "token" : "背刺",
        "start_offset" : 5,
        "end_offset" : 7,
        "type" : "CN_WORD",
        "position" : 1
      } ]
    }

    3) ik_max_word analyzer

    [root@localhost config]#  curl -i -X GET -H 'Content-type:application/json' -d '{"analyzer":"ik_max_word","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
    HTTP/1.1 200 OK
    Content-Type: application/json; charset=UTF-8
    Content-Length: 728
    {
      "tokens" : [ {
        "token" : "刀扇",
        "start_offset" : 2,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 0
      }, {
        "token" : "刀",
        "start_offset" : 2,
        "end_offset" : 3,
        "type" : "CN_WORD",
        "position" : 1
      }, {
        "token" : "扇",
        "start_offset" : 3,
        "end_offset" : 4,
        "type" : "CN_WORD",
        "position" : 2
      }, {
        "token" : "背刺",
        "start_offset" : 5,
        "end_offset" : 7,
        "type" : "CN_WORD",
        "position" : 3
      }, {
        "token" : "背",
        "start_offset" : 5,
        "end_offset" : 6,
        "type" : "CN_WORD",
        "position" : 4
      }, {
        "token" : "刺",
        "start_offset" : 6,
        "end_offset" : 7,
        "type" : "CN_CHAR",
        "position" : 5
      } ]
    }
  • Original article: https://www.cnblogs.com/geektcp/p/12263101.html