  • Analyzers

    The IK analyzer

    Installing the IK analyzer on each platform

    elasticsearch-analysis-ik releases: https://github.com/medcl/elasticsearch-analysis-ik/releases

    • Windows: create an ik directory under the plugins directory of the elasticsearch installation, and extract the archive downloaded from GitHub into it.
    • Mac: same as Windows — create an ik directory under plugins and extract the archive into it.
    • CentOS: cd into an ik directory under plugins in the elasticsearch installation, then download and extract:
    [root@cs ik]# wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.4/elasticsearch-analysis-ik-6.5.4.zip
    [root@cs ik]# unzip  elasticsearch-analysis-ik-6.5.4.zip
    [root@cs ik]# ll
    total 5832
    -rw-r--r--. 1 root root  263965 May  6  2018 commons-codec-1.9.jar
    -rw-r--r--. 1 root root   61829 May  6  2018 commons-logging-1.2.jar
    drwxr-xr-x. 2 root root    4096 Aug 26  2018 config
    -rw-r--r--. 1 root root   54693 Dec 23  2018 elasticsearch-analysis-ik-6.5.4.jar
    -rw-r--r--. 1 root root 4504539 Dec 23  2018 elasticsearch-analysis-ik-6.5.4.zip
    -rw-r--r--. 1 root root  736658 May  6  2018 httpclient-4.5.2.jar
    -rw-r--r--. 1 root root  326724 May  6  2018 httpcore-4.4.4.jar
    -rw-r--r--. 1 root root    1805 Dec 23  2018 plugin-descriptor.properties
    -rw-r--r--. 1 root root     125 Dec 23  2018 plugin-security.policy
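
    Alternatively, the plugin can be installed with the elasticsearch-plugin tool that ships with elasticsearch. A setup sketch — the version in the URL must match your elasticsearch version exactly:

    ```shell
    # Run from the elasticsearch installation directory.
    bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.5.4/elasticsearch-analysis-ik-6.5.4.zip
    ```

    Either way, elasticsearch must be restarted before the plugin is picked up.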
    

    Testing

    • First, restart the elasticsearch and kibana services.
    • Then open http://localhost:5601 (the kibana address) in a browser, enter the request in the left pane of the Console under Dev Tools, and click the green run button.
    GET _analyze
    {
      "analyzer": "ik_max_word",
      "text": "学不学的会靠天收"
    }
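
    IK actually registers two analyzers: ik_max_word, which exhaustively emits every dictionary word it can find in the text (typically used at index time), and ik_smart, which produces a coarser segmentation (often used for search queries). The same request can be repeated with ik_smart to compare the two:

    ```
    GET _analyze
    {
      "analyzer": "ik_smart",
      "text": "学不学的会靠天收"
    }
    ```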
    

    A brief look at the ik directory

    Here is a quick overview of the IK configuration and dictionary files:

    • IKAnalyzer.cfg.xml: configures custom dictionaries.
    • main.dic: IK's built-in Chinese dictionary, with roughly 270,000 entries; any phrase listed here is kept together as a single token.
    • surname.dic: Chinese surnames.
    • suffix.dic: special (suffix) nouns, e.g. 乡, 江, 所, 省.
    • preposition.dic: Chinese function words, e.g. 不, 也, 了, 仍.
    • stopword.dic: English stop words, e.g. a, an, and, the.
    • quantifier.dic: measure words and units, e.g. 厘米, 件, 倍, 像素.
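
    Custom dictionaries are wired up in IKAnalyzer.cfg.xml. A minimal sketch, assuming a hypothetical custom dictionary file my.dic placed in the same config directory (one word per line):

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
    <properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- custom word dictionaries; separate multiple files with semicolons -->
        <entry key="ext_dict">my.dic</entry>
        <!-- custom stop word dictionaries -->
        <entry key="ext_stopwords"></entry>
    </properties>
    ```

    Restart elasticsearch after editing the dictionaries so the changes take effect.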

    Other analyzers

    Simple Analyzer – splits on non-letter characters (digits and symbols are discarded) and lowercases the tokens

    GET _analyze
    {
      "analyzer": "simple",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    

    Result

    {
      "tokens" : [
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "quick",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "brown",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "foxes",
          "start_offset" : 22,
          "end_offset" : 27,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "leap",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "over",
          "start_offset" : 33,
          "end_offset" : 37,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "lazy",
          "start_offset" : 38,
          "end_offset" : 42,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "dogs",
          "start_offset" : 43,
          "end_offset" : 47,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "in",
          "start_offset" : 48,
          "end_offset" : 50,
          "type" : "word",
          "position" : 8
        },
        {
          "token" : "the",
          "start_offset" : 51,
          "end_offset" : 54,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "summer",
          "start_offset" : 55,
          "end_offset" : 61,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "evening",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "word",
          "position" : 11
        }
      ]
    }
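
    The simple analyzer's behavior can be approximated outside Elasticsearch. A rough Python sketch — not the actual Lucene implementation, just the same splitting rule:

    ```python
    import re

    def simple_analyze(text):
        # Split on any run of non-letter characters and lowercase each token,
        # approximating Elasticsearch's "simple" analyzer.
        return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

    tokens = simple_analyze("2 running Quick brown-foxes leap over lazy dogs in the summer evening.")
    ```

    Note that the leading "2" disappears and "brown-foxes" splits in two, matching the output above.
    
    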
    

    Stop Analyzer – lowercases and filters out stop words (the, a, is, ...)

    GET _analyze
    {
      "analyzer": "stop",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    

    Result

    {
      "tokens" : [
        {
          "token" : "running",
          "start_offset" : 2,
          "end_offset" : 9,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "quick",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "brown",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "foxes",
          "start_offset" : 22,
          "end_offset" : 27,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "leap",
          "start_offset" : 28,
          "end_offset" : 32,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "over",
          "start_offset" : 33,
          "end_offset" : 37,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "lazy",
          "start_offset" : 38,
          "end_offset" : 42,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "dogs",
          "start_offset" : 43,
          "end_offset" : 47,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "summer",
          "start_offset" : 55,
          "end_offset" : 61,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "evening",
          "start_offset" : 62,
          "end_offset" : 69,
          "type" : "word",
          "position" : 11
        }
      ]
    }
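
    The stop analyzer is the simple analyzer plus a stop-word filter. A rough Python approximation, using the stop-word list that Lucene's English default is based on (an assumption for illustration, not the engine's actual code):

    ```python
    import re

    # Lucene's default English stop word set.
    ENGLISH_STOP_WORDS = {
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with",
    }

    def stop_analyze(text):
        # Tokenize like the simple analyzer, then drop stop words.
        tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]
        return [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    ```

    Compared with the simple analyzer, "in" and "the" are gone, but their position numbers still leave gaps in the output above.
    
    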
    

    The remaining analyzers, in order:

    Whitespace Analyzer – splits on whitespace only; does not lowercase

    Keyword Analyzer – no tokenization; the entire input becomes a single token

    Pattern Analyzer – splits by regular expression; the default is \W+ (non-word characters)

    Language – analyzers for more than 30 common languages
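
    The three simplest of these can also be approximated in a few lines of Python — again only a sketch of the splitting rules, not the real implementations:

    ```python
    import re

    text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."

    # whitespace: split on whitespace only; case is preserved.
    whitespace_tokens = text.split()

    # keyword: the whole input is emitted as one token.
    keyword_tokens = [text]

    # pattern: split on \W+ (the default pattern); lowercased by default.
    pattern_tokens = [t.lower() for t in re.split(r"\W+", text) if t]
    ```

    Note how the whitespace analyzer keeps "Quick" capitalized and "brown-foxes" intact, while the pattern analyzer keeps the digit "2" that the simple analyzer discarded.
    
    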

    #Compare how the different analyzers tokenize the same text
    #standard
    GET _analyze
    {
      "analyzer": "standard",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #simple
    GET _analyze
    {
      "analyzer": "simple",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    
    #stop
    GET _analyze
    {
      "analyzer": "stop",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #whitespace
    GET _analyze
    {
      "analyzer": "whitespace",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #keyword
    GET _analyze
    {
      "analyzer": "keyword",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #pattern
    GET _analyze
    {
      "analyzer": "pattern",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    
    #english
    GET _analyze
    {
      "analyzer": "english",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    
    POST _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "他说的确实在理"
    }
    
    
    POST _analyze
    {
      "analyzer": "standard",
      "text": "他说的确实在理"
    }
    
    
    POST _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "这个苹果不大好吃"
    }
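
    Note that icu_analyzer is not available out of the box: it comes from the analysis-icu plugin, which must be installed (followed by a restart of elasticsearch) before the requests above will work:

    ```shell
    bin/elasticsearch-plugin install analysis-icu
    ```

    Without the plugin, the icu_analyzer requests fail, while the standard analyzer splits the Chinese text into single characters — which is why the two results differ.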
    
    
  • Original post: https://www.cnblogs.com/lovelifest/p/14325480.html