  • Analyzers

    一、Elasticsearch Built-in Analyzers

    #Simple Analyzer – splits on non-letter characters (symbols are discarded), lowercases tokens
    #Stop Analyzer – lowercases and removes stop words (the, a, is, ...)
    #Whitespace Analyzer – splits on whitespace, does not lowercase
    #Keyword Analyzer – no tokenization; the whole input becomes a single token
    #Pattern Analyzer – regular-expression based, default \W+ (split on non-word characters)
    #Language – analyzers for 30+ common languages

    1. Standard Analyzer

    2. Simple Analyzer

    3. Whitespace Analyzer

    4. Stop Analyzer

    5. Keyword Analyzer

    6. Pattern Analyzer
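    The splitting rules listed above can be sketched in plain Python. This is an illustration only: the real analyzers are Lucene implementations that also assign token positions and offsets.

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."

def simple_analyzer(text):
    # Simple: split on any non-letter character (digits and symbols dropped), lowercase
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

def whitespace_analyzer(text):
    # Whitespace: split on whitespace only; case and punctuation are preserved
    return text.split()

def keyword_analyzer(text):
    # Keyword: no tokenization, the whole input is one token
    return [text]

print(simple_analyzer(TEXT))
# ['running', 'quick', 'brown', 'foxes', 'leap', 'over', 'lazy',
#  'dogs', 'in', 'the', 'summer', 'evening']  -- the digit "2" is dropped
```

    Note how `brown-foxes` stays a single token under the whitespace analyzer but splits into two under the simple analyzer.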

    #2 running Quick brown-foxes leap over lazy dogs in the summer evening
    
    #Compare the output of the different analyzers
    #standard
    GET _analyze
    {
      "analyzer": "standard",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #simple
    GET _analyze
    {
      "analyzer": "simple",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    
    #stop
    GET _analyze
    {
      "analyzer": "stop",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    
    #whitespace
    GET _analyze
    {
      "analyzer": "whitespace",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #keyword
    GET _analyze
    {
      "analyzer": "keyword",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    #pattern
    GET _analyze
    {
      "analyzer": "pattern",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    
    #english
    GET _analyze
    {
      "analyzer": "english",
      "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
    }
    
    
    POST _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "他说的确实在理”"
    }
    
    
    POST _analyze
    {
      "analyzer": "standard",
      "text": "他说的确实在理”"
    }
    
    
    POST _analyze
    {
      "analyzer": "icu_analyzer",
      "text": "这个苹果不大好吃"
    }
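    A note on the two Chinese examples above: the standard analyzer has no knowledge of Chinese words and emits one token per CJK character, which is why `icu_analyzer` gives more useful results. A tiny sketch of the standard-analyzer behavior (the exact `icu_analyzer` segmentation depends on its dictionaries, so it is not shown here; run the `GET _analyze` requests to see the real output):

```python
# The standard analyzer splits CJK text into single-character tokens.
# Emulated here for illustration only.
text = "他说的确实在理"
standard_tokens = list(text)  # one token per character
print(standard_tokens)
# ['他', '说', '的', '确', '实', '在', '理']
```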

    二、Chinese Analysis: the ICU Analyzer

    The icu_analyzer is not built in; it comes from the analysis-icu plugin (install it with bin/elasticsearch-plugin install analysis-icu and restart the node before running the requests below).

    //Test by specifying the analyzer directly
    GET _analyze
    {
      "analyzer":"icu_analyzer",
      "text":"你好中国"
    }

    2. Other Chinese analysis plugins

    三、Custom Analyzers

    //Define a custom analyzer
    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "&_to_and": {
              "type": "mapping",
              "mappings": [ "&=> and " ]
            }
          },
          "filter": {
            "my_stopwords": {
              "type": "stop",
              "stopwords": [ "the", "a" ]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": [ "html_strip", "&_to_and" ],
              "tokenizer": "standard",
              "filter": [ "lowercase", "my_stopwords" ]
            }
          }
        }
      }
    }
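    The my_analyzer pipeline (char_filter -> tokenizer -> token filters) can be approximated in plain Python to see what actually reaches the index. This is a rough sketch: the two regexes only approximate html_strip and the standard tokenizer.

```python
import re

STOPWORDS = {"the", "a"}

def my_analyzer(text):
    text = re.sub(r"<[^>]+>", "", text)        # crude html_strip char filter
    text = text.replace("&", " and ")          # the "&_to_and" mapping char filter
    tokens = re.findall(r"\w+", text)          # rough standard tokenizer
    tokens = [t.lower() for t in tokens]       # lowercase token filter
    return [t for t in tokens if t not in STOPWORDS]  # my_stopwords filter

print(my_analyzer("The quick & brown fox"))
# ['quick', 'and', 'brown', 'fox']
```

    The "The" is lowercased and then removed as a stop word, while "&" survives as the literal token "and" because the char filter runs before tokenization.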
    
    //Set the mapping
    PUT /my_index/_mapping
    {
       "properties":{
            "username":{
                 "type":"text",
                 "analyzer" : "my_analyzer"
             },
            "password" : {
              "type" : "text"
            }
        
      }
    }
    
    //Index a document
    PUT /my_index/_doc/1
    {
      "username":"The quick & brown fox ",
      "password":"The quick & brown fox "
    }
    
    //Verify
    GET my_index/_analyze
    {
      "field":"username",
      "text":"The quick & brown fox"
    }
    
    GET my_index/_analyze
    {
      "field":"password",
      "text":"The quick & brown fox"
    }
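    For contrast, the password field has no analyzer set, so it falls back to the standard analyzer: lowercase, split on non-word characters, "&" dropped, and no stop-word removal by default. A rough emulation on the same input (illustration only; the real tokenizer is Lucene's StandardTokenizer):

```python
import re

# Approximate standard-analyzer output for the "password" field
text = "The quick & brown fox"
standard = [t.lower() for t in re.findall(r"\w+", text)]
print(standard)
# ['the', 'quick', 'brown', 'fox']
```

    Compare this with the username field, where my_analyzer keeps "and" and drops "the".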

    四、The ik Analysis Plugin

    1. Download from https://github.com/medcl/elasticsearch-analysis-ik/releases; the plugin version must match your Elasticsearch version.

    2. Create an analysis-ik directory under the plugins folder.

    3. Copy the downloaded zip into analysis-ik and unzip it there.

    4. Start Elasticsearch.

    //ik_max_word: exhaustive, fine-grained segmentation
    //ik_smart: coarse-grained segmentation
    POST _analyze
    {
      "analyzer": "ik_max_word",
      "text": ["剑桥分析公司多位高管对卧底记者说,他们确保了唐纳德·特朗普在总统大选中获胜"]
    } 

    五、The hanlp Analysis Plugin

    1. Download: https://github.com/KennFalcon/elasticsearch-analysis-hanlp

    Original article: https://www.cnblogs.com/zd1994/p/12650261.html