  • Elasticsearch (ES)

    Elasticsearch built-in analyzers

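    • Standard Analyzer: the default analyzer; splits text on word boundaries and lowercases the tokens
    • Simple Analyzer: splits on any non-letter character (digits and symbols are discarded) and lowercases the tokens
    • Stop Analyzer: lowercases the tokens and removes stop words (the, is, a)
    • Whitespace Analyzer: splits on whitespace; does not lowercase
    • Keyword Analyzer: no tokenization; the whole input is emitted as a single token
    • Pattern Analyzer: splits on a regular expression; the default is \W+ (runs of non-word characters)
    • Language Analyzers: analyzers for more than 30 languages
    • Custom Analyzer: a user-defined analyzer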

    Standard Analyzer (the default analyzer)

    Splits text on word boundaries and lowercases the tokens.

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "Trying Out Kibana! "
    }
    
    Result
    {
      "tokens" : [
        {
          "token" : "trying",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "out",
          "start_offset" : 7,
          "end_offset" : 10,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "kibana",
          "start_offset" : 11,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 2
        }
      ]
    }
    
    

    Simple Analyzer

    Splits on any non-letter character (digits and symbols are discarded) and lowercases the tokens.

    GET /_analyze
    {
      "analyzer": "simple",
      "text": "Try78ing 12 Out 1212 Kib45ana! "
    }
    
    Result
    {
      "tokens" : [
        {
          "token" : "try",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "ing",
          "start_offset" : 5,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "out",
          "start_offset" : 12,
          "end_offset" : 15,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "kib",
          "start_offset" : 21,
          "end_offset" : 24,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "ana",
          "start_offset" : 26,
          "end_offset" : 29,
          "type" : "word",
          "position" : 4
        }
      ]
    }
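The splitting rule above can be approximated in a few lines of Python. This is only a rough sketch for ASCII input, not how Elasticsearch (Lucene) implements the analyzer; the function name `simple_analyze` is made up for illustration.

```python
import re

def simple_analyze(text):
    """Approximate the simple analyzer for ASCII text:
    keep runs of letters, lowercase them, record offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

for tok in simple_analyze("Try78ing 12 Out 1212 Kib45ana! "):
    print(tok)
```

Because digits count as separators here, "Try78ing" falls apart into "try" and "ing", which matches the offsets in the result above.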
    
    

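    Stop Analyzer

    Splits on non-letter characters, lowercases the tokens, and removes stop words (the, is, a). The sample text below contains no stop words, so the output is the same as for the Simple Analyzer.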

    GET /_analyze
    {
      "analyzer": "stop",
      "text": "Try78ing 12 Out 1212 Kib45ana! "
    }
    
    
    Result
    
    {
      "tokens" : [
        {
          "token" : "try",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "ing",
          "start_offset" : 5,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "out",
          "start_offset" : 12,
          "end_offset" : 15,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "kib",
          "start_offset" : 21,
          "end_offset" : 24,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "ana",
          "start_offset" : 26,
          "end_offset" : 29,
          "type" : "word",
          "position" : 4
        }
      ]
    }
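To see the stop-word filtering actually remove something, here is a rough Python approximation. It is a sketch under assumptions: `STOP_WORDS` is a small illustrative subset rather than the full default English list, the function name `stop_analyze` is hypothetical, and only ASCII letters are handled.

```python
import re

# Illustrative subset only (an assumption); the real default English list is longer.
STOP_WORDS = {"a", "an", "and", "is", "the", "of", "to"}

def stop_analyze(text, stop_words=STOP_WORDS):
    """Split on non-letters, lowercase, then drop stop words."""
    lowered = (m.group().lower() for m in re.finditer(r"[A-Za-z]+", text))
    return [tok for tok in lowered if tok not in stop_words]

# No stop words in this text, so nothing is removed.
print(stop_analyze("Try78ing 12 Out 1212 Kib45ana! "))
# "the", "is" and "a" are dropped here.
print(stop_analyze("The fox is a hunter"))
```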
    
    

    Whitespace Analyzer

    Splits on whitespace only; tokens are not lowercased.

    GET /_analyze
    {
      "analyzer": "whitespace",
      "text": "Try78ing 12 Out 1212 Kib45ana! "
    }
    
    Result
    {
      "tokens" : [
        {
          "token" : "Try78ing",
          "start_offset" : 0,
          "end_offset" : 8,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "12",
          "start_offset" : 9,
          "end_offset" : 11,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "Out",
          "start_offset" : 12,
          "end_offset" : 15,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "1212",
          "start_offset" : 16,
          "end_offset" : 20,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "Kib45ana!",
          "start_offset" : 21,
          "end_offset" : 30,
          "type" : "word",
          "position" : 4
        }
      ]
    }
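A comparable Python sketch of the whitespace rule, again only an approximation with a hypothetical function name, shows why case, digits and the trailing "!" all survive:

```python
import re

def whitespace_analyze(text):
    """Split on whitespace only; case and punctuation are left untouched.
    Returns (token, start_offset, end_offset) tuples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

for token in whitespace_analyze("Try78ing 12 Out 1212 Kib45ana! "):
    print(token)
```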
    
    
    

    Keyword Analyzer

    No tokenization; the whole input is emitted as a single token.

    GET /_analyze
    {
      "analyzer": "keyword",
      "text": "Try78ing 12 Out 1212 Kib45ana! "
    }
    Result
    {
      "tokens" : [
        {
          "token" : "Try78ing 12 Out 1212 Kib45ana! ",
          "start_offset" : 0,
          "end_offset" : 31,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    
    

    Pattern Analyzer

    Splits on a regular expression; the default pattern is \W+ (runs of non-word characters). Tokens are lowercased.

    GET /_analyze
    {
      "analyzer": "pattern",
      "text": "Try78ing 12 Out 1212 Kib45ana! "
    }
    Result
    {
      "tokens" : [
        {
          "token" : "try78ing",
          "start_offset" : 0,
          "end_offset" : 8,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "12",
          "start_offset" : 9,
          "end_offset" : 11,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "out",
          "start_offset" : 12,
          "end_offset" : 15,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "1212",
          "start_offset" : 16,
          "end_offset" : 20,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "kib45ana",
          "start_offset" : 21,
          "end_offset" : 29,
          "type" : "word",
          "position" : 4
        }
      ]
    }
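The same splitting can be tried out with Python's `re` module. This is a sketch, not the Elasticsearch implementation: the function name `pattern_analyze` is made up, and Python's `\W` is Unicode-aware while Java's is ASCII-only by default, so results may differ for non-ASCII text (for this ASCII sample they agree).

```python
import re

def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text wherever the pattern matches; optionally lowercase the tokens."""
    tokens = [tok for tok in re.split(pattern, text) if tok]  # drop empty strings
    return [tok.lower() for tok in tokens] if lowercase else tokens

# Default pattern: digits are word characters, so "Try78ing" stays in one piece.
print(pattern_analyze("Try78ing 12 Out 1212 Kib45ana! "))
# A different separator pattern, here commas and semicolons.
print(pattern_analyze("Red,Green;Blue", pattern=r"[,;]"))
```

Unlike the Simple Analyzer, digits are kept because `\W` only matches non-word characters, and digits count as word characters.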
    
    
    

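    Language Analyzers

    Elasticsearch provides analyzers for more than 30 languages.

    Custom Analyzer

    A user-defined analyzer, assembled from character filters, a tokenizer, and token filters.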
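As a minimal sketch of what a custom analyzer definition might look like, the Python snippet below builds the request bodies and prints them in console form. The index name `my_index` and the analyzer name `my_custom_analyzer` are placeholders, and the particular combination of `html_strip`, `standard`, `lowercase` and `stop` is just one possible choice, not something the original post specifies.

```python
import json

# Placeholder names (assumptions): pick your own index and analyzer names.
INDEX_NAME = "my_index"
ANALYZER_NAME = "my_custom_analyzer"

# Index settings that define a custom analyzer from built-in components.
settings_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                ANALYZER_NAME: {
                    "type": "custom",
                    "char_filter": ["html_strip"],   # strip HTML tags first
                    "tokenizer": "standard",         # then split on word boundaries
                    "filter": ["lowercase", "stop"], # then lowercase and drop stop words
                }
            }
        }
    }
}

# Request body for testing the analyzer through the index's _analyze endpoint.
analyze_body = {"analyzer": ANALYZER_NAME, "text": "<b>Trying</b> Out Kibana!"}

print(f"PUT /{INDEX_NAME}")
print(json.dumps(settings_body, indent=2))
print(f"GET /{INDEX_NAME}/_analyze")
print(json.dumps(analyze_body, indent=2))
```

The printed requests can be pasted into the Kibana console. A custom analyzer lives in the settings of one index, so the test request goes to that index's `_analyze` endpoint rather than the global `/_analyze`.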

  • Original source: https://www.cnblogs.com/smallyi/p/13430614.html