  • Elasticsearch Tutorial: Custom Analyzers (reposted for study purposes)

    1. Analyzers

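    Elasticsearch ships with many built-in analyzers, for example standard (the standard analyzer), english (English text), and cjk (Chinese, Japanese, and Korean text). The default is standard, which is built from the following components: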

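    standard tokenizer: splits text on word boundaries
    standard token filter: does nothing (a no-op)
    lowercase token filter: converts all letters to lowercase
    stop token filter (disabled by default): removes stop words such as a, the, and it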
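
    To make the pipeline concrete, here is a minimal Python sketch of the stages listed above (the no-op standard token filter is omitted). It is an approximation, not Elasticsearch's implementation: the regular expression \w+ stands in for the Unicode word-boundary rules of the real standard tokenizer, and the function name standard_analyze is hypothetical.

```python
import re

def standard_analyze(text, stopwords=()):
    """Rough imitation of the standard analyzer: tokenize, lowercase, optional stop filter."""
    # Stage 1: tokenizer. \w+ is a simplification of Unicode word-boundary segmentation.
    tokens = re.findall(r"\w+", text)
    # Stage 2: lowercase token filter.
    tokens = [t.lower() for t in tokens]
    # Stage 3: stop token filter. Empty by default, mirroring "disabled by default".
    return [t for t in tokens if t not in stopwords]

print(standard_analyze("A Dog is in THE house"))
```

    With the default empty stop list every word survives; passing a set such as {"a", "the"} as stopwords switches the last stage on.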

    2. Changing analyzer settings

    Define an analyzer named es_std, based on standard, with the English stop-word token filter enabled:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "es_std":{
              "type":"standard",
              "stopwords":"_english_"
            }
          }
        }
      }
    }

    3. Testing the standard analyzer

    GET /my_index/_analyze
    {
      "analyzer": "standard",
      "text":"a dog is in the house"
    }

    Result:

    {
      "tokens": [
        {
          "token": "a",
          "start_offset": 0,
          "end_offset": 1,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "is",
          "start_offset": 6,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 2
        },
        {
          "token": "in",
          "start_offset": 9,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 3
        },
        {
          "token": "the",
          "start_offset": 12,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }

    4. Testing the configured English stop-word analyzer (es_std)

    GET /my_index/_analyze
    {
      "analyzer": "es_std",
      "text":"a dog is in the house"
    }

    Result:

    {
      "tokens": [
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }
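
    Note that the surviving tokens keep positions 1 and 5: the stop filter drops tokens without renumbering the rest. The Python sketch below reproduces that behavior under simplifying assumptions. The stop-word set is an illustrative subset rather than the full _english_ list, \w+ approximates the standard tokenizer, and analyze_with_stopwords is a hypothetical name.

```python
import re

# Illustrative subset of English stop words; the real _english_ list is longer.
STOPWORDS = {"a", "an", "and", "in", "is", "it", "the"}

def analyze_with_stopwords(text, stopwords=STOPWORDS):
    """Return token dicts shaped like the _analyze response, minus stop words."""
    result = []
    # enumerate() assigns positions before filtering, so removed words leave gaps.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        token = match.group().lower()
        if token in stopwords:
            continue
        result.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return result

for entry in analyze_with_stopwords("a dog is in the house"):
    print(entry)
```

    The position gaps matter for phrase queries, which rely on the distance between token positions.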

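    5. Custom analyzer

    This request creates my_index, so it fails if the index from section 2 still exists; delete it first with DELETE /my_index.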

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "&_to_and": {
              "type": "mapping",
              "mappings": ["&=> and"]
            }
          },
          "filter": {
            "my_stopwords": {
              "type": "stop",
              "stopwords": ["the", "a"]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": ["html_strip", "&_to_and"],
              "tokenizer": "standard",
              "filter": ["lowercase", "my_stopwords"]
            }
          }
        }
      }
    }

    Test:

    GET /my_index/_analyze
    {
      "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
      "analyzer": "my_analyzer"
    }

    Result:

    {
      "tokens": [
        {
          "token": "tomandjerry",
          "start_offset": 0,
          "end_offset": 9,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "are",
          "start_offset": 10,
          "end_offset": 13,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "friend",
          "start_offset": 16,
          "end_offset": 22,
          "type": "<ALPHANUM>",
          "position": 3
        },
        {
          "token": "in",
          "start_offset": 23,
          "end_offset": 25,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "house",
          "start_offset": 30,
          "end_offset": 35,
          "type": "<ALPHANUM>",
          "position": 6
        },
        {
          "token": "haha",
          "start_offset": 42,
          "end_offset": 46,
          "type": "<ALPHANUM>",
          "position": 7
        }
      ]
    }
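
    The order of the stages explains this output: character filters rewrite the raw text first, then the tokenizer splits it, then the token filters run. A rough Python sketch of that chain follows. It assumes a regex can stand in for html_strip and for the standard tokenizer, which is a simplification, and the function name my_analyzer simply mirrors the analyzer name above.

```python
import re

def my_analyzer(text):
    """Approximate the my_analyzer chain: char filters, tokenizer, token filters."""
    # char_filter 1: html_strip, simplified here to blanking out anything tag-shaped.
    text = re.sub(r"<[^>]+>", " ", text)
    # char_filter 2: the "&_to_and" mapping; as in the result above, no spaces are added.
    text = text.replace("&", "and")
    tokens = []
    # tokenizer + lowercase filter; positions are assigned before stop-word removal.
    for position, word in enumerate(re.findall(r"\w+", text)):
        word = word.lower()
        # my_stopwords filter: drops only "the" and "a".
        if word not in {"the", "a"}:
            tokens.append((word, position))
    return tokens

print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
```

    Because only "the" and "a" are configured as stop words, "are" and "in" survive here, unlike with the _english_ list in section 4.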

    6. Using the analyzer in a type mapping

    PUT /my_index/_mapping/my_type
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
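
    The request above uses a mapping type (my_type) in the path. Mapping types were removed in Elasticsearch 7.0, where the same body is sent to /my_index/_mapping instead. The sketch below only builds the request for either form and sends nothing; mapping_request is a hypothetical helper, not part of any Elasticsearch client.

```python
import json

def mapping_request(index, field, analyzer, doc_type=None):
    """Build (method, path, body) for a put-mapping call; nothing is sent."""
    # With doc_type the path matches the request above (pre-7.0 style);
    # without it the path is the typeless form used from 7.0 onward.
    path = f"/{index}/_mapping" + (f"/{doc_type}" if doc_type else "")
    body = {"properties": {field: {"type": "text", "analyzer": analyzer}}}
    return "PUT", path, json.dumps(body, indent=2)

method, path, body = mapping_request("my_index", "content", "my_analyzer", "my_type")
print(method, path)
print(body)
```

    Omitting the last argument yields the typeless path, so the same helper covers both styles.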
  • Original article: https://www.cnblogs.com/yfb918/p/10718712.html