  • Elasticsearch Tutorial: Custom Analyzers (reposted for study)

    1. Analyzers

    Elasticsearch ships with many built-in analyzers, for example standard (the standard analyzer), english (English analysis), and chinese (Chinese analysis). The default is standard.

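    The standard analyzer consists of:

    standard tokenizer: splits text on word boundaries
    standard token filter: does nothing (it was removed in Elasticsearch 7.0)
    lowercase token filter: converts all letters to lowercase
    stop token filter (disabled by default): removes stop words such as a, the, it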
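
    The stages of the standard analyzer can be pictured as a small pipeline. The Python sketch below is only an approximation for illustration: the function name `standard_analyze` and the `\w+` regular expression are assumptions, and the real standard tokenizer follows Unicode word-boundary rules rather than a simple regex.

```python
import re

# Hypothetical stand-in for the standard tokenizer: the real one follows
# Unicode word-boundary rules; a \w+ regex is only a rough approximation.
WORD = re.compile(r"\w+")

def standard_analyze(text, stopwords=None):
    """Approximate the standard analyzer: tokenize, lowercase, optional stop filter."""
    tokens = WORD.findall(text)              # tokenizer: split on word boundaries
    tokens = [t.lower() for t in tokens]     # lowercase token filter
    if stopwords:                            # stop token filter (off by default)
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(standard_analyze("A Dog is in the House"))
# ['a', 'dog', 'is', 'in', 'the', 'house']
print(standard_analyze("A Dog is in the House", stopwords={"a", "the"}))
# ['dog', 'is', 'in', 'house']
```

    With no stop words supplied, every word survives, which mirrors the default behaviour of standard.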

    2. Changing analyzer settings

    Define an analyzer named es_std, based on standard, with the English stop-word token filter enabled:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "es_std":{
              "type":"standard",
              "stopwords":"_english_"
            }
          }
        }
      }
    }

    3. Testing the standard analyzer

    GET /my_index/_analyze
    {
      "analyzer": "standard",
      "text":"a dog is in the house"
    }

    Result:

    {
      "tokens": [
        {
          "token": "a",
          "start_offset": 0,
          "end_offset": 1,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "is",
          "start_offset": 6,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 2
        },
        {
          "token": "in",
          "start_offset": 9,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 3
        },
        {
          "token": "the",
          "start_offset": 12,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }

    4. Testing the configured English stop-word analyzer

    GET /my_index/_analyze
    {
      "analyzer": "es_std",
      "text":"a dog is in the house"
    }

    Result:

    {
      "tokens": [
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }
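
    Note that the surviving tokens keep their original positions (1 and 5) and offsets: the stop filter removes tokens but leaves gaps where they were. A minimal Python sketch of that behaviour follows. The function name `analyze_with_stop`, the regex tokenizer, and the stop-word set (an assumed subset of the `_english_` list, just enough for this sentence) are illustrative assumptions, not Elasticsearch code.

```python
import re

# Assumed subset of the `_english_` stop-word list, enough for this sentence.
ENGLISH_STOPWORDS = {"a", "an", "and", "in", "is", "it", "the"}

def analyze_with_stop(text, stopwords=ENGLISH_STOPWORDS):
    """Rough model of a stop filter that keeps original offsets and positions."""
    tokens = []
    # enumerate() assigns positions before filtering, so removed words leave gaps
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        if term in stopwords:
            continue
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

for tok in analyze_with_stop("a dog is in the house"):
    print(tok)
# {'token': 'dog', 'start_offset': 2, 'end_offset': 5, 'position': 1}
# {'token': 'house', 'start_offset': 16, 'end_offset': 21, 'position': 5}
```

    Because positions are preserved, position-sensitive queries such as phrase matching can still tell that "dog" and "house" were not adjacent in the original text.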

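    5. Custom analyzer

    If my_index already exists from section 2, delete it first (DELETE /my_index) or use a different index name, because creating an index that already exists fails.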

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "&_to_and": {
              "type": "mapping",
              "mappings": ["&=> and"]
            }
          },
          "filter": {
            "my_stopwords": {
              "type": "stop",
              "stopwords": ["the", "a"]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": ["html_strip", "&_to_and"],
              "tokenizer": "standard",
              "filter": ["lowercase", "my_stopwords"]
            }
          }
        }
      }
    }

    Test:

    GET /my_index/_analyze
    {
      "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
      "analyzer": "my_analyzer"
    }

    Result:

    {
      "tokens": [
        {
          "token": "tomandjerry",
          "start_offset": 0,
          "end_offset": 9,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "are",
          "start_offset": 10,
          "end_offset": 13,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "friend",
          "start_offset": 16,
          "end_offset": 22,
          "type": "<ALPHANUM>",
          "position": 3
        },
        {
          "token": "in",
          "start_offset": 23,
          "end_offset": 25,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "house",
          "start_offset": 30,
          "end_offset": 35,
          "type": "<ALPHANUM>",
          "position": 6
        },
        {
          "token": "haha",
          "start_offset": 42,
          "end_offset": 46,
          "type": "<ALPHANUM>",
          "position": 7
        }
      ]
    }
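
    The result follows from the order in which a custom analyzer runs its parts: character filters first, then the tokenizer, then the token filters. The Python sketch below mimics that order for my_analyzer. It is a rough model under stated assumptions: the regexes only approximate html_strip and the standard tokenizer, the function name `my_analyzer` is borrowed for illustration, and offsets and positions are not tracked.

```python
import re

def my_analyzer(text):
    """Rough model of my_analyzer: char filters -> tokenizer -> token filters."""
    # html_strip char filter (approximation: drop anything that looks like a tag)
    text = re.sub(r"<[^>]+>", " ", text)
    # "&_to_and" mapping char filter: replace & with "and" (no spaces are added)
    text = text.replace("&", "and")
    # standard tokenizer (approximation: runs of word characters)
    tokens = re.findall(r"\w+", text)
    # lowercase token filter
    tokens = [t.lower() for t in tokens]
    # my_stopwords token filter
    return [t for t in tokens if t not in {"the", "a"}]

print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
# ['tomandjerry', 'are', 'friend', 'in', 'house', 'haha']
```

    Since the & is replaced before tokenization and no spaces are inserted, "tom&jerry" becomes the single token "tomandjerry" rather than three separate tokens.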

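    6. Using the analyzer in a type

    Assign the custom analyzer to a text field in the mapping. The request below uses a mapping type (my_type), which applies to older Elasticsearch versions; on 7.x and later, where mapping types are deprecated or removed, the path is PUT /my_index/_mapping instead.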

    PUT /my_index/_mapping/my_type
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  • Original article: https://www.cnblogs.com/yfb918/p/10718712.html