zoukankan      html  css  js  c++  java
  • ElasticSearch教程——自定义分词器(转学习使用)

    一、分词器

    Elasticsearch中,内置了很多分词器(analyzers),例如standard(标准分词器)、english(英文分词)和chinese(中文分词),默认是standard.

    standard tokenizer:以单词边界进行切分
    standard token filter:什么都不做
    lowercase token filter:将所有字母转换为小写
    stop token filer(默认被禁用):移除停用词,比如a the it等等

    二、修改分词器设置

    启用english,停用词token filter

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "es_std":{
              "type":"standard",
              "stopwords":"_english_"
            }
          }
        }
      }
    }

    三、标准分词测试代码

    GET /my_index/_analyze
    {
      "analyzer": "standard",
      "text":"a dog is in the house"
    }

    结果:

    {
      "tokens": [
        {
          "token": "a",
          "start_offset": 0,
          "end_offset": 1,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "is",
          "start_offset": 6,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 2
        },
        {
          "token": "in",
          "start_offset": 9,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 3
        },
        {
          "token": "the",
          "start_offset": 12,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }

    四、设置的英文分词测试代码

    GET /my_index/_analyze
    {
    
      "analyzer": "es_std",
    
      "text":"a dog is in the house"
    
    }

    结果:

    {
      "tokens": [
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }

    五、自定义分词器

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "&_to_and": {
              "type": "mapping",
              "mappings": ["&=> and"]
            }
          },
          "filter": {
            "my_stopwords": {
              "type": "stop",
              "stopwords": ["the", "a"]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": ["html_strip", "&_to_and"],
              "tokenizer": "standard",
              "filter": ["lowercase", "my_stopwords"]
            }
          }
        }
      }
    }

    测试:

    GET /my_index/_analyze
    {
      "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
      "analyzer": "my_analyzer"
    }

    结果:

    {
      "tokens": [
        {
          "token": "tomandjerry",
          "start_offset": 0,
          "end_offset": 9,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "are",
          "start_offset": 10,
          "end_offset": 13,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "friend",
          "start_offset": 16,
          "end_offset": 22,
          "type": "<ALPHANUM>",
          "position": 3
        },
        {
          "token": "in",
          "start_offset": 23,
          "end_offset": 25,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "house",
          "start_offset": 30,
          "end_offset": 35,
          "type": "<ALPHANUM>",
          "position": 6
        },
        {
          "token": "haha",
          "start_offset": 42,
          "end_offset": 46,
          "type": "<ALPHANUM>",
          "position": 7
        }
      ]
    }

    六、type中的使用

    PUT /my_index/_mapping/my_type
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  • 相关阅读:
    SQL Server 2008 安装过程中遇到“性能计数器注册表配置单元一致性”检查失败 问题的解决方法
    备份还原工具—ghost
    太多的if,太多的痛苦
    C#中使用GUID
    WinForm开发中,将Excel文件导入到DataGridView中时,获取Excel中所有表格的名称。
    使用ASP调用C#写的COM+组件
    COM+ and the .NET Framework 虽是英文但比较全面
    在C#中使用COM+实现事务控制
    COM+ and the .NET Framework
    管理员ID过期,无人能够管理Domino服务器
  • 原文地址:https://www.cnblogs.com/yfb918/p/10718712.html
Copyright © 2011-2022 走看看