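  • 62. Modifying an analyzer and creating a custom analyzer

    Key points

    • Modifying an analyzer
    • Creating a custom analyzer by hand

    I. Modifying an analyzer

    1. The default analyzer, standard, consists of the following four components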

       

    • standard tokenizer: splits text on word boundaries
    • standard token filter: does nothing
    • lowercase token filter: converts all letters to lowercase
    • stop token filter (disabled by default): removes stop words such as a, the, and it
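
    The four components above can be approximated in a few lines of Python. This is only a rough sketch and not Elasticsearch's implementation: the regular expression stands in for the real Unicode word-boundary tokenizer, and all function names are made up for illustration.

```python
import re


def standard_tokenizer(text):
    """Approximate word-boundary splitting with a regex (an assumption;
    the real tokenizer follows Unicode text segmentation rules)."""
    return re.findall(r"\w+", text)


def standard_filter(tokens):
    """The standard token filter does nothing, so tokens pass through."""
    return list(tokens)


def lowercase_filter(tokens):
    """Convert every token to lowercase."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stopwords):
    """Drop every token that appears in the stop word set."""
    return [t for t in tokens if t not in stopwords]


def standard_analyzer(text, stopwords=()):
    """Chain the steps in order; the empty default mirrors the stop
    filter being disabled by default."""
    tokens = standard_tokenizer(text)
    tokens = standard_filter(tokens)
    tokens = lowercase_filter(tokens)
    return stop_filter(tokens, set(stopwords))


if __name__ == "__main__":
    print(standard_analyzer("The Quick-Brown FOX"))
    print(standard_analyzer("The Quick-Brown FOX", stopwords=["the"]))
```

    With no stop words configured, every word survives and is only lowercased; passing a stop word list removes the matching tokens after lowercasing.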

       

    2. Changing the analyzer settings

    Enable the English stop word token filter

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "es_std": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
      }
    }

    Test the modified analyzer against the built-in standard analyzer

    GET /my_index/_analyze
    {
      "analyzer": "standard",
      "text": "a dog is in the house"
    }

    GET /my_index/_analyze
    {
      "analyzer": "es_std",
      "text": "a dog is in the house"
    }
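
    The difference these two requests should show can be reproduced offline. The sketch below is an approximation, not Elasticsearch code: ENGLISH_STOP_WORDS is assumed to match the word list behind _english_ (Lucene's default English stop set, worth checking against the documentation for your version), and analyze is a hypothetical helper that lowercases the text and splits it on non-word characters.

```python
import re

# Assumed contents of the _english_ stop word list (Lucene's default English set).
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def analyze(text, stopwords=frozenset()):
    """Hypothetical stand-in for _analyze: lowercase, split, drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    text = "a dog is in the house"
    print(analyze(text))                      # like "standard": nothing removed
    print(analyze(text, ENGLISH_STOP_WORDS))  # like "es_std": stop words removed
```

    Under these assumptions the first call keeps all six words, while the second keeps only dog and house, because a, is, in, and the are all in the stop word set.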

       

    II. Building a custom analyzer

    my_index already exists from part I, and creating an index that already exists returns an error, so delete it first.

    DELETE /my_index

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "&_to_and": {
              "type": "mapping",
              "mappings": ["&=> and"]
            }
          },
          "filter": {
            "my_stopwords": {
              "type": "stop",
              "stopwords": ["the", "a"]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": ["html_strip", "&_to_and"],
              "tokenizer": "standard",
              "filter": ["lowercase", "my_stopwords"]
            }
          }
        }
      }
    }

    Test the custom analyzer

    GET /my_index/_analyze
    {
      "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
      "analyzer": "my_analyzer"
    }
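
    The order in which my_analyzer processes text (char filters first, then the tokenizer, then the token filters) can be traced with a small Python sketch. It only imitates the configuration and is not how Elasticsearch implements it: the helper names are hypothetical, the tag-stripping and tokenizing regexes are crude approximations, and the & rule is assumed to replace & with "and" with the whitespace around the rule trimmed. Output from a real cluster may differ, so verify it with the _analyze request above.

```python
import re


def html_strip(text):
    """Crude stand-in for the html_strip char filter: drop anything that
    looks like a tag (the real filter also decodes HTML entities)."""
    return re.sub(r"<[^>]+>", " ", text)


def ampersand_to_and(text, replacement="and"):
    """Stand-in for the &_to_and mapping char filter. The replacement is
    assumed to be "and" with surrounding whitespace trimmed."""
    return text.replace("&", replacement)


def my_analyzer(text, stopwords=("the", "a")):
    # 1. char filters run on the raw text, in the order they are listed
    text = ampersand_to_and(html_strip(text))
    # 2. tokenizer (regex approximation of the standard tokenizer)
    tokens = re.findall(r"\w+", text)
    # 3. token filters: lowercase first, then the my_stopwords list
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
```

    Under these assumptions the sketch returns tomandjerry, are, friend, in, house, haha: the <a> tag and the punctuation disappear, HAHA is lowercased, and only the and a are dropped because the custom stop list contains nothing else.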

       

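    Finally, apply the custom analyzer to a field through the mapping. The my_type segment in the path is for Elasticsearch versions that support mapping types; from 7.0 on, types are deprecated and the path is /my_index/_mapping.

    PUT /my_index/_mapping/my_type
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }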

  • Original article: https://www.cnblogs.com/liuqianli/p/8475474.html