  • Elasticsearch Analyzers Explained

    I. Analyzers

    1. Purpose: ① tokenization (splitting text into terms)

          ② normalization (improves recall: the fraction of relevant documents that a search can actually find)
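As an illustration of the recall metric mentioned above (plain Python, not ES code; the sets here are hypothetical):

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant documents that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant)

# Normalization (lowercasing, stemming, ...) lets a query match more of the
# relevant documents, raising this ratio: here 2 of 3 relevant docs are found.
print(recall({"doc1", "doc2"}, {"doc1", "doc2", "doc3"}))
```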

    2. Analyzer pipeline (three stages)

    ① character filter: preprocessing before tokenization (strips useless characters and tags, and converts text, e.g. & => and, 《Elasticsearch》 => Elasticsearch)

      A. HTML Strip Character Filter (type: html_strip)

        escaped_tags: HTML tags to keep

    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_char_filter": {
              "type": "html_strip",
              "escaped_tags": ["a"]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": ["my_char_filter"]
            }
          }
        }
      }
    }
    Test the analyzer:

      GET my_index/_analyze
      {
        "analyzer": "my_analyzer",
        "text": "liuyucheng <a><b>edu</b></a>"
      }

     

      B. Mapping Character Filter (type: mapping)

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "keyword",
              "char_filter": [
                "my_char_filter"
              ]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "٠ => 0",
                "١ => 1",
                "٢ => 2",
                "٣ => 3",
                "٤ => 4",
                "٥ => 5",
                "٦ => 6",
                "٧ => 7",
                "٨ => 8",
                "٩ => 9"
              ]
            }
          }
        }
      }
    }
    Test the analyzer:

    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My license plate is ٢٥٠١٥"
    }
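The same digit mapping can be sketched outside ES with Python's `str.translate` (illustrative only, not how ES implements the filter):

```python
# Map Arabic-Indic digits to ASCII digits, mirroring the mapping char filter above.
ARABIC_TO_ASCII = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

text = "My license plate is ٢٥٠١٥"
print(text.translate(ARABIC_TO_ASCII))  # My license plate is 25015
```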

      C. Pattern Replace Character Filter: regex replacement (type: pattern_replace)

    PUT my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "char_filter": ["my_char_filter"]
            }
          },
          "char_filter": {
            "my_char_filter": {
              "type": "pattern_replace",
              "pattern": "(\\d+)-(?=\\d)",
              "replacement": "$1_"
            }
          }
        }
      }
    }
    Test the analyzer:

    POST my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "My credit card is 123-456-789"
    }
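The same substitution can be checked with Python's `re.sub`, using the identical pattern (illustrative sketch, not the ES implementation):

```python
import re

# Same regex as the pattern_replace char filter above: a digit run followed
# by "-" and another digit becomes the run plus "_".
print(re.sub(r"(\d+)-(?=\d)", r"\1_", "My credit card is 123-456-789"))
# My credit card is 123_456_789
```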

    ② tokenizer: splits the character stream into individual tokens

    ③ token filter: tense normalization, case conversion, synonym expansion, stopword handling, etc.

            e.g. has => have, him => he, apples => apple, the/oh/a => dropped
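A minimal sketch of such a token-filter stage (plain Python with hard-coded example mappings, not ES internals):

```python
# Toy token filter: stem a few hard-coded forms and drop stopwords,
# mirroring the examples above (has=>have, apples=>apple, the/oh/a dropped).
STEMS = {"has": "have", "him": "he", "apples": "apple"}
STOPWORDS = {"the", "oh", "a"}

def filter_tokens(tokens):
    return [STEMS.get(t, t) for t in tokens if t not in STOPWORDS]

print(filter_tokens(["the", "boy", "has", "apples"]))  # ['boy', 'have', 'apple']
```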

      A. Case: lowercase token filter

    GET _analyze
    {
      "tokenizer" : "standard",
      "filter" : ["lowercase"],
      "text" : "THE Quick FoX JUMPs"
    }
    
    GET /_analyze
    {
      "tokenizer": "standard",
      "filter": [
        {
          "type": "condition",
          "filter": [ "lowercase" ],
          "script": {
            "source": "token.getTerm().length() < 5"
          }
        }
      ],
      "text": "THE QUICK BROWN FOX"
    }
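The conditional filter above lowercases only tokens shorter than 5 characters. A rough Python sketch of that condition (illustrative, not the ES implementation):

```python
# Lowercase a token only when it is shorter than max_len characters,
# mirroring the conditional token filter's script above.
def conditional_lowercase(tokens, max_len=5):
    return [t.lower() if len(t) < max_len else t for t in tokens]

print(conditional_lowercase("THE QUICK BROWN FOX".split()))
# ['the', 'QUICK', 'BROWN', 'fox']
```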

      B. Stopwords: the stopwords parameter

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer":{
              "type":"standard",
              "stopwords":"_english_"
            }
          }
        }
      }
    }
    GET my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "Teacher Ma is in the restroom"
    }

      C. Tokenizer: standard

    GET /my_index/_analyze
    {
      "text": "江山如此多娇,小姐姐哪里可以撩",
      "analyzer": "standard"
    }
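The standard analyzer has no Chinese word segmentation: it emits one token per Han character (which is why a dedicated Chinese analyzer like IK, covered below, is needed). A rough sketch of that unigram behavior:

```python
import re

# Approximate the standard analyzer's behavior on CJK text: each Han
# character becomes its own token; punctuation is dropped.
def cjk_unigrams(text):
    return re.findall(r"[\u4e00-\u9fff]", text)

print(cjk_unigrams("江山如此多娇,小姐姐哪里可以撩"))
```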

      D. Custom analyzer: set type to custom to tell Elasticsearch we are defining a custom analyzer. Compare this with how built-in analyzers are configured: there, type is set to the name of the built-in analyzer, such as standard or simple.

    PUT /test_analysis
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "test_char_filter": {
              "type": "mapping",
              "mappings": [
                "& => and",
                "| => or"
              ]
            }
          },
          "filter": {
            "test_stopwords": {
              "type": "stop",
              "stopwords": ["is","in","at","the","a","for"]
            }
          },
          "tokenizer": {
            "punctuation": { 
              "type": "pattern",
              "pattern": "[ .,!?]"
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": [
                "html_strip",
                "test_char_filter"
              ],
              "tokenizer": "standard",
              "filter": ["lowercase","test_stopwords"]
            }
          }
        }
      }
    }
    
    GET /test_analysis/_analyze
    {
      "text": "Teacher ma & zhang also thinks [mother's friends] is good | nice!!!",
      "analyzer": "my_analyzer"
    }
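The whole custom pipeline above can be sketched end-to-end in plain Python (an approximation for intuition only; the tokenizer regex is a rough stand-in for ES's standard tokenizer):

```python
import re

# Illustrative sketch of the custom analyzer above: mapping char filter,
# a standard-like tokenizer, lowercase filter, then stopword removal.
MAPPINGS = {"&": "and", "|": "or"}
STOPWORDS = {"is", "in", "at", "the", "a", "for"}

def my_analyzer(text):
    for src, dst in MAPPINGS.items():
        text = text.replace(src, dst)               # char filter
    tokens = re.findall(r"\w+(?:'\w+)?", text)      # rough tokenizer
    tokens = [t.lower() for t in tokens]            # lowercase filter
    return [t for t in tokens if t not in STOPWORDS]  # stop filter

print(my_analyzer("Teacher ma & zhang is good | nice!!!"))
# ['teacher', 'ma', 'and', 'zhang', 'good', 'or', 'nice']
```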

      E. Specify the analyzer when creating a mapping

    PUT /test_analysis/_mapping/my_type
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }

     II. Chinese Analyzers

    (1) Chinese analyzers:

      ① IK analyzer: the ES installation path must not contain Chinese characters or spaces

      1) Download: https://github.com/medcl/elasticsearch-analysis-ik

      2) Create the plugin folder: cd your-es-root/plugins/ && mkdir ik

      3) Unpack the plugin into your-es-root/plugins/ik

      4) Restart ES

      ② Two analyzers

      1) ik_max_word: fine-grained (produces the most exhaustive set of terms)

      2) ik_smart: coarse-grained (produces fewer, longer terms)

      ③ IK file layout

      1) IKAnalyzer.cfg.xml: IK analyzer configuration file

      2) main.dic: main dictionary

      3) stopword.dic: English stopwords; these are not added to the inverted index

      4) Special dictionaries:

    1. quantifier.dic: units of measurement, etc.
    2. suffix.dic: suffixes
    3. surname.dic: Chinese surnames
    4. preposition.dic: function words and modal particles

      5) Custom dictionaries: e.g. trending slang terms such as 857, emmm..., 渣女, 舔屏, 996

      6) Hot updates:

    1. Modify the IK analyzer source code, or
    2. Use IK's native hot-update mechanism: deploy a web server exposing an HTTP endpoint that serves the dictionary, and signal word-list changes via the Last-Modified and ETag HTTP response headers
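The client-side reload decision can be sketched as follows (illustrative Python; IK itself is Java, but the logic is the same: re-fetch the remote dictionary when either header changes):

```python
# Decide whether to reload a remote dictionary based on the HTTP response
# headers used by IK's hot-update scheme: Last-Modified and ETag.
def should_reload(old_headers: dict, new_headers: dict) -> bool:
    return (old_headers.get("Last-Modified") != new_headers.get("Last-Modified")
            or old_headers.get("ETag") != new_headers.get("ETag"))

old = {"Last-Modified": "Mon, 01 Jan 2024 00:00:00 GMT", "ETag": "v1"}
new = {"Last-Modified": "Mon, 01 Jan 2024 00:00:00 GMT", "ETag": "v2"}
print(should_reload(old, new))  # True: the ETag changed, so the word list is re-fetched
```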
  • Original article: https://www.cnblogs.com/lyc-code/p/13686642.html