  • Multi-Field Feature and Configuring a Custom Analyzer in the Mapping

    Startup error: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried

    # Cause: another Elasticsearch process is already running and holds the node lock
    # Kill the old process, then start Elasticsearch again
    kill -9 `ps -ef | grep [e]lasticsearch | grep [j]ava | awk '{print $2}'`
    elasticsearch
    

    Multi-Field Feature

    Exact matching on a vendor name
    Add a keyword sub-field
    Use different analyzers per sub-field
    For different languages
    For searching on a pinyin sub-field
    Different analyzers can also be specified for search and for indexing (see the mapping sketch below)
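
    A minimal sketch of such a mapping, assuming an index named products; the pinyin analyzer requires the separately installed elasticsearch-analysis-pinyin plugin, while english is a built-in language analyzer

    PUT products
    {
      "mappings": {
        "properties": {
          "company": {
            "type": "text",
            "fields": {
              "keyword": { "type": "keyword" },
              "english": { "type": "text", "analyzer": "english" },
              "pinyin": { "type": "text", "analyzer": "pinyin" }
            }
          }
        }
      }
    }

    A query can then target company.keyword for an exact match on the vendor name. A field can also declare a search_analyzer to analyze queries differently from indexed documents.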

    Exact Values vs. Full Text

    Exact Values: numbers, dates, and literal strings (e.g. "Apple Store")
    In Elasticsearch: the keyword type
    Full Text: unstructured text data
    In Elasticsearch: the text type

    Exact Values Do Not Need Analysis

    Elasticsearch creates an inverted index for every field
    Exact values need no special analysis at index time; the whole value is indexed as a single term (compare the two requests below)
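
    To see the difference, compare the keyword analyzer (which keyword fields use) with the standard analyzer on the same input

    GET _analyze
    {
      "analyzer": "keyword",
      "text": "Apple Store"
    }

    GET _analyze
    {
      "analyzer": "standard",
      "text": "Apple Store"
    }

    The first request returns "Apple Store" as a single token; the second splits it into the two tokens "apple" and "store".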

    Custom Analyzers

    When Elasticsearch's built-in analyzers do not meet your needs, you can define a custom analyzer by combining different components:
    Character Filter
    Tokenizer
    Token Filter
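
    As a preview, here is a sketch of how these components could be combined into a custom analyzer in the index settings; the index name my_index, the analyzer name my_analyzer, and the emoticons filter name are made up for illustration

    PUT my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "emoticons": {
              "type": "mapping",
              "mappings": [":) => happy", ":( => sad"]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": ["html_strip", "emoticons"],
              "tokenizer": "standard",
              "filter": ["lowercase", "stop"]
            }
          }
        }
      }
    }

    The analyzer can then be referenced by name in a field mapping ("analyzer": "my_analyzer") or tested with GET my_index/_analyze. The sections below try out each component type individually.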

    Character Filter

    Processes the text before it reaches the Tokenizer, e.g. to add, remove, or replace characters. Multiple Character Filters can be configured. They can change the position and offset information seen by the Tokenizer.
    Some built-in Character Filters:
    HTML strip - removes HTML tags
    Mapping - string replacement
    Pattern replace - regex-based replacement

    Tokenizer

    Splits the original text into terms (tokens) according to certain rules
    Built-in Tokenizers include:
    whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy (uax_url_email is tried out below)
    Custom Tokenizers can be implemented as Java plugins
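
    Of these, uax_url_email is useful when the text contains URLs or e-mail addresses, which the standard tokenizer would break apart; a quick test (the sample text is made up)

    POST _analyze
    {
      "tokenizer": "uax_url_email",
      "text": "write to elastic@example.com or visit https://www.elastic.co"
    }

    The address and the URL each come back as a single token, with types <EMAIL> and <URL>.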

    Token Filters

    Adds, modifies, or deletes the terms produced by the Tokenizer
    Built-in Token Filters include:
    lowercase / stop / synonym (adds synonyms; see the sketch below)
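
    The lowercase and stop filters are demonstrated later in this section; the synonym filter is not, so here is a minimal sketch with an inline synonym definition (the word pair is made up)

    GET _analyze
    {
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        {
          "type": "synonym",
          "synonyms": ["quick, fast"]
        }
      ],
      "text": "The quick fox"
    }

    "fast" is emitted as an extra token at the same position as "quick", so either word will match at query time.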

    Setting Up a Custom Analyzer

    Submit a request that strips HTML tags

    POST _analyze
    {
      "tokenizer": "keyword",
      "char_filter": ["html_strip"],
      "text":"<b>hello world</b>"
    }
    

    Response

    {
      "tokens" : [
        {
          "token" : "hello world",
          "start_offset" : 3,
          "end_offset" : 18,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    

    Use a char filter to replace hyphens with underscores

    POST _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        {
          "type": "mapping",
          "mappings": ["- => _"]
        }
      ],
      "text": "123-456, I-test! test-990 650-555-1234"
    }
    

    Result

    {
      "tokens" : [
        {
          "token" : "123_456",
          "start_offset" : 0,
          "end_offset" : 7,
          "type" : "<NUM>",
          "position" : 0
        },
        {
          "token" : "I_test",
          "start_offset" : 9,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "test_990",
          "start_offset" : 17,
          "end_offset" : 25,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "650_555_1234",
          "start_offset" : 26,
          "end_offset" : 38,
          "type" : "<NUM>",
          "position" : 3
        }
      ]
    }
    

    Use a char filter to replace emoticons

    POST _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      ],
      "text": ["I am feeling :)", "Feeling :( today"]
    }
    

    Response. Note the jump from position 3 to 104: when "text" is a JSON array, Elasticsearch inserts a position_increment_gap (100 by default) between the array entries.

    {
      "tokens" : [
        {
          "token" : "I",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "am",
          "start_offset" : 2,
          "end_offset" : 4,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "felling",
          "start_offset" : 5,
          "end_offset" : 12,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "happy",
          "start_offset" : 13,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "Feeling",
          "start_offset" : 16,
          "end_offset" : 23,
          "type" : "<ALPHANUM>",
          "position" : 104
        },
        {
          "token" : "sad",
          "start_offset" : 24,
          "end_offset" : 26,
          "type" : "<ALPHANUM>",
          "position" : 105
        },
        {
          "token" : "today",
          "start_offset" : 27,
          "end_offset" : 32,
          "type" : "<ALPHANUM>",
          "position" : 106
        }
      ]
    }
    

    Regular expression replacement

    GET _analyze
    {
      "tokenizer": "standard",
      "char_filter": [
        {
          "type": "pattern_replace",
          "pattern": "http://(.*)",
          "replacement": "$1"
        }
      ],
      "text": "http://www.elastic.co"
    }
    

    Result (note that end_offset is 21: offsets still refer to the original text, before the char filter ran)

      "tokens" : [
        {
          "token" : "www.elastic.co",
          "start_offset" : 0,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 0
        }
      ]
    }
    

    Splitting a path by directory hierarchy

    POST _analyze
    {
      "tokenizer": "path_hierarchy",
      "text": "/usr/ymruan/a/b"
    }
    

    Result

    {
      "tokens" : [
        {
          "token" : "/usr",
          "start_offset" : 0,
          "end_offset" : 4,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "/usr/ymruan",
          "start_offset" : 0,
          "end_offset" : 11,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "/usr/ymruan/a",
          "start_offset" : 0,
          "end_offset" : 13,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "/usr/ymruan/a/b",
          "start_offset" : 0,
          "end_offset" : 15,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    

    The whitespace tokenizer with the stop filter

    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["stop"],
      "text": ["The rain in Spain falls mainly on the plain."]
    }
    

    Result

    {
      "tokens" : [
        {
          "token" : "The",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "rain",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "Spain",
          "start_offset" : 12,
          "end_offset" : 17,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "falls",
          "start_offset" : 18,
          "end_offset" : 23,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "mainly",
          "start_offset" : 24,
          "end_offset" : 30,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "plain.",
          "start_offset" : 38,
          "end_offset" : 44,
          "type" : "word",
          "position" : 8
        }
      ]
    }
    

    After adding lowercase, "The" is lowercased before the stop filter runs, so it is now removed as a stopword

    GET _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["lowercase","stop"],
      "text": ["The rain in Spain falls mainly on the plain."]
    }
    

    Result

    {
      "tokens" : [
        {
          "token" : "rain",
          "start_offset" : 4,
          "end_offset" : 8,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "spain",
          "start_offset" : 12,
          "end_offset" : 17,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "falls",
          "start_offset" : 18,
          "end_offset" : 23,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "mainly",
          "start_offset" : 24,
          "end_offset" : 30,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "plain.",
          "start_offset" : 38,
          "end_offset" : 44,
          "type" : "word",
          "position" : 8
        }
      ]
    }
    

    TODO: watch the 10-minute video in chapter 20, then add more notes
