  • Elasticsearch: the difference between the edge_ngram and ngram tokenizers

    Elasticsearch at a glance: the difference between the edge_ngram and ngram tokenizers
    1 year ago
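    edge_ngram and ngram are two tokenizers built into Elasticsearch, and they come up often when defining an index's analysis settings and mappings. Once the gram lengths (min_gram and max_gram) are configured, either one can be assigned directly as the tokenizer of an analyzer.
    Throughout this post, the same three-character string is used as the tokenization example: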
    字符串
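
As a rough sketch of the configuration described above, the following Python builds the JSON bodies one might send to Elasticsearch: an index-settings body that wires a tokenizer into a custom analyzer, and a body for the `_analyze` API. The names `my_tokenizer` and `my_analyzer`, and the choice of `min_gram=1` / `max_gram=3`, are assumptions made for illustration; the original post does not give its settings. The `index.max_ngram_diff` entry is added on the assumption of a recent Elasticsearch version, where the ngram tokenizer limits the gap between `min_gram` and `max_gram` to 1 by default.

```python
import json


def index_settings(tokenizer_type, min_gram=1, max_gram=3):
    """Build an index-settings body with one custom analyzer.

    tokenizer_type is "edge_ngram" or "ngram". The names my_tokenizer and
    my_analyzer are hypothetical placeholders, not from the original post.
    """
    settings = {
        "analysis": {
            "tokenizer": {
                "my_tokenizer": {
                    "type": tokenizer_type,
                    "min_gram": min_gram,
                    "max_gram": max_gram,
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer",
                }
            },
        }
    }
    # Assumption: recent versions cap (max_gram - min_gram) at 1 for the
    # ngram tokenizer unless index.max_ngram_diff is raised.
    if tokenizer_type == "ngram" and max_gram - min_gram > 1:
        settings["index"] = {"max_ngram_diff": max_gram - min_gram}
    return {"settings": settings}


def analyze_body(text, analyzer="my_analyzer"):
    """Build a request body for the _analyze API of an index."""
    return {"analyzer": analyzer, "text": text}


if __name__ == "__main__":
    print(json.dumps(index_settings("edge_ngram"), indent=2))
    print(json.dumps(index_settings("ngram"), indent=2))
    print(json.dumps(analyze_body("字符串"), ensure_ascii=False))
```

The settings body would go in the index-creation request, and the second body in a request to that index's `_analyze` endpoint, which returns token lists of the kind shown below.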

    The edge_ngram tokenizer produces the following tokens:
    {
      "tokens": [
        {
          "token": "字",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0
        },
        {
          "token": "字符",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 1
        },
        {
          "token": "字符串",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 2
        }
      ]
    }
    The ngram tokenizer produces the following tokens:
    {
      "tokens": [
        {
          "token": "字",
          "start_offset": 0,
          "end_offset": 1,
          "type": "word",
          "position": 0
        },
        {
          "token": "字符",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 1
        },
        {
          "token": "字符串",
          "start_offset": 0,
          "end_offset": 3,
          "type": "word",
          "position": 2
        },
        {
          "token": "符",
          "start_offset": 1,
          "end_offset": 2,
          "type": "word",
          "position": 3
        },
        {
          "token": "符串",
          "start_offset": 1,
          "end_offset": 3,
          "type": "word",
          "position": 4
        },
        {
          "token": "串",
          "start_offset": 2,
          "end_offset": 3,
          "type": "word",
          "position": 5
        }
      ]
    }
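    The difference is clear at a glance. Put simply, the edge_ngram tokenizer starts only at the first character and emits progressively longer tokens, growing one character at a time within the configured gram lengths, until it reaches the end of the text. The ngram tokenizer does the same thing, but it starts from every character position in turn, not just the first.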
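
The rule can be sketched in plain Python as a simplified model of the two tokenizers, not Elasticsearch's actual implementation. It assumes `min_gram=1` and `max_gram=3`, which is consistent with the outputs above but not stated in the original post, and it ignores options such as `token_chars`.

```python
def edge_ngram(text, min_gram=1, max_gram=3):
    """Grams anchored at the first character only."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def ngram(text, min_gram=1, max_gram=3):
    """Grams starting at every character position, in order of start offset."""
    tokens = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if start + n > len(text):
                break  # the gram would run past the end of the text
            tokens.append(text[start:start + n])
    return tokens


if __name__ == "__main__":
    print(edge_ngram("字符串"))  # ['字', '字符', '字符串']
    print(ngram("字符串"))       # ['字', '字符', '字符串', '符', '符串', '串']
```

Both lists come out in the same order as the tokens in the two responses above.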
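    As for when to use which: if a match must begin at the first character (prefix matching, as in search-as-you-type), edge_ngram is the natural choice; if the query should match any substring within the text, ngram is the better fit.
    Original link: https://blog.csdn.net/Frankltf/article/details/109734447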
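
To illustrate the trade-off, here is a small sketch that models a lookup as "does the query string appear among the indexed grams". It is a simplification under stated assumptions: the query is not itself broken into grams (as it would not be with a separate search-time analyzer), gram lengths are 1 to 3, and the helper names `grams` and `matches` are hypothetical.

```python
def grams(text, min_gram=1, max_gram=3, edge_only=False):
    """Set of grams indexed for text; edge_only=True mimics edge_ngram."""
    starts = [0] if edge_only else range(len(text))
    return {
        text[s:s + n]
        for s in starts
        for n in range(min_gram, max_gram + 1)
        if s + n <= len(text)
    }


def matches(query, indexed_text, edge_only):
    """True if the unanalyzed query equals one of the indexed grams."""
    return query in grams(indexed_text, edge_only=edge_only)


if __name__ == "__main__":
    # A prefix query is found with either tokenizer.
    print(matches("字符", "字符串", edge_only=True))   # True
    print(matches("字符", "字符串", edge_only=False))  # True
    # A query from the middle of the text is found only with ngram.
    print(matches("符串", "字符串", edge_only=True))   # False
    print(matches("符串", "字符串", edge_only=False))  # True
```

The cost of the extra flexibility shows up in the token counts: for this three-character string, ngram indexes six tokens where edge_ngram indexes three.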

  • Original address: https://www.cnblogs.com/nizuimeiabc1/p/14749025.html