Elasticsearch: The Difference Between the edge_ngram and ngram Tokenizers

1 year ago
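edge_ngram and ngram are two tokenizers built into Elasticsearch, and they come up often when defining index settings and mappings. Once the gram lengths (the "step size", set through min_gram and max_gram) are configured, the tokenizer can be assigned directly to the tokenizer field of an analyzer.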
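As a rough illustration of that configuration step, here is a minimal sketch that builds the body of an index-creation request as a Python dict. The names `my_gram_tokenizer` and `my_gram_analyzer` are hypothetical, and min_gram 1 / max_gram 3 are assumptions chosen to be consistent with the token output shown later in this post. In recent Elasticsearch versions the ngram tokenizer also limits max_gram minus min_gram through the `index.max_ngram_diff` setting (default 1), so the sketch raises it for ngram only.

```python
import json


def build_index_settings(tokenizer_type, min_gram=1, max_gram=3):
    """Return index settings that wire an n-gram tokenizer into a custom analyzer.

    The tokenizer and analyzer names are hypothetical placeholders.
    """
    if tokenizer_type not in ("edge_ngram", "ngram"):
        raise ValueError("tokenizer_type must be 'edge_ngram' or 'ngram'")

    settings = {
        "analysis": {
            "tokenizer": {
                "my_gram_tokenizer": {
                    "type": tokenizer_type,
                    "min_gram": min_gram,
                    "max_gram": max_gram,
                }
            },
            "analyzer": {
                "my_gram_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_gram_tokenizer",
                }
            },
        }
    }

    # Assumption: a recent Elasticsearch version, where the ngram tokenizer
    # rejects max_gram - min_gram greater than index.max_ngram_diff.
    if tokenizer_type == "ngram":
        settings["index"] = {"max_ngram_diff": max_gram - min_gram}

    return {"settings": settings}


print(json.dumps(build_index_settings("edge_ngram"), indent=2))
```

The resulting dict would be sent as the JSON body when creating the index; how it is sent (REST call, client library) is left out on purpose.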
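Here we use the same input string for both tokenization examples:
字符串 (the Chinese word for "string")
1. The edge_ngram tokenizer produces the following tokens: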
  {
    "tokens": [
      {
        "token": "字",
        "start_offset": 0,
        "end_offset": 1,
        "type": "word",
        "position": 0
      },
      {
        "token": "字符",
        "start_offset": 0,
        "end_offset": 2,
        "type": "word",
        "position": 1
      },
      {
        "token": "字符串",
        "start_offset": 0,
        "end_offset": 3,
        "type": "word",
        "position": 2
      }
    ]
  }
2. The ngram tokenizer produces the following tokens:
  {
    "tokens": [
      {
        "token": "字",
        "start_offset": 0,
        "end_offset": 1,
        "type": "word",
        "position": 0
      },
      {
        "token": "字符",
        "start_offset": 0,
        "end_offset": 2,
        "type": "word",
        "position": 1
      },
      {
        "token": "字符串",
        "start_offset": 0,
        "end_offset": 3,
        "type": "word",
        "position": 2
      },
      {
        "token": "符",
        "start_offset": 1,
        "end_offset": 2,
        "type": "word",
        "position": 3
      },
      {
        "token": "符串",
        "start_offset": 1,
        "end_offset": 3,
        "type": "word",
        "position": 4
      },
      {
        "token": "串",
        "start_offset": 2,
        "end_offset": 3,
        "type": "word",
        "position": 5
      }
    ]
  }
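The difference is clear at a glance. Put simply, the edge_ngram tokenizer always starts at the first character and emits progressively longer prefixes, one gram length at a time, up to max_gram or the end of the text. The ngram tokenizer does the same thing from every character position, not only the first, so it emits every substring whose length falls between min_gram and max_gram.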
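The two behaviors described above can be sketched in a few lines of plain Python. This is a simplified model, not Elasticsearch's implementation: it ignores token_chars, offsets, and positions, and it assumes min_gram 1 and max_gram 3, which is consistent with the token lists shown earlier.

```python
def edge_ngrams(text, min_gram=1, max_gram=3):
    """Prefixes of `text` with lengths min_gram..max_gram, anchored at index 0."""
    longest = min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, longest + 1)]


def ngrams(text, min_gram=1, max_gram=3):
    """Substrings of length min_gram..max_gram taken from every start position."""
    grams = []
    for start in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if start + n <= len(text):
                grams.append(text[start:start + n])
    return grams


print(edge_ngrams("字符串"))  # ['字', '字符', '字符串']
print(ngrams("字符串"))       # ['字', '字符', '字符串', '符', '符串', '串']
```

The only structural difference between the two functions is the outer loop over start positions, which is exactly what separates ngram from edge_ngram.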
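What about practical use? If a match must begin at the first character (prefix matching, as in search-as-you-type), edge_ngram is the natural choice. If you need to match a substring anywhere in the text, ngram is the better fit.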
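To make that choice concrete, the following sketch checks whether a query term would be found among the grams indexed for a piece of text. It is a toy model under stated assumptions: the query is treated as one exact term (as if the search side used a non-splitting analyzer), min_gram 1 and max_gram 3 are assumed, and `index_terms` and `term_matches` are hypothetical helper names.

```python
def index_terms(text, tokenizer, min_gram=1, max_gram=3):
    """Set of grams a simplified edge_ngram or ngram tokenizer would index."""
    if tokenizer not in ("edge_ngram", "ngram"):
        raise ValueError("tokenizer must be 'edge_ngram' or 'ngram'")
    # edge_ngram only starts at index 0; ngram starts at every index.
    starts = [0] if tokenizer == "edge_ngram" else range(len(text))
    return {
        text[s:s + n]
        for s in starts
        for n in range(min_gram, max_gram + 1)
        if s + n <= len(text)
    }


def term_matches(query, text, tokenizer):
    """True if the exact query term is one of the indexed grams."""
    return query in index_terms(text, tokenizer)


print(term_matches("字符", "字符串", "edge_ngram"))  # True: a prefix
print(term_matches("符串", "字符串", "edge_ngram"))  # False: not a prefix
print(term_matches("符串", "字符串", "ngram"))       # True: an inner substring
```

The trade-off is index size: ngram indexes many more terms per value than edge_ngram, so it is usually reserved for cases where substring matching is really needed.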

Original article: https://www.cnblogs.com/frankltf/p/13986940.html