24. Implementing index-time search suggestions with ngram tokenization

I. How ngrams and index-time search suggestions work

1. What is an ngram

Take the word quick. Its ngrams at each of the five possible lengths are:
ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick
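
The listing above can be reproduced with a sliding window over the word. This is only an illustrative sketch; the helper name `ngrams` is made up for this example and is not an Elasticsearch API.

```python
def ngrams(word, n):
    """Return every contiguous substring of `word` that has length `n`."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Print the ngrams of "quick" for lengths 1 through 5.
for n in range(1, 6):
    print(f"ngram length={n}:", " ".join(ngrams("quick", n)))
```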

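What is an edge ngram? It is an ngram anchored at the first letter of the word. For example, the word quick is split as follows:
- q
- qu
- qui
- quic
- quick
With edge ngrams, every word is split further into its prefixes, and those prefixes are what implement prefix-based search suggestions. At search time there is no longer any need to take a prefix and scan the whole inverted index; the prefix is simply looked up in the inverted index as a term, and once it matches, no further scanning is done. This works much like a match full-text query.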
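
To make the "look up instead of scan" point concrete, here is a minimal sketch of a toy inverted index built from edge ngrams. The helper names and the three sample documents are invented for illustration, and a plain whitespace split stands in for a real tokenizer.

```python
from collections import defaultdict


def edge_ngrams(word):
    """Return every prefix of `word`, shortest first."""
    return [word[:i] for i in range(1, len(word) + 1)]


def build_index(docs):
    """Map each edge ngram to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for gram in edge_ngrams(word):
                index[gram].add(doc_id)
    return index


# Made-up sample data.
docs = {1: "quick brown fox", 2: "quiet night", 3: "brown bear"}
index = build_index(docs)

# A prefix is now an ordinary term: one dictionary lookup, no scan over all terms.
print(sorted(index.get("qui", set())))  # [1, 2]
print(sorted(index.get("br", set())))   # [1, 3]
```

The price of this design is a larger index, since every word contributes one term per prefix.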

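2. What is index-time
Index-time search suggestion means that the inverted index used for suggestions is built when the documents are indexed, so nothing has to be derived from the prefix at search time.

min ngram = 1 is the minimum number of letters in a generated token; for hello the shortest token is h.
max ngram = 3 is the maximum number of letters in a generated token; hello is split only as far as hel and no further, so hell is never produced.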
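
The effect of the two bounds can be sketched as follows. The function and its parameter names mirror the `min_gram` and `max_gram` settings but are otherwise hypothetical; this is not how Elasticsearch implements the filter.

```python
def edge_ngrams(word, min_gram=1, max_gram=3):
    """Return the prefixes of `word` whose length lies in [min_gram, max_gram]."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("hello"))        # ['h', 'he', 'hel']
print(edge_ngrams("hello", 2, 4))  # ['he', 'hel', 'hell']
print(edge_ngrams("hi"))           # ['h', 'hi']
```

Under this sketch, a prefix longer than max_gram (such as hell when max_gram is 3) is never stored as a term, so the upper bound should be at least as long as the longest prefix users are expected to type.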

II. Experiment
1. Create the index

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
2. Inspect the analyzer output

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}
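
As a rough idea of what this request should return, the sketch below simulates the `autocomplete` analyzer in Python. It is an approximation under stated assumptions: a whitespace split stands in for the standard tokenizer, the function name is invented, and each ngram is assumed to keep the position of the word it came from.

```python
def autocomplete_analyze(text, min_gram=1, max_gram=20):
    """Approximate the analyzer chain: split, lowercase, then edge ngrams.

    Returns (token, position) pairs, where position is the index of the
    source word in the text.
    """
    tokens = []
    for position, word in enumerate(text.lower().split()):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append((word[:n], position))
    return tokens


for token, position in autocomplete_analyze("quick brown"):
    print(position, token)
```

For "quick brown" this yields q, qu, qui, quic, quick at position 0 and b, br, bro, brow, brown at position 1.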

3. Add the mapping for the field to be searched

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}

4. Run the suggestion search

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

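With match, documents that contain only hello are returned as well, because it is a full-text query; they simply score lower.
match_phrase is the recommended choice: it requires every term to be present and their positions to be exactly 1 apart, which is the behavior we want here.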
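
The difference between the two queries can be illustrated with a small simulation. It is a sketch under several assumptions: the function names and sample titles are invented, a whitespace split stands in for the standard tokenizer on both the index side and the search side, and scoring is not modeled at all, only whether a document matches.

```python
def index_terms(text, max_gram=20):
    """Index-time analysis: map each edge ngram to the word positions where it occurs."""
    postings = {}
    for position, word in enumerate(text.lower().split()):
        for n in range(1, min(max_gram, len(word)) + 1):
            postings.setdefault(word[:n], set()).add(position)
    return postings


def match(postings, query):
    """match with the default OR operator: at least one query term is present."""
    return any(term in postings for term in query.lower().split())


def match_phrase(postings, query):
    """match_phrase: every query term is present, at consecutive positions."""
    terms = query.lower().split()
    if not terms or any(term not in postings for term in terms):
        return False
    return any(
        all(start + offset in postings[term] for offset, term in enumerate(terms))
        for start in postings[terms[0]]
    )


# Made-up sample titles.
for title in ["hello world", "hello", "world hello"]:
    postings = index_terms(title)
    print(title, match(postings, "hello w"), match_phrase(postings, "hello w"))
```

In this simulation all three titles satisfy match, but only "hello world" satisfies match_phrase, because it is the only one where a term starting with w directly follows hello.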

Original article: https://www.cnblogs.com/liuqianli/p/8527557.html