Elasticsearch in Depth, Part 3

What an analyzer is made of, and an introduction to the built-in analyzers

1. What is an analyzer

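Splitting text into terms, plus normalization (to improve recall)

Given a sentence, an analyzer splits it into individual words and normalizes each word (tense conversion, singular/plural conversion)
recall: at search time, increasing the number of results that a search is able to find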

character filter: preprocesses the text before it is tokenized; most commonly this means stripping HTML tags (<span>hello</span> --> hello) or mapping & --> and (I&you --> I and you)
tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
token filter: lowercase, stop word, synonym; dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little

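The analyzer matters a great deal: it runs a piece of text through all of this processing, and only the final result is used to build the inverted index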
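
The three stages above can be sketched in a few lines of Python. This is only a toy model, not how Elasticsearch implements analysis: the function names, the stop word set and the NORMALIZE lookup table are all made up for illustration, and a real token filter would use a stemmer and a synonym list rather than a hand-written dictionary.

```python
import re

# Hypothetical, hand-written tables for illustration only.
STOP_WORDS = {"a", "an", "the"}
NORMALIZE = {"dogs": "dog", "liked": "like", "mother": "mom", "small": "little"}


def char_filter(text):
    """Preprocess raw text: strip HTML tags and expand '&' to 'and'."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenize(text):
    """Split the filtered text on whitespace."""
    return text.split()


def token_filter(tokens):
    """Lowercase each token, drop stop words, apply the normalization table."""
    result = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        result.append(NORMALIZE.get(token, token))
    return result


def analyze(text):
    """Run the full pipeline: character filter, tokenizer, token filter."""
    return token_filter(tokenize(char_filter(text)))


print(analyze("<span>Tom</span> liked the small dogs & you"))
# ['tom', 'like', 'little', 'dog', 'and', 'you']
```

Only the terms that come out of the last stage would end up in the inverted index.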

2. Introduction to the built-in analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (this is the default; it lowercases terms, strips punctuation such as parentheses, and so on)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (an analyzer for a specific language, for example english): set, shape, semi, transpar, call, set_tran, 5
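
To make the difference between two of these outputs concrete, here is a rough Python approximation of the whitespace and simple analyzers. It assumes ASCII letters only and is not the real Lucene implementation; the function names are invented for this sketch.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def whitespace_tokens(text):
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()


def simple_tokens(text):
    """Approximate the simple analyzer: break at any non-letter, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]


print(whitespace_tokens(TEXT))
print(simple_tokens(TEXT))
```

The whitespace version keeps semi-transparent and set_trans(5) intact, while the simple version splits them and drops the digit, matching the lists above.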

Query string analysis, and resolving the question left over from the mapping introduction example

1. Query string analysis

The query string must be analyzed with the same analyzer that was used when the index was built
The query string is treated differently for exact value and full text fields

date: exact value
_all: full text

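Suppose we have a document with a field whose value is: hello you and me, and an inverted index is built from it
We want to search the index that holds this document, and the search text is hello me; this search text is the query string
By default, es analyzes the query string (tokenization and normalization) with the same analyzer that was used to build the inverted index for the corresponding field; only then can the search work correctly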

When the inverted index was built, dogs --> dog. If you then search for dogs unchanged, nothing will be found. So at search time dogs must also become dog before it can match.
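
The dogs --> dog problem can be reproduced with a toy inverted index. The sketch below is not Elasticsearch code; the analyze function, its one-entry NORMALIZE table and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical one-entry normalization table standing in for a real stemmer.
NORMALIZE = {"dogs": "dog"}


def analyze(text):
    """Lowercase, split on whitespace, and normalize each term."""
    return [NORMALIZE.get(term, term) for term in text.lower().split()]


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query_string, analyzer):
    """Return ids of documents matching any term of the analyzed query string."""
    hits = set()
    for term in analyzer(query_string):
        hits |= index.get(term, set())
    return hits


docs = {1: "hello you and me", 2: "I like dogs"}
index = build_index(docs)

# Same analyzer at index time and search time: "dogs" becomes "dog" and matches.
print(search(index, "dogs", analyze))
# Query string left un-normalized: the raw term "dogs" is not in the index.
print(search(index, "dogs", str.split))
```

The first search finds document 2 and the second finds nothing, which is why the query string has to go through the same analysis as the indexed field.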

Key point: fields of different types may be full text or exact value

post_date, date: exact value
_all: full text, tokenized and normalized

3. Testing an analyzer

GET /_analyze
{
"analyzer": "standard",
"text": "Text to analyze"
}

Core mapping data types and dynamic mapping

1. Core data types

string (text and keyword since 5.0)
byte, short, integer, long
float, double
boolean
date

2. Dynamic mapping

true or false --> boolean
123 --> long
123.45 --> double
2017-01-01 --> date
"hello world" --> string/text
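
The inference rules listed above can be mimicked with a small Python function. This is a simplified sketch under stated assumptions: it only recognizes dates in the yyyy-MM-dd form used in the example (Elasticsearch's own date detection accepts more formats), it maps every other string to text, and infer_type is a made-up name.

```python
from datetime import datetime


def infer_type(value):
    """Guess a field type from a JSON value, following the rules listed above."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            # Assumption: only the yyyy-MM-dd form counts as a date here.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"unsupported value: {value!r}")


for sample in (True, 123, 123.45, "2017-01-01", "hello world"):
    print(repr(sample), "-->", infer_type(sample))
```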

3. Viewing the mapping

GET /index/_mapping/type

Creating and modifying mappings manually, and controlling whether string fields are analyzed

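1. How a field is indexed (the index setting)

analyzed: analyze the value, then index it (full text)
not_analyzed: index the value as-is, without analysis (exact value)
no: do not index the field, so it cannot be searched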

2. Modifying the mapping

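A mapping can only be defined manually when the index is created, or extended by adding a mapping for a new field; the mapping of an existing field cannot be updated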

PUT /website
{
    "mappings":{
        "article":{
            "properties":{
                "author_id":{
                    "type":"long"
                },
                "title":{
                    "type":"text",
                    "analyzer":"english"
                },
                "content":{
                    "type":"text"
                },
                "post_date":{
                    "type":"date"
                },
                "publisher_id":{
                    "type":"keyword"
                }
            }
        }
    }
}

PUT /website
{
    "mappings":{
        "article":{
            "properties":{
                "author_id":{
                    "type":"text"
                }
            }
        }
    }
}

{
"error": {
"root_cause": [
{
"type": "index_already_exists_exception",
"reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
"index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
"index": "website"
}
],
"type": "index_already_exists_exception",
"reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
"index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
"index": "website"
},
"status": 400
}

PUT /website/_mapping/article
{
"properties" : {
"new_field" : {
"type" : "keyword"
}
}
}

3. Testing the mapping

GET /website/_analyze
{
"field": "content",
"text": "my-dogs"
}

GET /website/_analyze
{
"field": "new_field",
"text": "my dogs"
}

{
"error": {
"root_cause": [
{
"type": "remote_transport_exception",
"reason": "[4onsTYV][127.0.0.1:9300][indices:admin/analyze[s]]"
}
],
"type": "illegal_argument_exception",
"reason": "Can't process field [new_field], Analysis requests are only supported on tokenized fields"
},
"status": 400
}

Filter vs. query in depth: relevance and performance

1. Filter vs. query

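filter: only filters out the data that satisfies the search conditions; it computes no relevance score and has no effect on relevance
query: computes the relevance of each document to the search conditions and sorts by that relevance

In general, if you are searching and want the best-matching data returned first, use query; if you only want to select a subset of the data by some conditions and do not care about its order, use filter
Put differently, conditions for which a closer match should rank a document higher belong in query; conditions that should not influence document ranking can simply go in filter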

2. Filter vs. query performance

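filter: no relevance score is computed and no sorting by score is needed; in addition, the data for the most frequently used filters is cached automatically
query: the opposite; scores must be computed and results sorted by score, and the results cannot be cached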
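
One common way to combine the two is a bool query, with scoring conditions under must and non-scoring conditions under filter. The sketch below only builds the request body as a Python dict and sends nothing to a cluster; build_search_body is a hypothetical helper, and title and post_date are borrowed from the website mapping earlier in this post.

```python
import json


def build_search_body(keywords, since):
    """Build a bool query: 'must' contributes to the score, 'filter' does not."""
    return {
        "query": {
            "bool": {
                # Scored: documents matching the keywords better rank higher.
                "must": [{"match": {"title": keywords}}],
                # Not scored: only narrows the result set.
                "filter": [{"range": {"post_date": {"gte": since}}}],
            }
        }
    }


body = build_search_body("elasticsearch", "2017-01-01")
print(json.dumps(body, indent=2))
```

The printed JSON could be used as the body of a search request against the website index.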

Text vs. keyword

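Since Elasticsearch 5.0 the string type has changed significantly: it was deprecated in 5.0 (and removed in 6.0) and split into two new data types, text for full-text search and keyword for keyword search.

Elasticsearch has two quite different ways of searching strings. You can match against the whole value, which is keyword search, or against the individual terms produced by analysis, which is full-text search. Anyone familiar with Elasticsearch will know the former as not-analyzed strings and the latter as analyzed strings.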

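text: the value is analyzed and then indexed

       Supports fuzzy and exact queries

       Does not support aggregations (by default, since fielddata is disabled on text fields)

keyword: the value is indexed as-is, without analysis

       Supports fuzzy and exact queries

       Supports aggregations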

In short, text is for full-text search and keyword is for keyword search.

To get something like a SQL LIKE query, define the field as keyword and query it with a wildcard query.
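
A small sketch of why the field type matters for LIKE-style matching. A wildcard pattern is compared against individual indexed terms: a keyword field has the whole value as its single term, while a text field has one term per token. The code uses Python's fnmatchcase as a stand-in for the * and ? wildcards (unlike Elasticsearch, fnmatch also gives [ ] a special meaning), and wildcard_query, matches_any_term and the sample value are made up for illustration.

```python
from fnmatch import fnmatchcase


def wildcard_query(field, pattern):
    """Build the body of a wildcard query for the given field."""
    return {"query": {"wildcard": {field: {"value": pattern}}}}


def matches_any_term(pattern, terms):
    """Return True if the pattern matches at least one indexed term."""
    return any(fnmatchcase(term, pattern) for term in terms)


value = "Quick Brown Fox"
keyword_terms = [value]             # keyword: the whole value is one term
text_terms = value.lower().split()  # text (roughly): one lowercased term per word

# A pattern that spans a space can only match when the value is a single term.
print(matches_any_term("*ck Br*", keyword_terms))
print(matches_any_term("*ck Br*", text_terms))

print(wildcard_query("new_field", "*ck Br*"))
```

Here the pattern matches the keyword term but none of the text terms, which is the behavior a LIKE '%ck Br%' query expects.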

Original article: https://www.cnblogs.com/jiahaoJAVA/p/11009392.html