1. Analyzing Data
1.1 What's analysis?
- Inverted index: each entry in the index table consists of an attribute value (a term) and the addresses of all records that have that value. Because records are located from the value, rather than the value being looked up from a record, it is called an "inverted" index.
- Analysis: the series of operations Elasticsearch performs on every analyzed field of a document before it is put into the inverted index:
  - Character filtering: character filters transform characters, e.g. converting uppercase to lowercase or turning & into and;
  - Tokenization: the text is split into one or more tokens;
  - Token filtering: token filters transform each token;
  - Token indexing: the tokens, together with their links back to the document, are stored in the index.
Example:
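A minimal sketch using the _analyze API with the built-in standard analyzer (the sample text is made up for illustration; the expected result is shown below the separator):
POST _analyze
{
  "analyzer": "standard",
  "text": "The QUICK Brown-Foxes jumped!"
}
-----------------------------------------------
[ the, quick, brown, foxes, jumped ]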
1.2 Anatomy of an Analyzer
a. 3 parts
An analyzer — whether built-in or custom — is just a package which contains three lower-level building blocks: character filters, tokenizers, and token filters.
The built-in analyzers pre-package these building blocks into analyzers suitable for different languages and types of text. Elasticsearch also exposes the individual building blocks so that they can be combined to define new custom analyzers.
(1)Character Filters
A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.
An analyzer may have zero or more character filters, which are applied in order.
(2)Tokenizer
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].
The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word which the term represents.
An analyzer must have exactly one tokenizer.
(3)Token Filters
A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like "the" from the token stream, and a synonym token filter introduces synonyms into the token stream.
Token filters are not allowed to change the position or character offsets of each token.
An analyzer may have zero or more token filters, which are applied in order.
b. Built-in analyzers
Elasticsearch ships with a wide range of built-in analyzers, which can be used in any index without further configuration:
(1)Standard Analyzer
The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
(2)Simple Analyzer
The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.
(3)Whitespace Analyzer
The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.
(4)Stop Analyzer
The stop analyzer is like the simple analyzer, but also supports removal of stop words.
(5)Keyword Analyzer
The keyword analyzer is a "noop" analyzer that accepts whatever text it is given and outputs the exact same text as a single term (it treats the whole field as one keyword; in practice it is usually better to map the field as the keyword type so it is not analyzed at all).
(6)Pattern Analyzer
The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words.
(7)Language Analyzers
Elasticsearch provides many language-specific analyzers like english or french (34 of them, but Chinese is not included).
(8)Fingerprint Analyzer
The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.
If you do not find an analyzer suitable for your needs, you can create a custom analyzer which combines the appropriate character filters, tokenizer, and token filters.
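For instance, a minimal sketch of a custom analyzer combining the three kinds of building blocks (the index and analyzer names are placeholders):
PUT my-custom-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}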
1.3 Char Filter
Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.
(1)HTML Strip Character Filter
The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp;.
(2)Mapping Character Filter
The mapping character filter replaces any occurrences of the specified strings with the specified replacements.
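For example, a quick _analyze sketch with an inline mapping character filter, echoing the "& becomes and" example from section 1.1 (the sample text is made up):
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [ "& => and" ]
    }
  ],
  "text": "you & me"
}
-----------------------------------------------
[ you, and, me ]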
(3)Pattern Replace Character Filter
The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement.
1.4 Tokenizer
The tokenizer is also responsible for recording the following:
- Order or position of each term (used for phrase and word proximity queries)
- Start and end character offsets of the original word which the term represents (used for highlighting search snippets).
- Token type, a classification of each term produced, such as <ALPHANUM>, <HANGUL>, or <NUM>. Simpler analyzers only produce the word token type.
Elasticsearch has a number of built in tokenizers which can be used to build custom analyzers.
a. Word Oriented Tokenizers
The following tokenizers are usually used for tokenizing full text into individual words:
(1)Standard Tokenizer
The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.
(2)Letter Tokenizer
The letter tokenizer divides text into terms whenever it encounters a character which is not a letter.
(3)Lowercase Tokenizer
The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.
(4)Whitespace Tokenizer
The whitespace tokenizer divides text into terms whenever it encounters any whitespace character (spaces, tabs, newlines, etc.); note that it does not strip punctuation.
(5)UAX URL Email Tokenizer
The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
(6)Classic Tokenizer
The classic tokenizer is a grammar based tokenizer for the English Language.
(7)Thai Tokenizer
The thai tokenizer segments Thai text into words.
b. Partial Word Tokenizers
These tokenizers break up text or words into small fragments, for partial word matching:
(1)N-Gram Tokenizer
The ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck].
It first splits the text into words on the specified characters (such as whitespace or punctuation), then slices each word into fragments of N characters.
(2)Edge N-Gram Tokenizer
The edge_ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. quick → [q, qu, qui, quic, quick].
It slices from the start of each word, which is useful for prefix-match style searches.
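A small _analyze sketch with an inline edge_ngram tokenizer (the min_gram/max_gram values are chosen here purely for illustration):
POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 5,
    "token_chars": [ "letter" ]
  },
  "text": "quick"
}
-----------------------------------------------
[ q, qu, qui, quic, quick ]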
c. Structured Text Tokenizers
The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text:
(1)Keyword Tokenizer
The keyword tokenizer is a "noop" tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalise the analysed terms.
(2)Pattern Tokenizer
The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.
The default pattern is \W+, which splits text whenever it encounters non-word characters.
POST _analyze
{
"tokenizer": "pattern",
"text": "The foo_bar_size's default is 5."
}
-----------------------------------------------
[ The, foo_bar_size, s, default, is, 5 ]
(3)Simple Pattern Tokenizer
The simple_pattern tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the pattern tokenizer.
This tokenizer does not support splitting the input on a pattern match, unlike the pattern tokenizer. To split on pattern matches using the same restricted regular expression subset, see the simple_pattern_split tokenizer.
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
-----------------------------------------------
[ 786, 335, 514 ]
(4)Char Group Tokenizer
The char_group tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of using the pattern tokenizer is not acceptable.
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
"-",
"\n"
]
},
"text": "The QUICK brown-fox"
}
-----------------------------------------------
{
"tokens": [
{
"token": "The",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "QUICK",
"start_offset": 4,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "brown",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "fox",
"start_offset": 16,
"end_offset": 19,
"type": "word",
"position": 3
}
]
}
(5)Simple Pattern Split Tokenizer
The simple_pattern_split tokenizer uses a regular expression to split the input into terms at pattern matches. The set of regular expression features it supports is more limited than the pattern tokenizer, but the tokenization is generally faster.
This tokenizer does not produce terms from the matches themselves: unlike simple_pattern, the matched text acts as the separator rather than becoming the tokens. To produce terms from matches using patterns in the same restricted regular expression subset, see the simple_pattern tokenizer.
This tokenizer uses Lucene regular expressions. For an explanation of the supported features and syntax, see Regular Expression Syntax.
The default pattern is the empty string, which produces one term containing the full input. This tokenizer should always be configured with a non-default pattern.
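For illustration, a sketch in the same style as the example above, splitting on underscores (the index name is a placeholder):
PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "_"
        }
      }
    }
  }
}
POST my-index-000002/_analyze
{
  "analyzer": "my_analyzer",
  "text": "an_underscored_phrase"
}
-----------------------------------------------
[ an, underscored, phrase ]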
(6)Path Tokenizer
The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz ].
1.5 Token Filter
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
Apostrophe
ASCII folding
CJK bigram
CJK width
Classic
Common grams
Conditional
Decimal digit
Delimited payload
Dictionary decompounder
Edge n-gram
Elision
Fingerprint
Flatten graph
Hunspell
Hyphenation decompounder
Keep types
Keep words
Keyword marker
Keyword repeat
KStem
Length
Limit token count
Lowercase
MinHash
Multiplexer
N-gram
Normalization
Pattern capture
Pattern replace
Phonetic
Porter stem
Predicate script
Remove duplicates
Reverse
Shingle
Snowball
Stemmer
Stemmer override
Stop
Synonym
Synonym graph
Trim
Truncate
Unique
Uppercase
Word delimiter
Word delimiter graph
1.6 Others
a. Configuring analyzers
- Configure analyzers when creating an index;
- Configure analyzers through index templates;
- Set a global default analyzer in the Elasticsearch configuration;
- Full-text queries can also specify an analyzer; the search-time precedence is as follows (a configuration sketch follows this list):
  - the analyzer specified in the query itself;
  - the search_analyzer of the field being searched;
  - the analyzer of the field being searched;
  - the default_search analyzer in the index settings;
  - the default analyzer in the index settings;
  - the standard analyzer.
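A minimal sketch showing several of these levels in one place (the index name, field, and analyzer choices are placeholders):
PUT my-articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "type": "simple" },
        "default_search": { "type": "whitespace" }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "simple"
      }
    }
  }
}
GET my-articles/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Brown Foxes",
        "analyzer": "stop"
      }
    }
  }
}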
b. Using the analysis APIs
_analyze
- Analyzes a given string with a given analyzer and shows the result directly;
- You can specify any built-in or custom analyzer;
- You can even specify the character filters, tokenizer, and token filters individually;
_termvectors
- Shows how a specific document was actually indexed;
- i.e. which tokens the document produced, plus each token's term frequency, position, start and end offsets, and so on;
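For instance, a sketch against the book index and document 1101 used in the examples of section 2.2 (the exact output depends on your mapping):
GET /book/_termvectors/1101
{
  "fields": [ "description" ],
  "term_statistics": true,
  "positions": true,
  "offsets": true
}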
2. Relevance Calculation / Scoring
2.1 TF/IDF
The relevance score algorithm, simply put, measures how closely the text stored in an index matches the search text.
Elasticsearch uses the Term Frequency / Inverse Document Frequency algorithm, TF/IDF for short.
- Term Frequency: how many times each term of the search text appears in the field; the more occurrences, the more relevant;
- Inverse Document Frequency: how many times each term of the search text appears across all documents of the index; the more occurrences, the less relevant;
- Field-length Norm: the longer the field containing the matched term, the weaker the relevance (compare a hit in title with a hit in body). A rough formula sketch follows this list.
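As a rough sketch, the classic Lucene practical scoring function behind TF/IDF can be summarised as follows (simplified; the exact formulas depend on the Lucene/Elasticsearch version, so treat this as an illustration rather than the precise implementation):
\[
\mathrm{score}(q,d) \approx \sum_{t \in q} \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{norm}(d)\cdot \mathrm{boost}(t),\qquad
\mathrm{tf}(t,d)=\sqrt{\mathrm{freq}(t,d)},\quad
\mathrm{idf}(t)=1+\ln\frac{N}{\mathrm{df}(t)+1},\quad
\mathrm{norm}(d)=\frac{1}{\sqrt{\mathrm{length}(d)}}
\]
where N is the total number of documents, df(t) is the number of documents containing term t, and length(d) is the number of terms in the field.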
2.2 APIs for analyzing relevance
- How the _score is calculated:
GET /book/_search?explain=true
{
  "query": { "match": { "description": "Java程序员" } }
}
- How (and whether) a specific document matched:
GET /book/_explain/1101
{
  "query": { "match": { "description": "Java程序员" } }
}
3. Aggregations
3.1 What are aggregations?
Roughly speaking, aggregations are grouped statistics over a set of data, such as counting the occurrences of a term or computing the average of a numeric field.
They are everywhere in Kibana; all of its visualizations are built on them.
They come in two flavours: metric aggregations and bucket aggregations.
The biggest differences from search: (1) aggregations cannot use the inverted index and need field data (the un-analyzed values of a field); (2) at aggregation time the inverted index is un-inverted back into field data and loaded into memory, so frequent aggregation requires a lot of memory.
Post filter:
- Normally the filter/query runs first and the aggregations run on top of its results;
- Sometimes you want to aggregate over all the data first, and only then filter which documents are returned for display;
- A post filter runs after the aggregations and is largely independent of them; keep an eye on its performance. A sketch follows this list.
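A minimal sketch against the book index used elsewhere in these notes (the language field and its values are assumptions): the by_language terms aggregation runs over every book matching the query, while the returned hits are further narrowed by the post_filter.
GET /book/_search
{
  "query": { "match": { "description": "Java程序员" } },
  "aggs": {
    "by_language": {
      "terms": { "field": "language" }
    }
  },
  "post_filter": {
    "term": { "language": "java" }
  }
}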
3.2 Metric Aggregations
// TODO
3.3 Bucket Aggregations
// TODO
4. Improving Performance
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/tune-for-search-speed.html
a. Improving write performance
(1)Write in batches with the bulk API
- It saves the network overhead of repeatedly creating connections;
- Only testing reveals the best batch size; bigger is not always better, and an oversized batch eats memory;
- Bulk requests go through a processing queue; if indexing is too slow the queue fills up and subsequent requests are rejected. A sketch of the bulk format follows this list.
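A minimal sketch of the bulk NDJSON format (index name and documents are made up):
POST _bulk
{ "index": { "_index": "book", "_id": "1" } }
{ "title": "Elasticsearch in Action" }
{ "index": { "_index": "book", "_id": "2" } }
{ "title": "Lucene in Action" }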
(2)Configure a slower refresh interval
- Elasticsearch is near-real-time: newly written segments only become searchable after a refresh;
- A slower refresh rate lowers the segment-merge frequency, and segment merging is very resource-intensive;
- The default refresh interval is 1s; changing index.refresh_interval on an index takes effect immediately (see the sketch after this list).
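For example, on the book index used elsewhere in these notes (30s is an arbitrary value):
PUT /book/_settings
{
  "index": { "refresh_interval": "30s" }
}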
(3)Large one-off initial loads
- For example reindex jobs or one-time imports of baseline data;
- You can disable refresh and set the replica count to 0, then restore the normal values afterwards (see the sketch after this list);
- Every write must wait until all replicas report success, so the more replicas, the slower the writes;
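A sketch of the disable-then-restore pattern (book index assumed; restore to whatever your normal settings are):
PUT /book/_settings
{
  "index": { "refresh_interval": "-1", "number_of_replicas": 0 }
}
(run the bulk load, then restore)
PUT /book/_settings
{
  "index": { "refresh_interval": "1s", "number_of_replicas": 1 }
}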
(4)Disable OS swapping
- The OS automatically swaps rarely used memory out to disk (virtual memory);
- Elasticsearch runs on the JVM, and swapping can badly hurt GC; one common mitigation is sketched below.
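Besides disabling swap at the OS level, one common mitigation is to lock the Elasticsearch process memory (a sketch; requires the corresponding OS permissions):
# elasticsearch.yml
bootstrap.memory_lock: true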
(5)Use auto-generated ids
- With an explicitly supplied document id, Elasticsearch must first check whether a document with that id already exists in order to merge or update it;
- With auto-generated ids that check is skipped, which saves that overhead (see the sketch after this list).
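For instance (book index assumed), letting Elasticsearch generate the document id:
POST /book/_doc
{ "title": "Elasticsearch in Action" }
versus supplying an explicit id, which forces the existence check:
PUT /book/_doc/1
{ "title": "Elasticsearch in Action" }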
(6)Choose shard and replica counts sensibly
- The shard count affects the segment count: with fewer shards the allowed number of segments is also lower (small shards lead to small segments), which raises the segment-merge frequency and costs performance;
- For very large write volumes, keep each index bounded (split by month, week, or day as appropriate, or use rollover) and pick a shard count that suits the number of cluster nodes, so that each shard holds a limited amount of data;
- The more replicas, the slower the writes;
(7)Set field mappings sensibly
- Don't analyze fields that don't need analysis (see the mapping sketch after this list).
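A mapping sketch (index and field names are made up): keyword fields are not analyzed, and fields that are never searched can skip indexing entirely:
PUT /my-books
{
  "mappings": {
    "properties": {
      "description": { "type": "text", "analyzer": "standard" },
      "isbn":        { "type": "keyword" },
      "cover_url":   { "type": "keyword", "index": false }
    }
  }
}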
b. Improving query performance
(1)Use the filter context
- Not computing relevance scores saves resources;
- Filters can also be cached (a sketch follows).
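For example, a bool query whose clauses all run in filter context (the field names are assumptions):
GET /book/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "language": "java" } },
        { "range": { "price": { "lte": 50 } } }
      ]
    }
  }
}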
(2)Avoid scripts
- Scripts are very expensive: they are evaluated on every execution and cannot be cached;
- If you really need them, use painless or expression;
(3)Pre-index fields
// TODO
c. Saving disk space
// TODO