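Getting Started with Elasticsearch
Introduction
Full-text search is one of the most common requirements, and the open-source Elasticsearch is currently the first choice among full-text search engines. It can store, search, and analyze massive amounts of data quickly. Under the hood it uses the open-source library Lucene, but Lucene cannot be used directly: you have to write your own code to call its interfaces. Elasticsearch wraps Lucene and exposes a REST API, so it works out of the box.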
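Basic Concepts
Index
As a verb: to add data, comparable to insert in MySQL;
As a noun: the place where data is stored, comparable to a Database in MySQL.
Type
Within an Index, one or more Types can be defined. A Type is comparable to a Table in MySQL: it groups data of the same kind together.
Document
A piece of data stored under some Index and of some Type is called a Document. Documents are in JSON format; a Document is comparable to a row in a MySQL Table.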
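Inverted Index
Why is ES search fast? Because it uses an inverted index.
An analyzer splits each sentence into individual terms.
Suppose the stored records, numbered 1 to 5 in order, are:
- 红海行动
- 探索红海行动
- 红海特别行动
- 红海纪录片
- 特工红海特别探索
The resulting inverted index table is then:
Term | Records |
---|---|
红海 | 1,2,3,4,5 |
行动 | 1,2,3 |
探索 | 2,5 |
特别 | 3,5 |
纪录片 | 4 |
特工 | 5 |
For example, searching for 红海特工行动 looks up the terms 红海, 特工, and 行动, then computes a relevance score for each matching record. Under a simplified score of matched terms divided by the record's own term count, record 3 matches 2 of its 3 terms (2/3), while record 1 matches 2 of its 2 terms (2/2), so record 1 ranks highest by this measure. The actual ES scoring formula also takes into account how rare each term is.
For example, searching for 红海行动: record 1 is the best match.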
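The lookup described above can be sketched in a few lines of Python. This is only an illustration: the records are segmented by hand (a real analyzer would do this), and `naive_scores` is a hypothetical helper that uses the simplified ratio "matched terms / terms in the record", not the scoring formula ES actually applies. Under this naive ratio, record 1 comes out on top for the query 红海特工行动.

```python
from collections import defaultdict

# Records from the example, already split into terms by hand.
RECORDS = {
    1: ["红海", "行动"],
    2: ["探索", "红海", "行动"],
    3: ["红海", "特别", "行动"],
    4: ["红海", "纪录片"],
    5: ["特工", "红海", "特别", "探索"],
}


def build_inverted_index(records):
    """Map each term to the sorted list of record ids that contain it."""
    index = defaultdict(set)
    for record_id, terms in records.items():
        for term in terms:
            index[term].add(record_id)
    return {term: sorted(ids) for term, ids in index.items()}


def naive_scores(records, index, query_terms):
    """Score each hit as matched query terms / number of terms in the record."""
    hits = defaultdict(int)
    for term in query_terms:
        for record_id in index.get(term, []):
            hits[record_id] += 1
    return {rid: count / len(records[rid]) for rid, count in hits.items()}


index = build_inverted_index(RECORDS)
scores = naive_scores(RECORDS, index, ["红海", "特工", "行动"])
best = max(scores, key=scores.get)
print(index["红海"])  # [1, 2, 3, 4, 5]
print(best, scores[best])
```

Only the records listed under the query terms are touched, which is why the lookup stays cheap even when the collection is large.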
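Removing the Type Concept
In a relational database two tables are independent: even if they contain columns with the same name, this does not affect their use. ES is different. ES is a search engine built on Lucene, and fields with the same name under different types of one index are ultimately handled the same way in Lucene.
- Two user_name fields under two different types are in fact treated as the same field within one ES index, so the same field mapping must be defined in both types. Otherwise, identically named fields in different types conflict during processing, which lowers Lucene's efficiency.
- Types were removed to make ES process data more efficiently.
- In Elasticsearch 7.x, the type parameter in the URL is optional. For example, indexing a document no longer requires a document type.
- In Elasticsearch 8.x, the type parameter in the URL is no longer supported.
- Solution: migrate indexes from multiple types to a single type, with a separate index for each document type.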
Installing ES with Docker
- Download elasticsearch (storage and retrieval) and kibana (visual retrieval)
docker pull elasticsearch:7.8.0
docker pull kibana:7.8.0
- Configure
# Mount directories inside the container onto the /docker directory of the Linux host
# Editing files under /docker then changes what the container sees
mkdir -p /docker/elasticsearch7.8.0/config
mkdir -p /docker/elasticsearch7.8.0/data
mkdir -p /docker/elasticsearch7.8.0/plugins
# Allow ES to be reached from any remote machine
echo "http.host: 0.0.0.0" >> /docker/elasticsearch7.8.0/config/elasticsearch.yml
# Relax the file permissions
chmod -R 777 /docker/elasticsearch7.8.0/
- Start elasticsearch
# Check available memory
[root@10 /]# free -m
              total        used        free      shared  buff/cache   available
Mem:            990         616          72           1         302         232
Swap:          2047         393        1654
# 9200 is the port for client requests; 9300 is the port for communication between cluster nodes
# The first -e runs ES in single-node mode
# The second -e sets the JVM heap size; in production it can be set to 32G
# Since this is a virtual machine, cap the heap at 512m
docker run --name elasticsearch7.8.0 -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms64m -Xmx512m" \
  -v /docker/elasticsearch7.8.0/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /docker/elasticsearch7.8.0/data:/usr/share/elasticsearch/data \
  -v /docker/elasticsearch7.8.0/plugins:/usr/share/elasticsearch/plugins \
  -d elasticsearch:7.8.0
# Start automatically on boot
docker update elasticsearch7.8.0 --restart=always
- Test elasticsearch
Visiting http://192.168.56.56:9200/ returns the elasticsearch version information
{
  "name": "0f6d6c60bc96",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "sDTdW7KnQayVrFC5ioijiQ",
  "version": {
    "number": "7.8.0",
    "build_flavor": "default",
    "build_type": "docker",
    "build_hash": "757314695644ea9a1dc2fecd26d1a43856725e65",
    "build_date": "2020-06-14T19:35:50.234439Z",
    "build_snapshot": false,
    "lucene_version": "8.5.1",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}
Visiting http://192.168.56.56:9200/_cat/nodes returns the elasticsearch node information
127.0.0.1 60 93 6 0.04 0.19 0.18 dilmrt * 0f6d6c60bc96
- Start kibana
# kibana is pointed at the ES port 9200
# 5601 is the port of the kibana home page
docker run --name kibana7.8.0 -e ELASTICSEARCH_HOSTS=http://192.168.56.56:9200 -p 5601:5601 -d kibana:7.8.0
# Start automatically on boot
docker update kibana7.8.0 --restart=always
- Test kibana
Visiting http://192.168.56.56:5601 returns the visual interface
Installing Nginx with Docker
- Start an Nginx instance and copy out its configuration files
# The image is downloaded automatically if it does not exist locally
docker run -p 80:80 --name nginx1.10 -d nginx:1.10
# Create the folder that will hold the nginx files
mkdir -p /docker/nginx1.10
# Copy the configuration files from the container into the current directory
cd /docker
docker container cp nginx1.10:/etc/nginx .
# Stop and remove the container, rename the folder to conf, and move it into nginx1.10
docker stop nginx1.10
docker rm nginx1.10
mv nginx conf
mv conf nginx1.10/
- Start Nginx
docker run -p 80:80 --name nginx1.10 \
  -v /docker/nginx1.10/html:/usr/share/nginx/html \
  -v /docker/nginx1.10/logs:/var/log/nginx \
  -v /docker/nginx1.10/conf:/etc/nginx \
  -d nginx:1.10
# Start automatically on boot
docker update nginx1.10 --restart=always
- Test Nginx
Visiting http://192.168.56.56 returns a page
Basic Retrieval
Retrieving Information
- GET /_cat/nodes
View all nodes
# http://192.168.56.56:9200/_cat/nodes
127.0.0.1 64 93 2 0.00 0.03 0.10 dilmrt * 0f6d6c60bc96
# 0f6d6c60bc96 is the node name; * marks the master node
- GET /_cat/health
View the health of ES
# http://192.168.56.56:9200/_cat/health
1617779285 07:08:05 elasticsearch green 1 1 6 6 0 0 0 0 - 100.0%
# green means the cluster is healthy
- GET /_cat/master
View the master node
# http://192.168.56.56:9200/_cat/master
-fBJbk3HQxq4oxHVP5o8XQ 127.0.0.1 127.0.0.1 0f6d6c60bc96
# -fBJbk3HQxq4oxHVP5o8XQ is the unique id of the master node
# 127.0.0.1 is the address of the node
- GET /_cat/indices
View all indexes, comparable to show databases; in MySQL
# http://192.168.56.56:9200/_cat/indices
green open .kibana-event-log-7.8.0-000001 NSvWWbd7SaqNmoJ6QmjIRg 1 0 1 0 5.3kb 5.3kb
green open .apm-custom-link mn9tqI-0QnOkI5JAp1rCHw 1 0 0 0 208b 208b
green open .kibana_task_manager_1 k5bSwn03TA-Hpisuzf677A 1 0 5 2 74.2kb 74.2kb
green open .apm-agent-configuration ZXRvqEdDSL2555OE8MyNSA 1 0 0 0 208b 208b
green open .kibana_1 _yCppL1mQ1a0-v88yOXNTQ 1 0 13 1 72.4kb 72.4kb
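Adding Documents
To save a document, specify which index and which type it goes under (comparable to which database and which table in MySQL) and which unique identifier to use.
PUT customer/external/1
Save document 1 under the external type of the customer index:
# Postman: add a document with PUT
# PUT http://192.168.56.56:9200/customer/external/1
# On success the response status is 201 Created, meaning the record was inserted.
# Sending the same request again performs an update
{
"_index": "customer", # the index the document is stored in
"_type": "external", # the type the document is stored under
"_id": "1", # the id of the saved document
"_version": 1, # the version of the saved document
"result": "created", # a document was created; putting the document again changes this to updated and also changes the version number
"_shards": { # shards
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 0, # sequence number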
"_primary_term": 1
}
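POST customer/external
# Postman: add a document with POST
# POST http://192.168.56.56:9200/customer/external
# No id is specified, so sending the request again creates a new document each time
{
"_index": "customer",
"_type": "external",
"_id": "dBNCq3gBsa8QUaibccNi", # no id was specified, so one is generated automatically; the result is created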
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 2,
"_primary_term": 1
}
POST customer/external/3
# Postman: add a document with POST
# POST http://192.168.56.56:9200/customer/external/3
# Sending the same request again performs an update
{
"_index": "customer",
"_type": "external",
"_id": "3", # the specified id is used; the result is created
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 5,
"_primary_term": 1
}
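Summary:
- POST adds a document. If no id is specified, one is generated automatically.
  - The id is optional; without an id the operation is always a create
  - With an id that does not exist yet, the operation is also a create
  - With an id that already exists, the operation is an update and the version number is incremented (only the _update endpoint skips the write when the content is unchanged)
- PUT adds or modifies a document. PUT must specify an id.
  - It is typically used for modifications; omitting the id is an error
  - The version number always increases
_version is the version number. It starts at 1 and is incremented after every successful operation on the document.
_seq_no is the sequence number. It is 0 when data is first inserted into the index and increases by 1 for every successful operation on data in the index; each document records the sequence number of the operation that put it into its current state.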
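Querying Documents
GET /customer/external/1
# GET http://192.168.56.56:9200/customer/external/1
{
"_index": "customer",
"_type": "external",
"_id": "1",
"_version": 2,
"_seq_no": 1, # concurrency-control field, incremented on every update; used for optimistic locking
"_primary_term": 1, # also used for concurrency control; changes when the primary shard is reassigned, for example after a restart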
"found": true,
"_source": {
"name": "parzulpan"
}
}
Optimistic locking: pass the if_seq_no=1&if_primary_term=1 parameters. The modification is applied only when the sequence number and primary term match; otherwise it is rejected.
- Update name to parzulpan1
# PUT http://192.168.56.56:9200/customer/external/1?if_seq_no=1&if_primary_term=1
{
  "_index": "customer",
  "_type": "external",
  "_id": "1",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 8,
  "_primary_term": 1
}
# Query again
# GET http://192.168.56.56:9200/customer/external/1
{
  "_index": "customer",
  "_type": "external",
  "_id": "1",
  "_version": 4,
  "_seq_no": 9,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "name": "parzulpan1"
  }
}
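To make the optimistic-locking rule concrete, here is a minimal in-memory sketch. `TinyDocStore` and `VersionConflict` are hypothetical names invented for this illustration; they are not part of any Elasticsearch client. The sketch only mirrors the rule that a write carrying if_seq_no and if_primary_term is accepted when both values still match the stored document.

```python
class VersionConflict(Exception):
    """Raised when the supplied seq_no / primary_term no longer match."""


class TinyDocStore:
    """In-memory stand-in for a single index (hypothetical, not the ES client)."""

    def __init__(self):
        self._docs = {}
        self._next_seq_no = 0
        self._primary_term = 1

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        current = self._docs.get(doc_id)
        if if_seq_no is not None or if_primary_term is not None:
            # Conditional write: both values must match the stored document.
            stale = (
                current is None
                or current["_seq_no"] != if_seq_no
                or current["_primary_term"] != if_primary_term
            )
            if stale:
                raise VersionConflict(f"document {doc_id} was modified concurrently")
        doc = {
            "_id": doc_id,
            "_version": 1 if current is None else current["_version"] + 1,
            "_seq_no": self._next_seq_no,
            "_primary_term": self._primary_term,
            "_source": dict(source),
        }
        self._next_seq_no += 1
        self._docs[doc_id] = doc
        return doc


store = TinyDocStore()
first = store.index("1", {"name": "parzulpan"})
# Two clients read the same state, then both try to write with the values they read.
store.index("1", {"name": "parzulpan1"},
            if_seq_no=first["_seq_no"], if_primary_term=first["_primary_term"])
try:
    store.index("1", {"name": "parzulpan2"},
                if_seq_no=first["_seq_no"], if_primary_term=first["_primary_term"])
except VersionConflict as exc:
    print("rejected:", exc)
```

The second conditional write is rejected because the first one already advanced the sequence number, so the later writer has to re-read the document and retry.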
Updating Documents
Option 1: POST customer/external/1/_update
# POST http://192.168.56.56:9200/customer/external/1/_update
{
"doc": { # note that the fields must be wrapped in doc
"name": "parzulpanUpdate"
}
}
# Response
{
"_index": "customer",
"_type": "external",
"_id": "1",
"_version": 5,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 10,
"_primary_term": 1
}
# Running the same update again does nothing; the version number and sequence number stay the same
{
"_index": "customer",
"_type": "external",
"_id": "1",
"_version": 5,
"result": "noop", # no operation
"_shards": {
"total": 0,
"successful": 0,
"failed": 0
},
"_seq_no": 10,
"_primary_term": 1
}
Option 2: POST customer/external/1
# POST http://192.168.56.56:9200/customer/external/1
{
"name": "parzulpanUpdate"
}
# Response
{
"_index": "customer",
"_type": "external",
"_id": "1",
"_version": 6,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 11,
"_primary_term": 1
}
# Running the same update again succeeds, and the version number and sequence number change
{
"_index": "customer",
"_type": "external",
"_id": "1",
"_version": 7,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 12,
"_primary_term": 1
}
Option 3: PUT customer/external/1
# PUT http://192.168.56.56:9200/customer/external/1/
{
"name": "parzulpanUpdate"
}
# Response
{
"_index": "customer",
"_type": "external",
"_id": "1",
"_version": 8,
"result": "updated",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 13,
"_primary_term": 1
}
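# Running the same update again succeeds, and the version number and sequence number change
Summary:
- POST with _update: if the data is identical, the document is not saved again and the version number and sequence number do not change
- POST without _update: the document is always saved again and the version number and sequence number change
- PUT: the document is always saved again and the version number and sequence number change
- When to use which: for highly concurrent updates, prefer the form without _update; for highly concurrent reads with only occasional updates, prefer the form with _update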
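The difference between the two update styles can be mimicked with plain dictionaries. The sketch below is an assumption-laden illustration: `partial_update` and `full_index` are hypothetical helpers, the merge is shallow, and sequence numbers are left out. It only shows why the _update form can answer noop while the plain form always bumps the version.

```python
def partial_update(stored, doc):
    """Mimic POST .../_update: merge `doc`, skip the write if nothing changes."""
    merged = {**stored["_source"], **doc}
    if merged == stored["_source"]:
        return {**stored, "result": "noop"}
    return {"_source": merged, "_version": stored["_version"] + 1, "result": "updated"}


def full_index(stored, source):
    """Mimic PUT/POST without _update: always rewrite and bump the version."""
    return {"_source": dict(source), "_version": stored["_version"] + 1, "result": "updated"}


stored = {"_source": {"name": "parzulpan"}, "_version": 4}
once = partial_update(stored, {"name": "parzulpanUpdate"})
twice = partial_update(once, {"name": "parzulpanUpdate"})
again = full_index(twice, {"name": "parzulpanUpdate"})
print(once["result"], once["_version"])    # updated 5
print(twice["result"], twice["_version"])  # noop 5
print(again["result"], again["_version"])  # updated 6
```

The comparison step is the extra work the _update form performs before writing, which is the trade-off behind the usage advice in the summary.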
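Deleting Documents or Indexes
Note that ES provides no operation for deleting a type; it only supports deleting documents or indexes.
Deleting a document
# Delete the document with id=1, then query it again
# DELETE http://192.168.56.56:9200/customer/external/1
{
"_index": "customer",
"_type": "external",
"_id": "1",
"_version": 10,
"result": "deleted", # the document has been deleted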
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 15,
"_primary_term": 1
}
# Run DELETE http://192.168.56.56:9200/customer/external/1 again
{
"_index": "customer",
"_type": "external",
"_id": "1",
"_version": 11,
"result": "not_found", # not found
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 16,
"_primary_term": 1
}
# GET http://192.168.56.56:9200/customer/external/1
{
"_index": "customer",
"_type": "external",
"_id": "1",
"found": false # not found
}
Deleting an index
# Delete the entire customer index
# Before deletion: GET http://192.168.56.56:9200/_cat/indices
green open .kibana-event-log-7.8.0-000001 NSvWWbd7SaqNmoJ6QmjIRg 1 0 1 0 5.3kb 5.3kb
green open .apm-custom-link mn9tqI-0QnOkI5JAp1rCHw 1 0 0 0 208b 208b
green open .kibana_task_manager_1 k5bSwn03TA-Hpisuzf677A 1 0 5 2 74.2kb 74.2kb
green open .apm-agent-configuration ZXRvqEdDSL2555OE8MyNSA 1 0 0 0 208b 208b
green open .kibana_1 _yCppL1mQ1a0-v88yOXNTQ 1 0 15 0 34.9kb 34.9kb
yellow open customer t6RCi8QZQoiEx-wxJQvlmw 1 1 5 0 4.6kb 4.6kb
# DELETE http://192.168.56.56:9200/customer
{
"acknowledged": true
}
# After deletion: GET http://192.168.56.56:9200/_cat/indices
green open .kibana-event-log-7.8.0-000001 NSvWWbd7SaqNmoJ6QmjIRg 1 0 1 0 5.3kb 5.3kb
green open .apm-custom-link mn9tqI-0QnOkI5JAp1rCHw 1 0 0 0 208b 208b
green open .kibana_task_manager_1 k5bSwn03TA-Hpisuzf677A 1 0 5 2 74.2kb 74.2kb
green open .apm-agent-configuration ZXRvqEdDSL2555OE8MyNSA 1 0 0 0 208b 208b
green open .kibana_1 _yCppL1mQ1a0-v88yOXNTQ 1 0 15 0 34.9kb 34.9kb
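Batch Operations: bulk
In a batch operation, when one action fails the remaining actions still run; the actions are independent of one another.
The bulk API executes all actions in order. If a single action fails for any reason, it continues with the actions after it. When the bulk API returns, it reports the status of every action (in the order they were sent), so you can check whether a particular action failed.
Note that the bulk request body is newline-delimited JSON rather than a single JSON document, which makes it awkward to test with Postman's JSON or text body; Kibana's Dev Tools is more convenient.
Example 1: indexing several documents
# Run the following command in the Kibana Dev Tools console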
POST /customer/external/_bulk
{"index":{"_id":"1"}}
{"name":"John Doe"}
{"index":{"_id":"2"}}
{"name":"John Doe"}
# Response
#! Deprecation: [types removal] Specifying types in bulk requests is deprecated.
{
"took" : 368, # time the command took
"errors" : false, # no errors occurred
"items" : [ # the result of each action
{
"index" : { # first document
"_index" : "customer",
"_type" : "external",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1,
"status" : 201 # created
}
},
{
"index" : { # second document
"_index" : "customer",
"_type" : "external",
"_id" : "2",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1,
"status" : 201
}
}
]
}
Example 2: batch operations across indexes (each action names its own index)
POST /_bulk
{"delete":{"_index":"website","_type":"blog","_id":"123"}}
{"create":{"_index":"website","_type":"blog","_id":"123"}}
{"title":"my first blog post"}
{"index":{"_index":"website","_type":"blog"}}
{"title":"my second blog post"}
{"update":{"_index":"website","_type":"blog","_id":"123"}}
{"doc":{"title":"my updated blog post"}}
# Response
#! Deprecation: [types removal] Specifying types in bulk requests is deprecated.
{
"took" : 450,
"errors" : false,
"items" : [
{
"delete" : { # delete
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 1,
"result" : "not_found",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1,
"status" : 404
}
},
{
"create" : { # create
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 2,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1,
"status" : 201
}
},
{
"index" : { # save
"_index" : "website",
"_type" : "blog",
"_id" : "nxPjrHgBsa8QUaibx-rD",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 2,
"_primary_term" : 1,
"status" : 201
}
},
{
"update" : { # update
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 3,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 3,
"_primary_term" : 1,
"status" : 200
}
}
]
}
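Because the bulk body is a sequence of one-line JSON objects rather than one JSON document, it helps to generate it programmatically. The following is a minimal sketch; `build_bulk_body` is a hypothetical helper written for this illustration, and the `_type` key is left out of the metadata on purpose because the responses above warn that types in bulk requests are deprecated.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into a bulk request body.

    `source` is None for actions that carry no document line, such as delete.
    """
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:
            lines.append(json.dumps(source))
    # Every line, including the last one, ends with a newline character.
    return "\n".join(lines) + "\n"


operations = [
    ("delete", {"_index": "website", "_id": "123"}, None),
    ("create", {"_index": "website", "_id": "123"}, {"title": "my first blog post"}),
    ("index", {"_index": "website"}, {"title": "my second blog post"}),
    ("update", {"_index": "website", "_id": "123"}, {"doc": {"title": "my updated blog post"}}),
]
body = build_bulk_body(operations)
print(body, end="")
```

The resulting string can be pasted under POST /_bulk in the Dev Tools console or sent as the raw request body by any HTTP client.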
Sample Test Data
A sample of fictitious JSON documents with customer bank account information (file link).
The format is:
{
"account_number": 1,
"balance": 39225,
"firstname": "Amber",
"lastname": "Duke",
"age": 32,
"gender": "M",
"address": "880 Holmes Lane",
"employer": "Pyrami",
"email": "amberduke@pyrami.com",
"city": "Brogan",
"state": "IL"
}
POST bank/account/_bulk
{"index":{"_id":"1"}}
{"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
...
GET http://192.168.56.56:9200/_cat/indices
green open .kibana-event-log-7.8.0-000001 NSvWWbd7SaqNmoJ6QmjIRg 1 0 1 0 5.3kb 5.3kb
yellow open website 3rGabFSISrq8ZwdXxP331g 1 1 2 2 8.8kb 8.8kb
yellow open bank ZpN0_upESxqV84IVAgyvJw 1 1 1000 0 397kb 397kb
green open .apm-custom-link mn9tqI-0QnOkI5JAp1rCHw 1 0 0 0 208b 208b
green open .kibana_task_manager_1 k5bSwn03TA-Hpisuzf677A 1 0 5 2 74.2kb 74.2kb
green open .apm-agent-configuration ZXRvqEdDSL2555OE8MyNSA 1 0 0 0 208b 208b
green open .kibana_1 _yCppL1mQ1a0-v88yOXNTQ 1 0 28 2 63.9kb 63.9kb
yellow open customer kYEsiy1iQWa2S_7JSsG9kQ 1 1 2 0 3.6kb 3.6kb
# The bank index now contains the 1000 imported documents
Advanced Retrieval
SearchAPI
ES supports two basic ways of searching:
- Send the search parameters in the REST request URI, that is, URI + search parameters
GET http://192.168.56.56:9200/bank/_search?q=*&sort=account_number:asc
# q=* matches everything
# sort names the field to sort by
# asc means ascending order
# Response
{
  "took": 2, # time the search took, in ms
  "timed_out": false, # whether the search timed out
  "_shards": { # how many shards were searched, and how many succeeded or failed
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1000, # number of matching documents found
      "relation": "eq"
    },
    "max_score": null, # highest relevance score among the documents
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "0",
        "_score": null, # relevance score
        "_source": {
          "account_number": 0,
          "balance": 16623,
          "firstname": "Bradshaw",
          "lastname": "Mckenzie",
          "age": 29,
          "gender": "F",
          "address": "244 Columbus Place",
          "employer": "Euron",
          "email": "bradshawmckenzie@euron.com",
          "city": "Hobucken",
          "state": "CO"
        },
        "sort": [ # sort key (column) of the result; without one, results are sorted by score
          0
        ]
      },
      // ...
      {
        "_index": "bank",
        "_type": "account",
        "_id": "9",
        "_score": null,
        "_source": {
          "account_number": 9,
          "balance": 24776,
          "firstname": "Opal",
          "lastname": "Meadows",
          "age": 39,
          "gender": "M",
          "address": "963 Neptune Avenue",
          "employer": "Cedward",
          "email": "opalmeadows@cedward.com",
          "city": "Olney",
          "state": "OH"
        },
        "sort": [
          9
        ]
      }
    ]
  }
}
- Send a REST request body, that is, URI + request body
GET http://192.168.56.56:9200/bank/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    { "account_number": "asc" },
    { "balance": "desc" }
  ]
}
# Response
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1000,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "0",
        "_score": null,
        "_source": {
          "account_number": 0,
          "balance": 16623,
          "firstname": "Bradshaw",
          "lastname": "Mckenzie",
          "age": 29,
          "gender": "F",
          "address": "244 Columbus Place",
          "employer": "Euron",
          "email": "bradshawmckenzie@euron.com",
          "city": "Hobucken",
          "state": "CO"
        },
        "sort": [
          0,
          16623
        ]
      },
      // ...
      {
        "_index": "bank",
        "_type": "account",
        "_id": "9",
        "_score": null,
        "_source": {
          "account_number": 9,
          "balance": 24776,
          "firstname": "Opal",
          "lastname": "Meadows",
          "age": 39,
          "gender": "M",
          "address": "963 Neptune Avenue",
          "employer": "Cedward",
          "email": "opalmeadows@cedward.com",
          "city": "Olney",
          "state": "OH"
        },
        "sort": [
          9,
          24776
        ]
      }
    ]
  }
}
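Query DSL
ES provides a JSON-style DSL (domain-specific language) for executing queries, called Query DSL.
Basic Syntax
The typical structure of a query clause:
# When the query targets a specific field, its structure is:
{
QUERY_NAME:{ # the query type to use
FIELD_NAME:{ # the field, followed by the query's parameters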
ARGUMENT:VALUE,
ARGUMENT:VALUE,...
}
}
}
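Query example:
- query defines how to search; match_all matches every document in the index
- from is the offset of the first document to return and size is the number of documents to return; together they implement pagination
- sort defines the ordering; with several sort fields, a later field only orders documents whose earlier fields are equal, otherwise the earlier field decides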
GET http://192.168.56.56:9200/bank/_search
{
"query": { # the query definition
"match_all": {}
},
"from": 0, # offset of the first document to return
"size": 5,
"_source":["balance", "firstname"], # fields to return
"sort": [
{
"account_number": { # the column the results are sorted by
"order": "desc" # descending
}
}
]
}
# Response
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1000,
"relation": "eq"
},
"max_score": null,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "999",
"_score": null,
"_source": {
"firstname": "Dorothy",
"balance": 6087
},
"sort": [
999
]
},
{
"_index": "bank",
"_type": "account",
"_id": "998",
"_score": null,
"_source": {
"firstname": "Letha",
"balance": 16869
},
"sort": [
998
]
},
{
"_index": "bank",
"_type": "account",
"_id": "997",
"_score": null,
"_source": {
"firstname": "Combs",
"balance": 25311
},
"sort": [
997
]
},
{
"_index": "bank",
"_type": "account",
"_id": "996",
"_score": null,
"_source": {
"firstname": "Andrews",
"balance": 17541
},
"sort": [
996
]
},
{
"_index": "bank",
"_type": "account",
"_id": "995",
"_score": null,
"_source": {
"firstname": "Phelps",
"balance": 21153
},
"sort": [
995
]
}
]
}
}
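Since from and size together implement pagination, the request body for any page can be derived from a page number. The helper below is a small sketch; `page_query` and its parameters are names chosen for this illustration and are not part of any ES client.

```python
def page_query(page, page_size, sort_field, order="desc", source_fields=None):
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    body = {
        "query": {"match_all": {}},
        "from": (page - 1) * page_size,  # number of documents to skip
        "size": page_size,               # number of documents to return
        "sort": [{sort_field: {"order": order}}],
    }
    if source_fields is not None:
        body["_source"] = list(source_fields)
    return body


first_page = page_query(1, 5, "account_number", source_fields=["balance", "firstname"])
third_page = page_query(3, 5, "account_number")
print(first_page["from"], third_page["from"])  # 0 10
```

With these arguments, `first_page` has the same content as the request body shown above.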
query/match: match queries
For non-string values, match performs an exact match. For strings, it performs a full-text search.
- Non-string (primitive) values: exact match
GET http://192.168.56.56:9200/bank/_search
{
  "query": {
    "match": {
      "account_number": "20"
    }
  }
}
# Response
{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1, # one record found
      "relation": "eq"
    },
    "max_score": 1.0, # highest score
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "20",
        "_score": 1.0,
        "_source": { # the document
          "account_number": 20,
          "balance": 16418,
          "firstname": "Elinor",
          "lastname": "Ratliff",
          "age": 36,
          "gender": "M",
          "address": "282 Kings Place",
          "employer": "Scentric",
          "email": "elinorratliff@scentric.com",
          "city": "Ribera",
          "state": "WA"
        }
      }
    ]
  }
}
- Strings: full-text search. The search condition is analyzed into terms and matched against the inverted index, and the results are sorted by score.
GET http://192.168.56.56:9200/bank/_search
{
  "query": {
    "match": {
      "address": "kings"
    }
  }
}
# Response
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2, # two records found
      "relation": "eq"
    },
    "max_score": 5.990829, # highest score
    "hits": [
      {
        "_index": "bank",
        "_type": "account",
        "_id": "20",
        "_score": 5.990829, # score
        "_source": { # the document
          "account_number": 20,
          "balance": 16418,
          "firstname": "Elinor",
          "lastname": "Ratliff",
          "age": 36,
          "gender": "M",
          "address": "282 Kings Place",
          "employer": "Scentric",
          "email": "elinorratliff@scentric.com",
          "city": "Ribera",
          "state": "WA"
        }
      },
      {
        "_index": "bank",
        "_type": "account",
        "_id": "722",
        "_score": 5.990829,
        "_source": {
          "account_number": 722,
          "balance": 27256,
          "firstname": "Roberts",
          "lastname": "Beasley",
          "age": 34,
          "gender": "F",
          "address": "305 Kings Hwy",
          "employer": "Quintity",
          "email": "robertsbeasley@quintity.com",
          "city": "Hayden",
          "state": "PA"
        }
      }
    ]
  }
}
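query/match_phrase: phrase queries
The value to match is treated as one whole phrase (it is not split up) when searching.
- match_phrase does phrase matching: a document matches as long as its text contains the phrase.
- Matching against the keyword sub-field of a text field requires the condition to equal the entire field value; it is an exact match.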
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"match_phrase": {
"address": "mill road" # do not match documents with only mill or only road; match mill road as one phrase
}
}
}
# Response
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 8.926605,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "970",
"_score": 8.926605,
"_source": {
"account_number": 970,
"balance": 19648,
"firstname": "Forbes",
"lastname": "Wallace",
"age": 28,
"gender": "M",
"address": "990 Mill Road", # Mill Road
"employer": "Pheast",
"email": "forbeswallace@pheast.com",
"city": "Lopezo",
"state": "AK"
}
}
]
}
}
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"match": {
"address.keyword": "mill road" # exact match on the whole value
}
}
}
# Response
{
"took": 14,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"match": {
"address.keyword": "990 Mill Road" # exact match on the whole value, case-sensitive
}
}
}
# Response
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 6.5032897,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "970",
"_score": 6.5032897,
"_source": {
"account_number": 970,
"balance": 19648,
"firstname": "Forbes",
"lastname": "Wallace",
"age": 28,
"gender": "M",
"address": "990 Mill Road",
"employer": "Pheast",
"email": "forbeswallace@pheast.com",
"city": "Lopezo",
"state": "AK"
}
}
]
}
}
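The three behaviours above (match, match_phrase, and matching on the keyword sub-field) can be contrasted with a toy model. Everything here is an assumption made for illustration: `analyze` is a crude stand-in for an analyzer (lowercase, split on whitespace), and the three functions only return whether a value matches, without any scoring.

```python
def analyze(text):
    """Very rough stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def match(field_value, query):
    """True if any query term occurs in the field."""
    field_terms = set(analyze(field_value))
    return any(term in field_terms for term in analyze(query))


def match_phrase(field_value, query):
    """True if the query terms occur consecutively and in order."""
    field_terms, query_terms = analyze(field_value), analyze(query)
    width = len(query_terms)
    return any(
        field_terms[i:i + width] == query_terms
        for i in range(len(field_terms) - width + 1)
    )


def keyword_match(field_value, query):
    """True only if the whole, unanalyzed value is identical."""
    return field_value == query


address = "990 Mill Road"
print(match(address, "mill lane"))               # True
print(match_phrase(address, "mill road"))        # True
print(match_phrase(address, "road mill"))        # False
print(keyword_match(address, "mill road"))       # False
print(keyword_match(address, "990 Mill Road"))   # True
```

This mirrors the results above: the phrase mill road finds the document, while address.keyword only matches the complete value with its original casing.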
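query/multi_match: multi-field match queries
Matches documents whose state or address contains mill; the query string is analyzed into terms during the query.
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"multi_match": { # search several fields
"query": "mill",
"fields": [ # state or address contains the term mill; it need not be in both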
"state",
"address"
]
}
}
}
# Response
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 5.4032025,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "970",
"_score": 5.4032025,
"_source": {
"account_number": 970,
"balance": 19648,
"firstname": "Forbes",
"lastname": "Wallace",
"age": 28,
"gender": "M",
"address": "990 Mill Road",
"employer": "Pheast",
"email": "forbeswallace@pheast.com",
"city": "Lopezo",
"state": "AK"
}
},
{
"_index": "bank",
"_type": "account",
"_id": "136",
"_score": 5.4032025,
"_source": {
"account_number": 136,
"balance": 45801,
"firstname": "Winnie",
"lastname": "Holland",
"age": 38,
"gender": "M",
"address": "198 Mill Lane",
"employer": "Neteria",
"email": "winnieholland@neteria.com",
"city": "Urie",
"state": "IL"
}
},
{
"_index": "bank",
"_type": "account",
"_id": "345",
"_score": 5.4032025,
"_source": {
"account_number": 345,
"balance": 9812,
"firstname": "Parker",
"lastname": "Hines",
"age": 38,
"gender": "M",
"address": "715 Mill Avenue",
"employer": "Baluba",
"email": "parkerhines@baluba.com",
"city": "Blackgum",
"state": "KY"
}
},
{
"_index": "bank",
"_type": "account",
"_id": "472",
"_score": 5.4032025,
"_source": {
"account_number": 472,
"balance": 25571,
"firstname": "Lee",
"lastname": "Long",
"age": 32,
"gender": "F",
"address": "288 Mill Street",
"employer": "Comverges",
"email": "leelong@comverges.com",
"city": "Movico",
"state": "MT"
}
}
]
}
}
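query/bool/must: compound queries
A compound clause can combine any other query clauses, including other compound clauses. This means compound clauses can be nested within one another to express very complex logic.
- must: conditions that must match
- must_not: conditions that must not match
- should: conditions that should match; it is better if they do, but not required, and documents that satisfy them score higher
- Note: conditions listed under should raise the score of the documents that satisfy them but do not change which documents are returned. If should is the only kind of clause in the query, however, its conditions act as the default matching conditions and do change the result.
# Query documents with gender=M and address containing mill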
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"address": "mill"
}
},
{
"match": {
"gender": "M"
}
}
]
}
}
}
# Query documents with gender=M and address containing mill, but age!=38
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"gender": "M"
}
},
{
"match": {
"address": "mill"
}
}
],
"must_not": [
{
"match": {
"age": "38"
}
}
]
}
}
}
# Query documents with gender=M and address containing mill, but age!=18; lastname should be Wallace
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"gender": "M"
}
},
{
"match": {
"address": "mill"
}
}
],
"must_not": [
{
"match": {
"age": "18"
}
}
],
"should": [
{
"match": {
"lastname": "Wallace"
}
}
]
}
}
}
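query/filter: filtering query results
Not every query needs to produce a score, especially those used only to filter documents. To avoid computing scores, ES detects these cases automatically and optimizes query execution. must_not is also treated as a filter, so it does not contribute to the score either. Queries of this kind are therefore faster. In summary:
- must contributes to the score
- should contributes to the score
- must_not does not contribute to the score
- filter does not contribute to the score
# Find all documents whose address matches mill, then filter the results to 10000<=balance<=20000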
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"address": "mill"
}
}
],
"filter": {
"range": {
"balance": {
"gte": "10000",
"lte": "20000"
}
}
}
}
}
}
# Response
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 5.4032025,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "970",
"_score": 5.4032025,
"_source": {
"account_number": 970,
"balance": 19648,
"firstname": "Forbes",
"lastname": "Wallace",
"age": 28,
"gender": "M",
"address": "990 Mill Road",
"employer": "Pheast",
"email": "forbeswallace@pheast.com",
"city": "Lopezo",
"state": "AK"
}
}
]
}
}
# Filtering only
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"bool": {
"filter": {
"range": {
"balance": {
"gte": "10000",
"lte": "20000"
}
}
}
}
}
}
# Response
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 213,
"relation": "eq"
},
"max_score": 0.0,
"hits": [
{
"_index": "bank",
"_type": "account",
"_id": "20",
"_score": 0.0, # no score
"_source": {
"account_number": 20,
"balance": 16418,
"firstname": "Elinor",
"lastname": "Ratliff",
"age": 36,
"gender": "M",
"address": "282 Kings Place",
"employer": "Scentric",
"email": "elinorratliff@scentric.com",
"city": "Ribera",
"state": "WA"
}
},
// ...
{
"_index": "bank",
"_type": "account",
"_id": "272",
"_score": 0.0, # no score
"_source": {
"account_number": 272,
"balance": 19253,
"firstname": "Lilly",
"lastname": "Morgan",
"age": 25,
"gender": "F",
"address": "689 Fleet Street",
"employer": "Biolive",
"email": "lillymorgan@biolive.com",
"city": "Sunbury",
"state": "OH"
}
}
]
}
}
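How the four bool clauses interact can be sketched with plain predicates. This is a toy model under stated assumptions: `bool_search` is a hypothetical function, each matching must or should clause simply adds 1 to the score (the real scoring is more elaborate), and the three sample accounts are copied from the results shown earlier.

```python
def bool_search(docs, must=(), should=(), must_not=(), filters=()):
    """Return (score, doc) pairs, best score first.

    Simplified scoring assumption: each matching must/should clause adds 1;
    must_not and filters only include or exclude documents.
    """
    results = []
    for doc in docs:
        if not all(clause(doc) for clause in must):
            continue
        if not all(clause(doc) for clause in filters):
            continue
        if any(clause(doc) for clause in must_not):
            continue
        should_hits = sum(1 for clause in should if clause(doc))
        # With no must and no filter, at least one should clause has to match.
        if should and not must and not filters and should_hits == 0:
            continue
        results.append((len(must) + should_hits, doc))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


ACCOUNTS = [
    {"lastname": "Holland", "gender": "M", "age": 38, "balance": 45801, "address": "198 Mill Lane"},
    {"lastname": "Wallace", "gender": "M", "age": 28, "balance": 19648, "address": "990 Mill Road"},
    {"lastname": "Long", "gender": "F", "age": 32, "balance": 25571, "address": "288 Mill Street"},
]


def has_mill(doc):
    return "mill" in doc["address"].lower().split()


def is_male(doc):
    return doc["gender"] == "M"


def is_18(doc):
    return doc["age"] == 18


def is_wallace(doc):
    return doc["lastname"] == "Wallace"


def balance_in_range(doc):
    return 10000 <= doc["balance"] <= 20000


ranked = bool_search(ACCOUNTS, must=[has_mill, is_male], must_not=[is_18], should=[is_wallace])
filtered = bool_search(ACCOUNTS, filters=[balance_in_range])
print([(score, doc["lastname"]) for score, doc in ranked])
print([(score, doc["lastname"]) for score, doc in filtered])
```

In the first call the should clause does not remove Holland from the results, it only ranks Wallace higher; in the second call the only hit keeps a score of 0 because filter clauses never contribute to the score.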
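query/term: matching on non-text fields
Like query/match, term matches the value of a field, but use match for full-text fields and term for other, non-text fields, because ES analyzes text values into terms by default when it stores them.
aggs/aggName: aggregations
Aggregations provide the ability to group data and extract statistics from it; the simplest aggregations are roughly equivalent to SQL's group by and SQL aggregate functions.
In ES, a search returns hits and can return aggregation results at the same time, kept separate from the hits in the response. You can run a query and several aggregations and get each of their results back in a single request, which avoids extra network round trips through one concise API.
Basic syntax of an aggregation:
"aggs":{ # aggregations
"aggs_name":{ # name of the aggregation, shown in the result set
"AGG_TYPE":{} # type of the aggregation (avg, terms, and so on)
}
}
# terms shows the distribution of values: it groups identical values of the queried field and gives a count for each
# avg shows the average of the values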
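Find the age distribution and the average age of everyone whose address contains mill, without showing the details of these people:
GET http://192.168.56.56:9200/bank/_search
{
"query": { # find the documents that contain mill
"match": {
"address": "Mill"
}
},
"aggs": { # aggregate over the query results
"ageAgg": { # first aggregation; the name is arbitrary
"terms": { # distribution of values
"field": "age",
"size": 10
}
},
"ageAvg": { # second aggregation
"avg": { # average of age
"field": "age"
}
},
"balanceAvg": { # third aggregation
"avg": { # average of balance
"field": "balance"
}
}
},
"size": 0 # do not return the hits themselves
}
# Response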
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4, # 4 records matched
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"ageAgg": { # result of the ageAgg aggregation
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 38, # 2 records have age=38
"doc_count": 2
},
{
"key": 28,
"doc_count": 1
},
{
"key": 32,
"doc_count": 1
}
]
},
"ageAvg": {
"value": 34.0
},
"balanceAvg": {
"value": 25208.0
}
}
}
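What terms and avg compute can be reproduced over the four matching accounts in a few lines. The sketch below is illustrative only: `terms_agg` and `avg_agg` are hypothetical helpers, and the ages and balances are those of the four accounts whose address contains mill in the earlier multi_match result.

```python
from collections import Counter

# age and balance of the four accounts whose address contains "mill"
MILL_ACCOUNTS = [
    {"age": 28, "balance": 19648},
    {"age": 38, "balance": 45801},
    {"age": 38, "balance": 9812},
    {"age": 32, "balance": 25571},
]


def terms_agg(docs, field, size=10):
    """Count documents per distinct value, largest bucket first."""
    counts = Counter(doc[field] for doc in docs)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


def avg_agg(docs, field):
    """Average of a numeric field over all documents."""
    values = [doc[field] for doc in docs]
    return {"value": sum(values) / len(values)}


print(terms_agg(MILL_ACCOUNTS, "age"))
print(avg_agg(MILL_ACCOUNTS, "age"))      # {'value': 34.0}
print(avg_agg(MILL_ACCOUNTS, "balance"))  # {'value': 25208.0}
```

The buckets and both averages agree with the aggregations section of the response above.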
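aggs/aggName/aggs/aggName: sub-aggregations
Aggregate by age and compute the average balance of the people in each age group:
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"match_all": {}
},
"aggs": {
"ageAgg": {
"terms": { # distribution of values
"field": "age",
"size": 100
},
"aggs": { # sibling of terms: a sub-aggregation computed for each bucket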
"ageAvg": {
"avg": {
"field": "balance"
}
}
}
}
},
"size": 0
}
# Response
{
"took": 60,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1000,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"ageAgg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 31,
"doc_count": 61,
"ageAvg": {
"value": 28312.918032786885
}
},
// ...
{
"key": 29,
"doc_count": 35,
"ageAvg": {
"value": 29483.14285714286
}
}
]
}
}
}
Get the full age distribution and, within each age group, the average balance of M, the average balance of F, and the overall average balance of the group:
GET http://192.168.56.56:9200/bank/_search
{
"query": {
"match_all": {}
},
"aggs": {
"ageAgg": {
"terms": { # distribution of age
"field": "age",
"size": 100
},
"aggs": { # sub-aggregations
"genderAgg": {
"terms": { # distribution of gender
"field": "gender.keyword" # use the .keyword sub-field
},
"aggs": {
"balanceAvg": {
"avg": {
"field": "balance"
}
}
}
},
"ageBalanceAvg": {
"avg": {
"field": "balance"
}
}
}
}
},
"size": 0
}
# Response
{
"took": 82,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1000,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"ageAgg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 31,
"doc_count": 61,
"genderAgg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "M",
"doc_count": 35,
"balanceAvg": {
"value": 29565.628571428573
}
},
{
"key": "F",
"doc_count": 26,
"balanceAvg": {
"value": 26626.576923076922
}
}
]
},
"ageBalanceAvg": {
"value": 28312.918032786885
}
},
// ...
{
"key": 29,
"doc_count": 35,
"genderAgg": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "M",
"doc_count": 23,
"balanceAvg": {
"value": 29943.17391304348
}
},
{
"key": "F",
"doc_count": 12,
"balanceAvg": {
"value": 28601.416666666668
}
}
]
},
"ageBalanceAvg": {
"value": 29483.14285714286
}
}
]
}
}
}
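nested: aggregating on nested objects
Reference: Using embedded objects of the nested type in Elasticsearch
Mapping: field mappings
Mapping is the process of defining how a document and the fields it contains are stored and indexed. Every document is a collection of fields, and every field has its own data type. When mapping data, you create a mapping definition that contains the list of fields relevant to the document.
Field types
Core types:
- Strings
  - text: used for full-text indexing; at search time the query is analyzed into terms before matching
  - keyword: not analyzed; at search time the whole value must match exactly
- Numeric types
  - Integer: byte, short, integer, long
  - Floating point: float, half_float, scaled_float, double
- Date type
- Boolean type
- Binary type
Complex types:
- Array type
- Object type
- Nested type
Geo types:
- Geo point
- Geo shape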
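Viewing a Mapping
A mapping is used to define:
- which string fields should be treated as full text fields;
- which fields contain numbers, dates, or geographic locations;
- whether all fields in the document can be indexed (the all setting);
- the format of dates;
- custom mapping rules for dynamically added fields;
# View the mapping
GET /bank/_mapping
{
"bank" : {
"mappings" : {
"properties" : {
"account_number" : {
"type" : "long" # long type
},
"address" : {
"type" : "text", # text type: searched as full text, matched term by term
"fields" : {
"keyword" : {
"type" : "keyword", # exact match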
"ignore_above" : 256
}
}
},
"age" : {
"type" : "long"
},
"balance" : {
"type" : "long"
},
"city" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"employer" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"firstname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"gender" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"lastname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"state" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
Creating a Mapping
# Create the mapping
PUT /my_index
{
"mappings": {
"properties": {
"age": {
"type": "integer"
},
"email": {
"type": "keyword"
},
"name": {
"type": "text"
}
}
}
}
# Output
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "my_index"
}
# View the mapping
GET /my_index
# Output
{
"my_index" : {
"aliases" : { },
"mappings" : {
"properties" : {
"age" : {
"type" : "integer"
},
"email" : {
"type" : "keyword"
},
"name" : {
"type" : "text"
}
}
},
"settings" : {
"index" : {
"creation_date" : "1617960990447",
"number_of_shards" : "1",
"number_of_replicas" : "1",
"uuid" : "KgYd5GOPR0uc5kEbUCeBDg",
"version" : {
"created" : "7080099"
},
"provided_name" : "my_index"
}
}
}
}
# Add a mapping for a new field
PUT /my_index/_mapping
{
"properties": {
"employee-id": {
"type": "keyword",
"index": false # the field cannot be searched
}
}
}
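Updating a Mapping
An existing field mapping cannot be updated, because changing an existing field could invalidate data that has already been indexed. To change it, create a new index and migrate the data. The steps are:
# Create the new index first, then migrate the data
# Syntax for 6.0 and later
POST _reindex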
{
"source":{
"index":"old_index"
},
"dest":{
"index":"new_index"
}
}
# Older syntax, with a type on the source
POST _reindex
{
"source":{
"index":"old_index",
"type":"old_type"
},
"dest":{
"index":"new_index"
}
}
Example: the bank index originally used the type account. Newer versions have no types, so we drop it by migrating the data to a new index.
GET /bank/_search
# Output
{
"took" : 19,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "bank",
"_type" : "account", # has a type
"_id" : "1",
"_score" : 1.0,
"_source" : {
"account_number" : 1,
"balance" : 39225,
"firstname" : "Amber",
"lastname" : "Duke",
"age" : 32,
"gender" : "M",
"address" : "880 Holmes Lane",
"employer" : "Pyrami",
"email" : "amberduke@pyrami.com",
"city" : "Brogan",
"state" : "IL"
}
},
// ...
]
}
}
# Create the new index first
PUT /newbank
{
"mappings": {
"properties": {
"account_number": {
"type": "long"
},
"address": {
"type": "text"
},
"age": {
"type": "integer"
},
"balance": {
"type": "long"
},
"city": {
"type": "keyword"
},
"email": {
"type": "keyword"
},
"employer": {
"type": "keyword"
},
"firstname": {
"type": "text"
},
"gender": {
"type": "keyword"
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"state": {
"type": "keyword"
}
}
}
}
# View the new mapping
GET /newbank/_mapping
# Output
{
"newbank" : {
"mappings" : {
"properties" : {
"account_number" : {
"type" : "long"
},
"address" : {
"type" : "text"
},
"age" : {
"type" : "integer" # changed from long to integer
},
"balance" : {
"type" : "long"
},
"city" : {
"type" : "keyword"
},
"email" : {
"type" : "keyword"
},
"employer" : {
"type" : "keyword"
},
"firstname" : {
"type" : "text"
},
"gender" : {
"type" : "keyword"
},
"lastname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"state" : {
"type" : "keyword"
}
}
}
}
# Migrate the data
POST _reindex
{
"source": {
"index": "bank",
"type": "account"
},
"dest": {
"index": "newbank"
}
}
# Output
#! Deprecation: [types removal] Specifying types in reindex requests is deprecated.
{
"took" : 918,
"timed_out" : false,
"total" : 1000,
"updated" : 0,
"created" : 1000,
"deleted" : 0,
"batches" : 1,
"version_conflicts" : 0,
"noops" : 0,
"retries" : {
"bulk" : 0,
"search" : 0
},
"throttled_millis" : 0,
"requests_per_second" : -1.0,
"throttled_until_millis" : 0,
"failures" : [ ]
}
# View newbank
GET /newbank/_search
{
"took" : 511,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "newbank",
"_type" : "_doc", # the account type is gone; only the default _doc remains
"_id" : "1",
"_score" : 1.0,
"_source" : {
"account_number" : 1,
"balance" : 39225,
"firstname" : "Amber",
"lastname" : "Duke",
"age" : 32,
"gender" : "M",
"address" : "880 Holmes Lane",
"employer" : "Pyrami",
"email" : "amberduke@pyrami.com",
"city" : "Brogan",
"state" : "IL"
}
},
// ...
]
}
}
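Analysis
A tokenizer receives a stream of characters, splits it into individual tokens (usually single words), and outputs a stream of tokens.
For example, the whitespace tokenizer splits text whenever it meets a whitespace character. It turns the text "Quick brown fox!" into [Quick, brown, fox!].
The tokenizer also records the order or position of each term (used for phrase and word proximity queries), as well as the start and end character offsets of the original word that each term represents (used to highlight search results).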
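The bookkeeping described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's or Elasticsearch's implementation; WhitespaceTokenizerSketch and Token are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration class; not part of Elasticsearch or Lucene. */
class WhitespaceTokenizerSketch {

    /** One token plus the bookkeeping a tokenizer records for it. */
    static final class Token {
        final String term;
        final int position;    // ordinal of the token in the stream
        final int startOffset; // index of the first character in the original text
        final int endOffset;   // index just past the last character

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + "[pos=" + position + ", " + startOffset + "-" + endOffset + "]";
        }
    }

    /** Splits text on whitespace, keeping each token's position and character offsets. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip the whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            // Consume characters up to the next whitespace.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                tokens.add(new Token(text.substring(start, i), position++, start, i));
            }
        }
        return tokens;
    }
}
```

Calling tokenize("Quick brown fox!") yields three tokens; the offsets are what a highlighter would use to mark the matched words in the original text.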
Elasticsearch provides many built-in tokenizers, which can be used to build custom analyzers. See the official Elasticsearch documentation for more.
Using the standard analyzer:
POST _analyze
{
"analyzer": "standard",
"text": "The 2 Brown-Foxes bone."
}
# Output
{
"tokens" : [
{
"token" : "the",
"start_offset" : 0,
"end_offset" : 3,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "2",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<NUM>",
"position" : 1
},
{
"token" : "brown",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "foxes",
"start_offset" : 12,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "bone",
"start_offset" : 18,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 4
}
]
}
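By default, text in every language is analyzed with the Standard Analyzer, which does not handle Chinese well. A Chinese analyzer therefore needs to be installed; elasticsearch-analysis-ik is recommended.
Installing the ik analyzer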
-
Check the ES version
http://192.168.56.56:9200/
{
  "name": "0f6d6c60bc96",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "sDTdW7KnQayVrFC5ioijiQ",
  "version": {
    "number": "7.8.0", # the ES version is 7.8.0
    "build_flavor": "default",
    "build_type": "docker",
    "build_hash": "757314695644ea9a1dc2fecd26d1a43856725e65",
    "build_date": "2020-06-14T19:35:50.234439Z",
    "build_snapshot": false,
    "lucene_version": "8.5.1",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}
-
Because the paths were mounted when ES was installed with Docker, go straight to the ES plugins directory on the host
cd /docker/elasticsearch7.8.0/plugins
# Install wget
yum install wget
# Install unzip
yum install unzip
# Download the ik archive (its version should match the ES version)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.8.0/elasticsearch-analysis-ik-7.8.0.zip
# Unpack it into the ik directory
unzip elasticsearch-analysis-ik-7.8.0.zip -d ik
# Change permissions
chmod -R 777 ik
# Delete the ik archive
rm -rf elasticsearch-analysis-ik-7.8.0.zip
# Restart ES
docker restart elasticsearch7.8.0
Testing the analyzers
# Using the default analyzer
GET _analyze
{
"text":"我是中国人"
}
# Output
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "中",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "国",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "人",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}
]
}
# Using ik
GET _analyze
{
"analyzer": "ik_smart",
"text":"我是中国人"
}
# Output
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
}
]
}
GET _analyze
{
"analyzer": "ik_max_word",
"text":"我是中国人"
}
# Output
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "中国人",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "中国",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "国人",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 4
}
]
}
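The difference between the two outputs above (ik_smart keeps the coarsest split, ik_max_word lists every word it can find) can be illustrated with a toy dictionary-based segmenter. This is a simplified stand-in, not IK's actual algorithm; the class name, method names, and the three-word dictionary used with it are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration; IK's real algorithm and dictionaries are more involved. */
class DictionarySegmenterSketch {

    /** Coarse mode: at each position take the longest dictionary word, else one character. */
    static List<String> longestMatch(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // fall back to a single character
            for (int j = text.length(); j > i + 1; j--) {
                if (dict.contains(text.substring(i, j))) {
                    end = j;
                    break;
                }
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    /** Fine mode: emit every dictionary word found, longest first at each start position. */
    static List<String> allMatches(String text, Set<String> dict) {
        List<String> out = new ArrayList<>();
        int covered = 0; // end of the furthest word emitted so far
        for (int i = 0; i < text.length(); i++) {
            boolean matched = false;
            for (int j = text.length(); j > i + 1; j--) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                    covered = Math.max(covered, j);
                    matched = true;
                }
            }
            // A character that no emitted word covers is emitted on its own.
            if (!matched && i >= covered) {
                out.add(text.substring(i, i + 1));
                covered = i + 1;
            }
        }
        return out;
    }
}
```

With the dictionary {中国, 中国人, 国人} and the text 我是中国人, longestMatch returns the three tokens of the ik_smart output and allMatches returns the five tokens of the ik_max_word output.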
Custom dictionary
-
In the html folder of the directory mounted for Nginx, create an es folder to hold ES-related data
mkdir es
-
Create a fenci.txt file to hold the custom words
cd es/
# Add custom words such as 高富帅 and 刘德华子, one per line
vi fenci.txt
# Then check that the file is served at http://192.168.56.56/es/fenci.txt
-
Edit IKAnalyzer.cfg.xml in plugins/ik/config
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Local extension dictionary -->
    <entry key="ext_dict"></entry>
    <!-- Local extension stopword dictionary -->
    <entry key="ext_stopwords"></entry>
    <!-- Remote extension dictionary -->
    <entry key="remote_ext_dict">http://192.168.56.56/es/fenci.txt</entry>
    <!-- Remote extension stopword dictionary -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
# Restart ES
docker restart elasticsearch7.8.0
Note: after the dictionary is updated, ES applies it only to newly indexed data; existing data is not re-analyzed. To re-analyze existing data, run POST my_index/_update_by_query?conflicts=proceed
Test:
GET _analyze
{
"analyzer": "ik_smart",
"text":"我是高富帅刘德华子"
}
# Output
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "CN_CHAR",
"position" : 0
},
{
"token" : "是",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_CHAR",
"position" : 1
},
{
"token" : "高富帅",
"start_offset" : 2,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "刘德华子",
"start_offset" : 5,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 3
}
]
}
ES REST CLIENT
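There are two ways to work with ES from Java:
- Over TCP, through port 9300
  - Uses spring-data-elasticsearch:transport-api.jar
  - The transport-api.jar version depends on the Spring Boot version, so it may not match the ES version
  - Discouraged since 7.x and due to be dropped from 8 onwards
  - For details, see: Java API (deprecated)
- Over HTTP, through port 9200
  - JestClient: unofficial and slow to update
  - HttpClient, RestTemplate: send raw HTTP requests; many ES operations have to be wrapped by hand, which is tedious
  - Elasticsearch-Rest-Client: the official RestClient; it wraps ES operations, has a clearly layered API, and is easy to pick up. Recommended
    - For details on Elasticsearch-Rest-Client, see: Java REST Client. Use the Java High Level REST Client; it relates to the Java Low Level REST Client much as MyBatis relates to JDBC.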
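To make the "wrap it by hand" point concrete, here is a minimal sketch of a search sent as plain HTTP with the JDK's own java.net.http client (Java 11 or later). PlainHttpSearchSketch and its two helper methods are hypothetical names, and the JSON body is assembled with naive string concatenation and no escaping, which is one of the chores the official client takes care of.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper showing what talking to port 9200 by hand involves. */
class PlainHttpSearchSketch {

    /** Builds a match query body by hand; no escaping, so only safe for simple values. */
    static String matchQuery(String field, String value) {
        return "{\"query\":{\"match\":{\"" + field + "\":\"" + value + "\"}}}";
    }

    /** Builds (but does not send) a search request against the given index. */
    static HttpRequest searchRequest(String baseUrl, String index, String body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```

The request would then be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), and the response comes back as raw JSON text, so extracting hits and aggregations is again left to the caller.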
Integrating ES with Spring Boot
-
Create a Spring Boot project and select the Web dependency, but do not select the ES dependency
-
Add the dependencies
<!-- ES REST API -->
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.8.0</version>
</dependency>
<!-- spring-boot-dependencies pins the ES version to 6.8.5, so override it -->
<properties>
    <java.version>1.8</java.version>
    <spring-cloud.version>Hoxton.SR8</spring-cloud.version>
    <elasticsearch.version>7.8.0</elasticsearch.version>
</properties>
-
Write the Elasticsearch configuration class
package cn.parzulpan.shopping.search.config;

import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * @author parzulpan
 * @version 1.0
 * @date 2021-04
 * @project shopping
 * @package cn.parzulpan.shopping.search.config
 * @desc Elasticsearch configuration class
 */
@Configuration
public class ShoppingElasticsearchConfig {

    // Request options. For example, if ES has access rules that require a security
    // header on every request, that header can be set through RequestOptions.
    // The official docs recommend sharing a single RequestOptions instance.
    public static final RequestOptions COMMON_OPTIONS;

    static {
        RequestOptions.Builder builder = RequestOptions.DEFAULT.toBuilder();
        COMMON_OPTIONS = builder.build();
    }

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        // Several ES nodes can be listed here
        RestClientBuilder builder = RestClient.builder(new HttpHost("192.168.56.56", 9200, "http"));
        return new RestHighLevelClient(builder);
    }
}
-
Test the client bean
package cn.parzulpan.shopping.search;

import org.elasticsearch.client.RestHighLevelClient;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

@SpringBootTest
class ShoppingSearchApplicationTests {

    @Autowired
    RestHighLevelClient client;

    @Test
    void contextLoads() {
    }

    // Prints the injected client to confirm that the bean was created
    @Test
    void testRestClient() {
        System.out.println(client);
    }
}
-
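Save data
// Additional imports needed in the test class for this snippet
import com.google.gson.Gson;
import lombok.Data;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.common.xcontent.XContentType;
import java.io.IOException;

@Data
class User {
    private String userName;
    private Integer age;
    private String gender;
}

/**
 * https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.x/java-rest-high-create-index.html
 * Data can be saved synchronously or asynchronously
 */
@Test
void indexData() throws IOException {
    // Target index and document id
    IndexRequest users = new IndexRequest("users");
    users.id("1");

    // Set the content to save, giving the data and its content type
    // Option 1: key-value pairs
    // users.source("userName", "zhang", "age", 18, "gender", "男");
    // Option 2: a JSON string
    User user = new User();
    user.setUserName("wang");
    user.setAge(20);
    user.setGender("女");
    Gson gson = new Gson();
    String userJson = gson.toJson(user);
    users.source(userJson, XContentType.JSON);

    // Execute the request: creates the index and saves the document
    IndexResponse index = client.index(users, ShoppingElasticsearchConfig.COMMON_OPTIONS);
    System.out.println(index);
}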
-
Retrieve data
// Additional imports needed in the test class for this snippet
import com.google.gson.Gson;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.Aggregations;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
import org.elasticsearch.search.aggregations.metrics.Avg;
import org.elasticsearch.search.aggregations.metrics.AvgAggregationBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import java.io.IOException;

/**
 * Retrieving data from ES
 * https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.x/java-rest-high-search.html
 * Find the age distribution and the average age of everyone whose address contains mill
 * GET /bank/_search
 * {
 *   "query": {            # match documents that contain mill
 *     "match": {
 *       "address": "Mill"
 *     }
 *   },
 *   "aggs": {             # aggregations over the query results
 *     "ageAgg": {         # first aggregation; the name is arbitrary
 *       "terms": {        # distribution of the values
 *         "field": "age",
 *         "size": 10
 *       }
 *     },
 *     "ageAvg": {         # second aggregation
 *       "avg": {          # average of age
 *         "field": "age"
 *       }
 *     },
 *     "balanceAvg": {     # third aggregation
 *       "avg": {          # average of balance
 *         "field": "balance"
 *       }
 *     }
 *   },
 *   "size": 0             # do not return the hits themselves
 * }
 */
@Test
void find() throws IOException {
    // 1. Create the search request
    SearchRequest searchRequest = new SearchRequest();
    searchRequest.indices("bank");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();

    // Build the search conditions; the builder offers
    // query(), from(), size(), aggregation(), and so on
    searchSourceBuilder.query(QueryBuilders.matchQuery("address", "mill"));

    // First aggregation: distribution of age values
    TermsAggregationBuilder ageAgg = AggregationBuilders.terms("ageAgg").field("age").size(10);
    searchSourceBuilder.aggregation(ageAgg);

    // Second aggregation: average of age
    AvgAggregationBuilder ageAvg = AggregationBuilders.avg("ageAvg").field("age");
    searchSourceBuilder.aggregation(ageAvg);

    // Third aggregation: average of balance
    AvgAggregationBuilder balanceAvg = AggregationBuilders.avg("balanceAvg").field("balance");
    searchSourceBuilder.aggregation(balanceAvg);

    // Uncomment to skip the hits, as in the request above
    // searchSourceBuilder.size(0);

    System.out.println("searchSourceBuilder " + searchSourceBuilder.toString());
    searchRequest.source(searchSourceBuilder);

    // 2. Execute the search
    SearchResponse response = client.search(searchRequest, ShoppingElasticsearchConfig.COMMON_OPTIONS);

    // 3. Analyze the response
    System.out.println("response " + response.toString());

    // 3.1 Convert the hits into beans
    // Account is a POJO matching the _source fields of bank; it is not shown in this section
    SearchHits hits = response.getHits();
    SearchHit[] hits1 = hits.getHits();
    Gson gson = new Gson();
    for (SearchHit hit : hits1) {
        System.out.println("id: " + hit.getId());
        System.out.println("index: " + hit.getIndex());
        String sourceAsString = hit.getSourceAsString();
        System.out.println("sourceAsString: " + sourceAsString);
        System.out.println("Account: " + gson.fromJson(sourceAsString, Account.class));
    }

    // 3.2 Read the aggregation results
    Aggregations aggregations = response.getAggregations();
    Terms ageAgg1 = aggregations.get("ageAgg");
    for (Terms.Bucket bucket : ageAgg1.getBuckets()) {
        System.out.println("ageAgg: " + bucket.getKeyAsString() + " => " + bucket.getDocCount());
    }
    Avg ageAvg1 = aggregations.get("ageAvg");
    System.out.println("ageAvg: " + ageAvg1.getValue());
    Avg balanceAvg1 = aggregations.get("balanceAvg");
    System.out.println("balanceAvg: " + balanceAvg1.getValue());
}