Elasticsearch: multiple ways to search

    Reposted from: https://blog.csdn.net/wuzhiwei549/article/details/80362147

    query string search
    Search all products: GET /ecommerce/product/_search

    took: how many milliseconds the search took
    timed_out: whether the search timed out (here it did not)
    _shards: the index is split into 5 shards, so the search request is fanned out to every primary shard (or one of its replica shards)
    hits.total: the number of matching documents, 3 here
    hits.max_score: the relevance score of a document for this search; the more relevant a document is, the higher its score
    hits.hits: the detailed data of the matching documents
    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "2",
            "_score": 1,
            "_source": {
              "name": "jiajieshi yagao",
              "desc": "youxiao fangzhu",
              "price": 25,
              "producer": "jiajieshi producer",
              "tags": [
                "fangzhu"
              ]
            }
          },
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "1",
            "_score": 1,
            "_source": {
              "name": "gaolujie yagao",
              "desc": "gaoxiao meibai",
              "price": 30,
              "producer": "gaolujie producer",
              "tags": [
                "meibai",
                "fangzhu"
              ]
            }
          },
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "3",
            "_score": 1,
            "_source": {
              "name": "zhonghua yagao",
              "desc": "caoben zhiwu",
              "price": 40,
              "producer": "zhonghua producer",
              "tags": [
                "qingxin"
              ]
            }
          }
        ]
      }
    }
    query string search gets its name from the fact that all search parameters are passed in the HTTP request's query string.

    Search for products whose name contains yagao, sorted by price descending: GET /ecommerce/product/_search?q=name:yagao&sort=price:desc

    This style is handy for ad-hoc use on the command line with tools such as curl, to fire off a quick request and pull back the information you want; but complex queries are very hard to express this way,
    so query string search is almost never used in production.
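    For illustration, a minimal curl sketch of the same request (assuming a node listening on localhost:9200; the URL is quoted so the shell does not interpret the &):

    curl -XGET 'http://localhost:9200/ecommerce/product/_search?q=name:yagao&sort=price:desc&pretty'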

    query DSL
    DSL: Domain Specific Language
    The query is built as JSON in the HTTP request body, which is much more convenient and can express arbitrarily complex queries; it is far more powerful than query string search.

    Query all products:
    GET /ecommerce/product/_search
    {
      "query": { "match_all": {} }
    }
    Query products whose name contains yagao, sorted by price descending:
    GET /ecommerce/product/_search
    {
      "query": {
        "match": {
          "name": "yagao"
        }
      },
      "sort": [
        { "price": "desc" }
      ]
    }
    Paginate through the products: there are 3 products in total, so with 1 product per page, requesting page 2 returns the second product.
    GET /ecommerce/product/_search
    {
      "query": { "match_all": {} },
      "from": 1,
      "size": 1
    }
    Return only the name and price of each product:
    GET /ecommerce/product/_search
    {
      "query": { "match_all": {} },
      "_source": ["name", "price"]
    }
    Query DSL is much better suited to production use, because it can express complex queries.
    multi match
    Find documents where either the test_field or the test_field1 field contains test:

    GET /test_index/test_type/_search
    {
      "query": {
        "multi_match": {
          "query": "test",
          "fields": ["test_field", "test_field1"]
        }
      }
    }

    bool
    Use bool to combine multiple search clauses:

    GET /ecommerce/product/_search
    {
      "query": {
        "bool": {
          "must": { "match": { "name": "gaolujie" }},
          "must_not": { "match": { "name": "jiajieshi" }},
          "should": [
            { "match": { "title": "gaolujie" }},
            { "match": { "title": "lengsuanling" }}
          ]
        }
      }
    }
    The second step in controlling result precision: require that at least 50% of the query terms match before a document is returned as a result.

    GET /ecommerce/product/_search
    {
      "query": {
        "match": {
          "title": {
            "query": "gaolujie zhonghua yagao",
            "minimum_should_match": "50%"
          }
        }
      }
    }
    query filter
    Search for products whose name contains yagao and whose price is greater than 25:

    GET /ecommerce/product/_search
    {
      "query": {
        "bool": {
          "must": {
            "match": {
              "name": "yagao"
            }
          },
          "filter": {
            "range": {
              "price": { "gt": 25 }
            }
          }
        }
      }
    }
    full-text search
    GET /ecommerce/product/_search
    {
      "query": {
        "match": {
          "producer": "yagao producer"
        }
      }
    }
    The producer field is broken into terms at index time, and those terms are used to build an inverted index:

    special     4
    yagao       4
    producer    1,2,3,4
    gaolujie    1
    zhonghua    3
    jiajieshi   2

    The query string "yagao producer" is analyzed the same way, into the terms yagao and producer.
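    You can check how a piece of text is tokenized with the _analyze API (a quick sketch, assuming the default standard analyzer is in effect for this field):

    GET /_analyze
    {
      "analyzer": "standard",
      "text": "yagao producer"
    }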

    Every document's producer field contains at least one of these terms, so all four documents are returned; the document matching both terms scores highest:

    {
      "took": 4,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 4,
        "max_score": 0.70293105,
        "hits": [
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "4",
            "_score": 0.70293105,
            "_source": {
              "name": "special yagao",
              "desc": "special meibai",
              "price": 50,
              "producer": "special yagao producer",
              "tags": [
                "meibai"
              ]
            }
          },
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "1",
            "_score": 0.25811607,
            "_source": {
              "name": "gaolujie yagao",
              "desc": "gaoxiao meibai",
              "price": 30,
              "producer": "gaolujie producer",
              "tags": [
                "meibai",
                "fangzhu"
              ]
            }
          },
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "3",
            "_score": 0.25811607,
            "_source": {
              "name": "zhonghua yagao",
              "desc": "caoben zhiwu",
              "price": 40,
              "producer": "zhonghua producer",
              "tags": [
                "qingxin"
              ]
            }
          },
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "2",
            "_score": 0.1805489,
            "_source": {
              "name": "jiajieshi yagao",
              "desc": "youxiao fangzhu",
              "price": 25,
              "producer": "jiajieshi producer",
              "tags": [
                "fangzhu"
              ]
            }
          }
        ]
      }
    }
    The first step in controlling result precision: make flexible use of the and operator. If you want every search term to match, use "operator": "and"; this achieves something a plain match query cannot.
    GET /ecommerce/product/_search
    {
      "query": {
        "match": {
          "title": {
            "query": "java elasticsearch",
            "operator": "and"
          }
        }
      }
    }
    If you sort on an analyzed string field, the results are often not what you expect: the field has been split into multiple terms, and sorting is applied to those terms rather than to the whole value.
    The usual solution is to index the string field twice: one analyzed version for searching, and one not-analyzed version for sorting (covered in a later chapter; a mapping sketch follows after the next block).
    The match query with "operator": "and" above is equivalent to the following bool query:

    {
      "bool": {
        "must": [
          { "term": { "title": "java" }},
          { "term": { "title": "elasticsearch" }}
        ]
      }
    }
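    As promised above, a minimal sketch of the "index it twice" approach for sorting (assumptions: an ES 5.x/6.x-style mapping, a hypothetical new index named ecommerce_v2, and raw as an arbitrary name for the not-analyzed sub-field):

    PUT /ecommerce_v2
    {
      "mappings": {
        "product": {
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "raw": { "type": "keyword" }
              }
            }
          }
        }
      }
    }

    GET /ecommerce_v2/product/_search
    {
      "query": { "match": { "name": "yagao" } },
      "sort": [ { "name.raw": "asc" } ]
    }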
    phrase search
    Phrase search is the counterpart of full-text search. Full-text search breaks the query string apart and looks each resulting term up in the inverted index; matching any one of those terms is enough for a document to be returned.
    Phrase search requires the field text to contain the exact query string, as-is, before a document counts as a match and is returned.
    GET /ecommerce/product/_search
    {
      "query": {
        "match_phrase": {
          "producer": "yagao producer"
        }
      }
    }

    Only the document whose producer field contains the exact phrase "yagao producer" is returned:

    {
      "took": 11,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.70293105,
        "hits": [
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "4",
            "_score": 0.70293105,
            "_source": {
              "name": "special yagao",
              "desc": "special meibai",
              "price": 50,
              "producer": "special yagao producer",
              "tags": [
                "meibai"
              ]
            }
          }
        ]
      }
    }
    proximity match
    slop is the number of moves that the terms of the query string must make before they line up with a document.

    Take the document text: hello world, java is very good, spark is also very good.
    A match_phrase for "java spark" finds nothing.
    If we specify a slop, the terms java and spark are allowed to move, to try to match the doc:

    java is very good spark is
    java spark
    java --> spark   (move 1)
    java --> spark   (move 2)
    java --> spark   (move 3)

    Here slop is 3: after spark has moved 3 positions, the phrase "java spark" lines up with the doc.
    More precisely, slop is not how many moves the query terms actually make to match a doc, but the maximum number of moves they are allowed to make while trying to match it.
    With slop set to 3, the query below matches.

    GET /forum/article/_search
    {
      "query": {
        "match_phrase": {
          "title": {
            "query": "java spark",
            "slop": 3
          }
        }
      }
    }
    In fact, a phrase match with slop is exactly what proximity match (approximate matching) means.
    1. "java spark" must appear in the doc as an exact phrase: phrase match.
    2. "java spark" may appear with some distance between the terms, but the closer they are, the higher the doc ranks: proximity match.

    highlight search

    Highlight the matched terms in the search results with <em></em> tags:
    GET /ecommerce/product/_search
    {
      "query": {
        "match": {
          "producer": "producer"
        }
      },
      "highlight": {
        "fields": {
          "producer": {}
        }
      }
    }
    Each hit now carries a highlight section:
    {
      "took": 6,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 0.51623213,
        "hits": [
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "3",
            "_score": 0.51623213,
            "_source": {
              "name": "zhonghua yagao",
              "desc": "caoben zhiwu",
              "price": 40,
              "producer": "zhonghua producer",
              "tags": [
                "qingxin"
              ]
            },
            "highlight": {
              "producer": [
                "<em>zhonghua</em> <em>producer</em>"
              ]
            }
          },
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "2",
            "_score": 0.25811607,
            "_source": {
              "name": "jiajieshi yagao",
              "desc": "youxiao fangzhu",
              "price": 25,
              "producer": "jiajieshi producer",
              "tags": [
                "fangzhu"
              ]
            },
            "highlight": {
              "producer": [
                "jiajieshi <em>producer</em>"
              ]
            }
          },
          {
            "_index": "ecommerce",
            "_type": "product",
            "_id": "1",
            "_score": 0.25811607,
            "_source": {
              "name": "gaolujie yagao",
              "desc": "gaoxiao meibai",
              "price": 30,
              "producer": "gaolujie producer",
              "tags": [
                "meibai",
                "fangzhu"
              ]
            },
            "highlight": {
              "producer": [
                "gaolujie <em>producer</em>"
              ]
            }
          }
        ]
      }
    }

    mget (batch retrieval)
    1. Why batch retrieval helps

    Fetching documents one at a time means that retrieving, say, 100 documents takes 100 network round trips, which is a lot of overhead.
    With a batch query, those 100 documents come back in a single network request, cutting the network overhead roughly a hundredfold.

    2. mget syntax

    (1) Querying documents one at a time

    GET /test_index/test_type/1
    GET /test_index/test_type/2

    (2) Batch retrieval with mget

    GET /_mget
    {
      "docs": [
        {
          "_index": "test_index",
          "_type": "test_type",
          "_id": 1
        },
        {
          "_index": "test_index",
          "_type": "test_type",
          "_id": 2
        }
      ]
    }

    The response contains both documents:

    {
      "docs": [
        {
          "_index": "test_index",
          "_type": "test_type",
          "_id": "1",
          "_version": 2,
          "found": true,
          "_source": {
            "test_field1": "test field1",
            "test_field2": "test field2"
          }
        },
        {
          "_index": "test_index",
          "_type": "test_type",
          "_id": "2",
          "_version": 1,
          "found": true,
          "_source": {
            "test_content": "my test"
          }
        }
      ]
    }

    (3) If the documents you are querying are in different types under the same index, the index can go in the URL:

    GET /test_index/_mget
    {
      "docs": [
        {
          "_type": "test_type",
          "_id": 1
        },
        {
          "_type": "test_type",
          "_id": 2
        }
      ]
    }
    (4) If all the documents are in the same index and the same type, it is simplest of all:

    GET /test_index/test_type/_mget
    {
      "ids": [1, 2]
    }
    3. Why mget matters

    mget matters a great deal: whenever a query needs to fetch several documents at once, always use the batch API.
    Reducing the number of network round trips can improve performance several-fold, even by an order of magnitude, so it is very much worth doing.

    bulk syntax
    POST /_bulk
    { "delete": { "_index": "test_index", "_type": "test_type", "_id": "3" }}
    { "create": { "_index": "test_index", "_type": "test_type", "_id": "12" }}
    { "test_field": "test12" }
    { "index": { "_index": "test_index", "_type": "test_type", "_id": "2" }}
    { "test_field": "replaced test2" }
    { "update": { "_index": "test_index", "_type": "test_type", "_id": "1", "_retry_on_conflict" : 3} }
    { "doc" : {"test_field2" : "bulk test1"} }
    Each operation takes two JSON documents, with the following structure:

    {"action": {"metadata"}}
    {"data"}

    For example, creating a document via bulk looks like this:

    {"index": {"_index": "test_index", "_type": "test_type", "_id": "1"}}
    {"test_field1": "test1", "test_field2": "test2"}

    Which operation types can be executed?
    (1) delete: deletes a document; it needs only one JSON line
    (2) create: the equivalent of PUT /index/type/id/_create, a forced create (it fails if the document already exists)
    (3) index: a normal PUT; either creates the document or fully replaces it
    (4) update: performs a partial update

    The bulk API is strict about the JSON format: each JSON document must sit on a single line with no line breaks inside it, and consecutive JSON documents must be separated by a newline.
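    A sketch of submitting such a request with curl (assumptions: the operation lines are saved, one per line with a trailing newline, in a local file named requests.ndjson; --data-binary is used because plain -d would strip the newlines):

    curl -s -XPOST 'http://localhost:9200/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary @requests.ndjson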
    If the body is malformed (for example, a JSON document spread over several lines), the request fails with an error like this:
    {
      "error": {
        "root_cause": [
          {
            "type": "json_e_o_f_exception",
            "reason": "Unexpected end-of-input: expected close marker for Object (start marker at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@5a5932cd; line: 1, column: 1]) at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@5a5932cd; line: 1, column: 3]"
          }
        ],
        "type": "json_e_o_f_exception",
        "reason": "Unexpected end-of-input: expected close marker for Object (start marker at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@5a5932cd; line: 1, column: 1]) at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@5a5932cd; line: 1, column: 3]"
      },
      "status": 500
    }

    A bulk response reports one result item per operation, in the same order as the request:

    {
      "took": 41,
      "errors": true,
      "items": [
        {
          "delete": {
            "found": true,
            "_index": "test_index",
            "_type": "test_type",
            "_id": "10",
            "_version": 3,
            "result": "deleted",
            "_shards": {
              "total": 2,
              "successful": 1,
              "failed": 0
            },
            "status": 200
          }
        },
        {
          "create": {
            "_index": "test_index",
            "_type": "test_type",
            "_id": "3",
            "_version": 1,
            "result": "created",
            "_shards": {
              "total": 2,
              "successful": 1,
              "failed": 0
            },
            "created": true,
            "status": 201
          }
        },
        {
          "create": {
            "_index": "test_index",
            "_type": "test_type",
            "_id": "2",
            "status": 409,
            "error": {
              "type": "version_conflict_engine_exception",
              "reason": "[test_type][2]: version conflict, document already exists (current version [1])",
              "index_uuid": "6m0G7yx7R1KECWWGnfH1sw",
              "shard": "2",
              "index": "test_index"
            }
          }
        },
        {
          "index": {
            "_index": "test_index",
            "_type": "test_type",
            "_id": "4",
            "_version": 1,
            "result": "created",
            "_shards": {
              "total": 2,
              "successful": 1,
              "failed": 0
            },
            "created": true,
            "status": 201
          }
        },
        {
          "index": {
            "_index": "test_index",
            "_type": "test_type",
            "_id": "2",
            "_version": 2,
            "result": "updated",
            "_shards": {
              "total": 2,
              "successful": 1,
              "failed": 0
            },
            "created": false,
            "status": 200
          }
        },
        {
          "update": {
            "_index": "test_index",
            "_type": "test_type",
            "_id": "1",
            "_version": 3,
            "result": "updated",
            "_shards": {
              "total": 2,
              "successful": 1,
              "failed": 0
            },
            "status": 200
          }
        }
      ]
    }

    If any individual operation in a bulk request fails, it does not affect the other operations; the failure is simply reported in that operation's entry of the response.

    The index, or the index and type, can also be put in the URL and omitted from the metadata lines:

    POST /test_index/_bulk
    { "delete": { "_type": "test_type", "_id": "3" }}
    { "create": { "_type": "test_type", "_id": "12" }}
    { "test_field": "test12" }
    { "index": { "_type": "test_type" }}
    { "test_field": "auto-generate id test" }
    { "index": { "_type": "test_type", "_id": "2" }}
    { "test_field": "replaced test2" }
    { "update": { "_type": "test_type", "_id": "1", "_retry_on_conflict" : 3} }
    { "doc" : {"test_field2" : "bulk test1"} }

    POST /test_index/test_type/_bulk
    { "delete": { "_id": "3" }}
    { "create": { "_id": "12" }}
    { "test_field": "test12" }
    { "index": { }}
    { "test_field": "auto-generate id test" }
    { "index": { "_id": "2" }}
    { "test_field": "replaced test2" }
    { "update": { "_id": "1", "_retry_on_conflict" : 3} }
    { "doc" : {"test_field2" : "bulk test1"} }

    2. Optimal bulk size

    The whole bulk request is loaded into memory, so if it is too large, performance actually drops; you have to experiment to find the best bulk size. A reasonable starting point is 1,000 to 5,000 documents per request, increasing gradually from there; measured by payload size, aim for roughly 5 to 15 MB.
    scroll
    Pulling, say, 100,000 documents in a single query performs very badly. The usual approach is a scroll search: retrieve the data batch by batch and process each batch, until everything has been fetched. A scroll search returns one batch, then the next batch on the following request, and so on until all the data has been returned.
    On the first search, scroll saves a snapshot view of the data at that moment and serves all subsequent batches from that snapshot; changes made to the data in the meantime are not visible to the scrolling client.
    Sorting by _doc gives the best performance.
    Every scroll request also specifies a scroll parameter, a time window; each individual request only has to complete within that window.
    GET /test_index/test_type/_search?scroll=1m
    {
      "query": {
        "match_all": {}
      },
      "sort": [ "_doc" ],
      "size": 3
    }

    {
      "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAACxeFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYBY0b25zVFlWWlRqR3ZJajlfc3BXejJ3AAAAAAAALF8WNG9uc1RZVlpUakd2SWo5X3NwV3oydwAAAAAAACxhFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYhY0b25zVFlWWlRqR3ZJajlfc3BXejJ3",
      "took": 5,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 10,
        "max_score": null,
        "hits": [
          {
            "_index": "test_index",
            "_type": "test_type",
            "_id": "8",
            "_score": null,
            "_source": {
              "test_field": "test client 2"
            },
            "sort": [
              0
            ]
          },
          {
            "_index": "test_index",
            "_type": "test_type",
            "_id": "6",
            "_score": null,
            "_source": {
              "test_field": "tes test"
            },
            "sort": [
              0
            ]
          },
          {
            "_index": "test_index",
            "_type": "test_type",
            "_id": "AVp4RN0bhjxldOOnBxaE",
            "_score": null,
            "_source": {
              "test_content": "my test"
            },
            "sort": [
              0
            ]
          }
        ]
      }
    }
    The result includes a _scroll_id; the next scroll request must pass this _scroll_id back:
    GET /_search/scroll
    {
      "scroll": "1m",
      "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAACxeFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYBY0b25zVFlWWlRqR3ZJajlfc3BXejJ3AAAAAAAALF8WNG9uc1RZVlpUakd2SWo5X3NwV3oydwAAAAAAACxhFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYhY0b25zVFlWWlRqR3ZJajlfc3BXejJ3"
    }
    Scroll looks a lot like pagination, but the use cases differ: pagination is for showing results to a user page by page, while scroll is for pulling data batch by batch so that a system can process it.
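    Purely as an illustration of that batch-processing pattern (not from the original article), a rough bash sketch that pages through every document with scroll, assuming curl and jq are installed and a node is listening on localhost:9200:

    RESP=$(curl -s -XGET 'http://localhost:9200/test_index/test_type/_search?scroll=1m' \
      -H 'Content-Type: application/json' \
      -d '{"query": {"match_all": {}}, "sort": ["_doc"], "size": 3}')

    while [ "$(echo "$RESP" | jq '.hits.hits | length')" -gt 0 ]; do
      echo "$RESP" | jq -c '.hits.hits[]'            # hand this batch to downstream processing
      SCROLL_ID=$(echo "$RESP" | jq -r '._scroll_id')
      RESP=$(curl -s -XGET 'http://localhost:9200/_search/scroll' \
        -H 'Content-Type: application/json' \
        -d "{\"scroll\": \"1m\", \"scroll_id\": \"$SCROLL_ID\"}")
    done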

