  • Bucketed retrieval settings for knowledge-base QA search

    1 Requirements for bucketed retrieval

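    In the matching-principles part of the earlier post on the index-based QA pair matching workflow, we preprocessed the similar phrasings of each QA pair at indexing time and generated feature vectors for them. Indexing is done per phrasing (question). In real business scenarios, however, each category contains many knowledge items, and each knowledge item has many phrasings. If we only match similar phrasings and return their scores, the phrasings of a single knowledge item can occupy the whole top N. To address this, we want to merge the retrieved phrasings so that each knowledge item returns only its single highest-scoring phrasing, with the number of returned items under our control.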
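
    The requirement above can be sketched in memory, without Elasticsearch, to pin down the expected result shape. This is only an illustration: the class `BestPerKnowledge`, the `Hit` type and the sample ids and scores are hypothetical and not part of the original system.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: mimics the desired result shape in memory, not via Elasticsearch. */
class BestPerKnowledge {

    /** A scored phrasing hit; the field names are assumptions made for this sketch. */
    static final class Hit {
        final String id;
        final String kId;
        final double score;

        Hit(String id, String kId, double score) {
            this.id = id;
            this.kId = kId;
            this.score = score;
        }
    }

    /** Keeps the best-scoring hit per kId, then returns at most maxNum of them, best first. */
    static List<Hit> bestPerKnowledge(List<Hit> hits, int maxNum) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit h : hits) {
            Hit current = best.get(h.kId);
            if (current == null || h.score > current.score) {
                best.put(h.kId, h);
            }
        }
        List<Hit> result = new ArrayList<>(best.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(result.subList(0, Math.min(maxNum, result.size())));
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("q1", "k1", 25.2)); // two phrasings of the same knowledge item k1
        hits.add(new Hit("q2", "k1", 11.0));
        hits.add(new Hit("q3", "k2", 3.9));
        hits.add(new Hit("q4", "k3", 7.5));
        // Expect one hit per knowledge item: q1 for k1 first, then q4 for k3.
        for (Hit h : bestPerKnowledge(hits, 2)) {
            System.out.println(h.kId + " " + h.id + " " + h.score);
        }
    }
}
```

    Without the per-kId merge, q1 and q2 would both take a top-2 slot even though they belong to the same knowledge item.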

    2 Design and implementation

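    In the ES field design we add a knowledge field, kId, which stores the id of the knowledge item each phrasing belongs to (one knowledge item has many phrasings). At search time we group by the kId field, return the single highest-scoring document from each group, and set the number of buckets to return.
    With this design in place, we implemented it and ran verification tests (this version of the code was later found to have a bug). The grouped query code is shown below: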

    // Build the search request
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Build the more_like_this query, restricted by the filters
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
                        .filter(QueryBuilders.termsQuery("online", "1"))
                        .filter(QueryBuilders.termsQuery("userId", userId))
                        .filter(QueryBuilders.termsQuery("category", category.split(",")))
                        .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null).minTermFreq(0).minDocFreq(0).minWordLength(2));
    // Per-group max score, intended as the sort key ("scoreTop", computed from _score by script)
    AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
    // Top hit of each group (top_hits sorts by _score descending by default)
    AggregationBuilder top = AggregationBuilders.topHits("result")
                    .fetchSource(new String[]{"id", "title", "kId"}, null)
                    .size(1);
    // Group by kId; note that no order is set on the terms aggregation here
    TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
                    .field("kId").executionHint("map");
    // Number of groups (buckets) to return
    groupTermsBuilder.size(maxNum);
    groupTermsBuilder.subAggregation(top);
    groupTermsBuilder.subAggregation(maxScore);

    searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);
    

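    During verification we found that each group did return its own highest score, but groups with even higher scores were missing from the returned buckets, as the following result shows (the result has been trimmed and edited, for display only):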

    {
    	"took": 2,
    	"timed_out": false,
    	"_shards": {
    		"total": 3,
    		"successful": 3,
    		"skipped": 0,
    		"failed": 0
    	},
    	"hits": {
    		"total": 135,
    		"max_score": 25.215496,
    		"hits": [{
    				"_index": "qaknowwledge",
    				"_type": "doc",
    				"_id": "11657847994935215",
    				"_score": 25.215496,
    				"_source": {
    					"category": "11656187146936040",
    					"id": "11657847994935215",
    					"kId": "11657847993624508",
    					"online": "1",
    					"qStr": "stored phrasing 1",
    					"title": "Knowledge 1",
    					"userId": "10869305621348777"
    				}
    			},
    			{
    				"_index": "qaknowwledge",
    				"_type": "doc",
    				"_id": "11657847994935216",
    				"_score": 10.988454,
    				"_source": {
    					"category": "11656187146936040",
    					"id": "11657847994935216",
    					"kId": "11657847993624508",
    					"online": "1",
    					"questionStr": "phrasing 2",
    					"title": "Knowledge 2",
    					"userId": "10869305621348777"
    				}
    			}
    		]
    	},
    	"aggregations": {
    		"groupkId": {
    			"doc_count_error_upper_bound": 0,
    			"sum_other_doc_count": 72,
    			"buckets": [{
    					"key": "11657847993624494",
    					"doc_count": 5,
    					"result": {
    						"hits": {
    							"total": 5,
    							"max_score": 3.8905885,
    							"hits": [{
    								"_index": "qaknowwledge",
    								"_type": "doc",
    								"_id": "11657847994935160",
    								"_score": 3.8905885,
    								"_source": {
    									"kId": "11657847993624494",
    									"id": "11657847994935160",
    									"title": "Knowledge"
    								}
    							}]
    						}
    					},
    					"scoreTop": {
    						"value": 3.8905885219573975
    					}
    				}
    			]
    		}
    	}
    }
    

    The highest-scoring record, the first hit above, does not appear in the grouped results. Printing the query gives:

    {
      "size": 20,
      "timeout": "60s",
      "query": {
        "bool": {
          "must": [
            {
              "more_like_this": {
                "fields": [
                  "questionStr"
                ],
                "like": [
                  "phrasing"
                ],
                "max_query_terms": 25,
                "min_term_freq": 0,
                "min_doc_freq": 0,
                "max_doc_freq": 2147483647,
                "min_word_length": 2,
                "max_word_length": 0,
                "minimum_should_match": "30%",
                "boost_terms": 0,
                "include": false,
                "fail_on_unsupported_field": true,
                "boost": 1
              }
            }
          ],
          "filter": [
            {
              "terms": {
                "online": [
                  "1"
                ],
                "boost": 1
              }
            }
          ],
          "adjust_pure_negative": true,
          "boost": 1
        }
      },
      "aggregations": {
        "groupkId": {
          "terms": {
            "field": "kId",
            "size": 20,
            "min_doc_count": 1,
            "shard_min_doc_count": 0,
            "show_term_doc_count_error": false,
            "execution_hint": "map",
            "order": [
              {
                "_count": "desc"
              },
              {
                "_key": "asc"
              }
            ]
          },
          "aggregations": {
            "result": {
              "top_hits": {
                "from": 0,
                "size": 1,
                "version": false,
                "explain": false,
                "_source": {
                  "includes": [
                    "id",
                    "title",
                    "kId"
                  ],
                  "excludes": []
                }
              }
            },
            "scoreTop": {
              "max": {
                "script": {
                  "source": "_score",
                  "lang": "painless"
                }
              }
            }
          }
        }
      }
    }
    

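    Analysis shows that the sort strategy we set did not take effect. As the query above shows, buckets are still ordered by the number of documents matched in each group, that is: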

    "terms": {
    	"field": "kId",
    	"size": 20,
    	"min_doc_count": 1,
    	"shard_min_doc_count": 0,
    	"show_term_doc_count_error": false,
    	"execution_hint": "map",
    	"order": [{
    			"_count": "desc"
    		},
    		{
    			"_key": "asc"
    		}
    	]
    }
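
    To see why this default order can hide the best match once `size` truncates the bucket list, here is a small in-memory sketch in plain Java. It is not Elasticsearch code and it ignores sharding; the class `BucketOrderDemo`, the `Bucket` type and the sample keys, counts and scores are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: an in-memory stand-in for terms-aggregation bucket ordering. */
class BucketOrderDemo {

    /** One kId bucket: how many phrasings matched and the best score among them. */
    static final class Bucket {
        final String key;
        final long docCount;
        final double maxScore;

        Bucket(String key, long docCount, double maxScore) {
            this.key = key;
            this.docCount = docCount;
            this.maxScore = maxScore;
        }
    }

    /** Mirrors the default order in the printed query: _count desc, then _key asc. */
    static final Comparator<Bucket> BY_COUNT =
            Comparator.comparingLong((Bucket b) -> b.docCount).reversed()
                    .thenComparing((Bucket b) -> b.key);

    /** Mirrors ordering by a max-score sub-aggregation, highest first. */
    static final Comparator<Bucket> BY_MAX_SCORE =
            Comparator.comparingDouble((Bucket b) -> b.maxScore).reversed();

    /** Sorts the buckets with the given order and keeps the first `size` keys. */
    static List<String> topKeys(List<Bucket> buckets, Comparator<Bucket> order, int size) {
        List<Bucket> sorted = new ArrayList<>(buckets);
        sorted.sort(order);
        List<String> keys = new ArrayList<>();
        for (Bucket b : sorted.subList(0, Math.min(size, sorted.size()))) {
            keys.add(b.key);
        }
        return keys;
    }

    /** Hypothetical data: kB holds the best match but has fewer matching phrasings. */
    static List<Bucket> sample() {
        List<Bucket> buckets = new ArrayList<>();
        buckets.add(new Bucket("kA", 5, 3.9));
        buckets.add(new Bucket("kB", 2, 25.2));
        buckets.add(new Bucket("kC", 4, 7.5));
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println("by count:     " + topKeys(sample(), BY_COUNT, 2));
        System.out.println("by max score: " + topKeys(sample(), BY_MAX_SCORE, 2));
    }
}
```

    With a size of 2, ordering by count keeps kA and kC and drops kB, the bucket holding the best-scoring phrasing, while ordering by max score keeps kB first.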
    

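    Looking at the query code again, we had only defined the max-score sub-aggregation; it was never applied to the terms grouping as its order. The fix is as follows: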

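    // Imports needed by this fragment
    import org.elasticsearch.index.query.BoolQueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.script.Script;
    import org.elasticsearch.search.aggregations.AggregationBuilder;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.aggregations.BucketOrder;
    import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    // Build the search request
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Build the more_like_this query, restricted by the filters
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
                        .filter(QueryBuilders.termsQuery("online", "1"))
                        .filter(QueryBuilders.termsQuery("userId", userId))
                        .filter(QueryBuilders.termsQuery("category", category.split(",")))
                        .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null).minTermFreq(0).minDocFreq(0).minWordLength(2));
    // Per-group max score used as the sort key ("scoreTop", computed from _score by script)
    AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
    // Top hit of each group (top_hits sorts by _score descending by default)
    AggregationBuilder top = AggregationBuilders.topHits("result")
                    .fetchSource(new String[]{"id", "title", "kId", "qSimhas"}, null)
                    .size(1);
    // Group by kId and order the buckets by the "scoreTop" sub-aggregation, descending
    TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
                    .field("kId").executionHint("map").order(BucketOrder.aggregation("scoreTop", false));
    // Number of groups (buckets) to return
    groupTermsBuilder.size(maxNum);
    groupTermsBuilder.subAggregation(top);
    groupTermsBuilder.subAggregation(maxScore);

    searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);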
    

    In other words, apply "scoreTop" as the order of "groupTermsBuilder". The printed query then shows that buckets are ordered by each group's highest score.


  • Original article: https://www.cnblogs.com/yhzhou/p/13574015.html