  • Bucketed retrieval settings for knowledge-base QA search

    1 Requirements for bucketed retrieval

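    In the "matching principle" part of the earlier walkthrough of the index-based QA pair matching flow, we preprocessed the similar phrasings of each QA pair at ingestion time and generated feature vectors for them. Ingestion is done per phrasing (question). In real business scenarios, however, each category contains many knowledge items, and each knowledge item has many phrasings. If we simply match similar phrasings and return their scores, the phrasings of a single knowledge item can occupy the whole top-N. To address this, we want to merge the retrieved phrasings: each knowledge item returns only its highest-scoring phrasing, and the number of results returned is configurable.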
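
    The intended result can be illustrated without Elasticsearch. The following is a minimal sketch in plain Java of the merge described above: keep the best-scoring phrasing per knowledge item, sort by score, and cap the count. The Hit class, the bestPerKnowledge helper, and the sample ids and scores are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BestPerKnowledge {
    /** One matched phrasing: its id, the knowledge item it belongs to, and its score. */
    static final class Hit {
        final String id;
        final String kId;
        final double score;

        Hit(String id, String kId, double score) {
            this.id = id;
            this.kId = kId;
            this.score = score;
        }
    }

    /** Keeps the best hit per kId, then returns at most maxNum of them, best first. */
    static List<Hit> bestPerKnowledge(List<Hit> hits, int maxNum) {
        Map<String, Hit> best = new HashMap<>();
        for (Hit h : hits) {
            // Keep whichever hit of this knowledge item has the higher score
            best.merge(h.kId, h, (a, b) -> a.score >= b.score ? a : b);
        }
        List<Hit> result = new ArrayList<>(best.values());
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result.subList(0, Math.min(maxNum, result.size()));
    }

    public static void main(String[] args) {
        // Hypothetical sample: k1 has two phrasings, k2 and k3 one each
        List<Hit> hits = List.of(
                new Hit("q1", "k1", 25.2),
                new Hit("q2", "k1", 11.0),
                new Hit("q3", "k2", 3.9),
                new Hit("q4", "k3", 7.5));
        for (Hit h : bestPerKnowledge(hits, 2)) {
            System.out.println(h.kId + " " + h.id + " " + h.score);
        }
        // prints:
        // k1 q1 25.2
        // k3 q4 7.5
    }
}
```

    Without the merge, a plain top-2 by score would return q1 and q2, both from knowledge item k1, which is exactly the crowding problem described above.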

    2 Design and implementation

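    The ES index gains a kId field that stores the id of the knowledge item each phrasing belongs to (one knowledge item to many phrasings). At query time we group on kId, return the single highest-scoring document from each group, and cap the number of buckets returned.
    With that design in place we implemented it and ran tests (the code below was later found to contain a bug). The grouped query looks like this: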

    // Assemble the search request
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Build the more_like_this query; the filters restrict by online status, user and category
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
                        .filter(QueryBuilders.termsQuery("online", "1"))
                        .filter(QueryBuilders.termsQuery("userId", userId))
                        .filter(QueryBuilders.termsQuery("category", category.split(",")))
                        .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null).minTermFreq(0).minDocFreq(0).minWordLength(2));
    // Per-bucket maximum score, intended as the sort key for the buckets.
    // _score is not a document field, so it is read through a script
    // (this matches the "scoreTop" entry in the printed query).
    AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
    // Top hit of each group (top_hits sorts by _score descending by default)
    AggregationBuilder top = AggregationBuilders.topHits("result")
                    .fetchSource(new String[]{"id", "title", "kId"}, null)
                    .size(1);
    // Group the matches by kId
    TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
                    .field("kId").executionHint("map");
    // Number of buckets to return
    groupTermsBuilder.size(maxNum);
    groupTermsBuilder.subAggregation(top);
    groupTermsBuilder.subAggregation(maxScore);
    
    searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);
    

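    During verification we found that each bucket did return its own highest score, but the buckets returned were not the best ones: groups with higher scores existed outside them. The response below illustrates this (edited for presentation, shown for illustration only):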

    {
    	"took": 2,
    	"timed_out": false,
    	"_shards": {
    		"total": 3,
    		"successful": 3,
    		"skipped": 0,
    		"failed": 0
    	},
    	"hits": {
    		"total": 135,
    		"max_score": 25.215496,
    		"hits": [{
    				"_index": "qaknowwledge",
    				"_type": "doc",
    				"_id": "11657847994935215",
    				"_score": 25.215496,
    				"_source": {
    					"category": "11656187146936040",
    					"id": "11657847994935215",
    					"kId": "11657847993624508",
    					"online": "1",
    					"questionStr": "Stored phrasing 1",
    					"title": "Knowledge 1",
    					"userId": "10869305621348777"
    				}
    			},
    			{
    				"_index": "qaknowwledge",
    				"_type": "doc",
    				"_id": "11657847994935216",
    				"_score": 10.988454,
    				"_source": {
    					"category": "11656187146936040",
    					"id": "11657847994935216",
    					"kId": "11657847993624508",
    					"online": "1",
    					"questionStr": "Phrasing 2",
    					"title": "Knowledge 2",
    					"userId": "10869305621348777"
    				}
    			}
    		]
    	},
    	"aggregations": {
    		"groupkId": {
    			"doc_count_error_upper_bound": 0,
    			"sum_other_doc_count": 72,
    			"buckets": [{
    					"key": "11657847993624494",
    					"doc_count": 5,
    					"result": {
    						"hits": {
    							"total": 5,
    							"max_score": 3.8905885,
    							"hits": [{
    								"_index": "qaknowwledge",
    								"_type": "doc",
    								"_id": "11657847994935160",
    								"_score": 3.8905885,
    								"_source": {
    									"kId": "11657847993624494",
    									"id": "11657847994935160",
    									"title": "Knowledge"
    								}
    							}]
    						}
    					},
    					"scoreTop": {
    						"value": 3.8905885219573975
    					}
    				}
    			]
    		}
    	}
    }
    

    The highest-scoring record (the first hit) does not appear in the grouped results. Printing the generated query gives:

    {
      "size": 20,
      "timeout": "60s",
      "query": {
        "bool": {
          "must": [
            {
              "more_like_this": {
                "fields": [
                  "questionStr"
                ],
                "like": [
                  "phrasing"
                ],
                "max_query_terms": 25,
                "min_term_freq": 0,
                "min_doc_freq": 0,
                "max_doc_freq": 2147483647,
                "min_word_length": 2,
                "max_word_length": 0,
                "minimum_should_match": "30%",
                "boost_terms": 0,
                "include": false,
                "fail_on_unsupported_field": true,
                "boost": 1
              }
            }
          ],
          "filter": [
            {
              "terms": {
                "online": [
                  "1"
                ],
                "boost": 1
              }
            }
          ],
          "adjust_pure_negative": true,
          "boost": 1
        }
      },
      "aggregations": {
        "groupkId": {
          "terms": {
            "field": "kId",
            "size": 20,
            "min_doc_count": 1,
            "shard_min_doc_count": 0,
            "show_term_doc_count_error": false,
            "execution_hint": "map",
            "order": [
              {
                "_count": "desc"
              },
              {
                "_key": "asc"
              }
            ]
          },
          "aggregations": {
            "result": {
              "top_hits": {
                "from": 0,
                "size": 1,
                "version": false,
                "explain": false,
                "_source": {
                  "includes": [
                    "id",
                    "title",
                    "kId"
                  ],
                  "excludes": []
                }
              }
            },
            "scoreTop": {
              "max": {
                "script": {
                  "source": "_score",
                  "lang": "painless"
                }
              }
            }
          }
        }
      }
    }
    

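    Analysis shows that the ordering we intended never took effect. As the query above shows, buckets are still ordered by the number of documents matched in each group, that is: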

    "terms": {
    	"field": "kId",
    	"size": 20,
    	"min_doc_count": 1,
    	"shard_min_doc_count": 0,
    	"show_term_doc_count_error": false,
    	"execution_hint": "map",
    	"order": [{
    			"_count": "desc"
    		},
    		{
    			"_key": "asc"
    		}
    	]
    }
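
    To see why this ordering hides the best match, the selection can be simulated in plain Java. This is a simplified sketch: it ignores per-shard collection and compares the default order shown above (doc count descending, then key ascending) with ordering by each bucket's maximum score. The Bucket class, the topKeys helper, and the sample keys and numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BucketOrderDemo {
    /** Summary of one kId bucket: matched phrasing count and the best score among them. */
    static final class Bucket {
        final String kId;
        final long docCount;
        final double maxScore;

        Bucket(String kId, long docCount, double maxScore) {
            this.kId = kId;
            this.docCount = docCount;
            this.maxScore = maxScore;
        }
    }

    /** Default terms ordering: doc count descending, then key ascending. */
    static final Comparator<Bucket> BY_COUNT =
            Comparator.comparingLong((Bucket b) -> b.docCount).reversed()
                    .thenComparing((Bucket b) -> b.kId);

    /** Desired ordering: best score in the bucket, descending. */
    static final Comparator<Bucket> BY_MAX_SCORE =
            Comparator.comparingDouble((Bucket b) -> b.maxScore).reversed();

    /** Sorts a copy of the buckets and returns the keys of the first `size` of them. */
    static List<String> topKeys(List<Bucket> buckets, Comparator<Bucket> order, int size) {
        List<Bucket> sorted = new ArrayList<>(buckets);
        sorted.sort(order);
        List<String> keys = new ArrayList<>();
        for (Bucket b : sorted.subList(0, Math.min(size, sorted.size()))) {
            keys.add(b.kId);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Hypothetical buckets: k-b holds the best match but has the fewest documents
        List<Bucket> buckets = List.of(
                new Bucket("k-a", 5, 3.89),
                new Bucket("k-b", 2, 25.21),
                new Bucket("k-c", 4, 6.10));
        System.out.println(topKeys(buckets, BY_COUNT, 2));     // prints [k-a, k-c]
        System.out.println(topKeys(buckets, BY_MAX_SCORE, 2)); // prints [k-b, k-c]
    }
}
```

    With the count-based order and a size of 2, the bucket holding the best match (k-b) is cut off, which mirrors the symptom seen in the response.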
    

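    Reviewing the query code shows that we defined the max-score sub-aggregation but never applied it as the ordering of the terms aggregation. Applying it fixes the problem: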

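    // Imports used by this snippet (Elasticsearch Java client)
    import org.elasticsearch.index.query.BoolQueryBuilder;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.script.Script;
    import org.elasticsearch.search.aggregations.AggregationBuilder;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.aggregations.BucketOrder;
    import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    // Assemble the search request
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Build the more_like_this query; the filters restrict by online status, user and category
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery()
                        .filter(QueryBuilders.termsQuery("online", "1"))
                        .filter(QueryBuilders.termsQuery("userId", userId))
                        .filter(QueryBuilders.termsQuery("category", category.split(",")))
                        .must(QueryBuilders.moreLikeThisQuery(new String[]{"questionStr"}, new String[]{questionStr}, null).minTermFreq(0).minDocFreq(0).minWordLength(2));
    // Per-bucket maximum score, used as the sort key for the buckets.
    // The name must match the one passed to BucketOrder.aggregation below;
    // _score is not a document field, so it is read through a script.
    AggregationBuilder maxScore = AggregationBuilders.max("scoreTop").script(new Script("_score"));
    // Top hit of each group (top_hits sorts by _score descending by default)
    AggregationBuilder top = AggregationBuilders.topHits("result")
                    .fetchSource(new String[]{"id", "title", "kId", "qSimhas"}, null)
                    .size(1);
    // Group the matches by kId and order the buckets by their max score, descending
    TermsAggregationBuilder groupTermsBuilder = AggregationBuilders.terms("groupkId")
                    .field("kId").executionHint("map").order(BucketOrder.aggregation("scoreTop", false));
    // Number of buckets to return
    groupTermsBuilder.size(maxNum);
    groupTermsBuilder.subAggregation(top);
    groupTermsBuilder.subAggregation(maxScore);
    
    searchSourceBuilder.query(boolQueryBuilder).aggregation(groupTermsBuilder).size(0);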
    

    In other words, apply "scoreTop" as the order of "groupTermsBuilder". The printed query then shows that the buckets are ordered by each group's highest score.
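
    A quick sanity check on the response can guard against this kind of regression. The sketch below is plain Java with no Elasticsearch dependency. It assumes the bucket "scoreTop" values have already been read from the response in bucket order, and that the overall best hit score comes from a run that also returns hits (as the size-20 query printed earlier does). The orderedByScore helper and the sample numbers are hypothetical. It relies on one detail visible in the response shown earlier: hit scores are floats (3.8905885), while the max aggregation reports the same value widened to a double (3.8905885219573975), so the comparison narrows back to float.

```java
import java.util.List;

class BucketOrderCheck {
    /**
     * Returns true when the bucket scores are in non-increasing order and the
     * first bucket carries the overall best hit score.
     */
    static boolean orderedByScore(float maxScore, List<Double> bucketTopScores) {
        if (bucketTopScores.isEmpty()) {
            return false;
        }
        // Narrow the aggregation's double back to float before comparing with the hit score
        if ((float) bucketTopScores.get(0).doubleValue() != maxScore) {
            return false;
        }
        for (int i = 1; i < bucketTopScores.size(); i++) {
            if (bucketTopScores.get(i) > bucketTopScores.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        float maxScore = 25.215496f;
        // Buckets ordered by max score: the first one carries the best hit
        List<Double> fixed = List.of((double) 25.215496f, 10.5, 3.9);
        // Buckets ordered by doc count: the best hit's bucket is missing
        List<Double> buggy = List.of(3.9, 2.0);
        System.out.println(orderedByScore(maxScore, fixed)); // prints true
        System.out.println(orderedByScore(maxScore, buggy)); // prints false
    }
}
```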

    References:
    Can an ES terms aggregation be sorted by _score?
    Aggregation plus bucket aggregation queries with the ES Java API: terms + top_hits + max

  • Original post: https://www.cnblogs.com/yhzhou/p/13574015.html