[Reading notes] Ranking Relevance in Yahoo Search (3): query rewriting

5. QUERY REWRITING

Purpose:
- Query rewriting alters a given query so that it retrieves better results and, more importantly, helps solve the recall problem.
- It can be treated as a machine translation problem: translating from the language of user queries (S) into the language of web documents (T).
5.1 Methodology

Two phases:
- learning phase: learns phrase-level translations from queries to documents;
- decoding phase: generates candidates for a given query;
Learning Phase =>

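Difficulty in this phase: obtaining a large amount of training data that pairs each query with a rewritten query that improves relevance.

Why this is hard: 1) a good translation model needs a very large bilingual corpus; 2) human editors are not good at judging which queries will improve relevance.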

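Solution:
- Use click graphs (weighted undirected graphs in which queries and documents are nodes, an edge represents clicks on a document for a query, and the edge weight is the number of clicks).
- Use the document title as the corresponding rewritten query (titles are much more similar to queries than document bodies are).
- From the resulting query-title pairs, follow the common steps of a typical phrase-based machine translation framework to learn phrase-level translations.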
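
As a rough illustration of the first two steps, the sketch below turns a toy click log into weighted query-title pairs. The log format, the `titles` lookup, and the `min_clicks` threshold are assumptions made for this example, not details from the paper.

```python
from collections import defaultdict

# Hypothetical click log: (query, doc_id, click_count). All values are made up.
clicks = [
    ("cheap flights nyc", "d1", 40),
    ("cheap flights nyc", "d2", 3),
    ("low cost flights new york", "d1", 25),
]
# Hypothetical document-title lookup.
titles = {"d1": "Low Cost Flights to New York", "d2": "NYC Travel Guide"}


def query_title_pairs(clicks, titles, min_clicks=5):
    """Return sorted (query, title, weight) pairs for edges with enough clicks."""
    weights = defaultdict(int)
    for query, doc_id, n in clicks:
        weights[(query, doc_id)] += n  # edge weight = total number of clicks
    pairs = []
    for (query, doc_id), w in weights.items():
        # min_clicks is an assumed noise filter, not something the notes specify.
        if w >= min_clicks and doc_id in titles:
            pairs.append((query, titles[doc_id].lower(), w))
    return sorted(pairs)


pairs = query_title_pairs(clicks, titles)
```

Each resulting pair plays the role of one "bilingual" sentence pair (query side, title side) that a phrase-based translation toolkit could then align.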
Decoding Phase =>

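Purpose:

Each query (q) can be segmented into phrases in many ways, and each phrase has many translations, so there can be hundreds of candidate rewritten queries.

=> The decoding phase picks the most reliable rewritten query (q_w) from these candidates.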
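
To see why the candidate set grows so quickly, here is a minimal sketch that enumerates every segmentation of a query and every combination of phrase translations. The `phrase_table` contents are invented for illustration, and a real decoder would prune the search rather than enumerate exhaustively.

```python
from itertools import product

# Hypothetical phrase table: source phrase -> candidate translations.
phrase_table = {
    "nyc": ["new york", "new york city"],
    "cheap": ["low cost"],
    "cheap flights": ["budget airfare"],
}


def segmentations(words):
    """Yield every way to split a word list into contiguous phrases."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in segmentations(words[i:]):
            yield [head] + rest


def candidates(query):
    """Enumerate rewrites by translating each phrase or keeping it unchanged."""
    out = set()
    for seg in segmentations(query.split()):
        options = [[p] + phrase_table.get(p, []) for p in seg]
        for choice in product(*options):
            out.add(" ".join(choice))
    out.discard(query)  # the unchanged query is not a rewrite
    return sorted(out)


cands = candidates("cheap flights nyc")
```

Even this three-word query with a three-entry phrase table yields eight distinct rewrites; with a realistic phrase table the count grows rapidly.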

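Formula: q_w = argmax_{q_c} Σ_{i=1}^{m} λ_i · h_i(q_c, q)

Here h_i(q_c, q) is the i-th feature function and m is the number of feature functions; λ_i is the weight of that function, which can be set by hand or learned through a loss function.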
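
Reading the selection as a weighted sum of feature functions, which is what the description of h_i and λ_i suggests, picking the best candidate can be sketched as follows. The two toy features and the hand-set weights are assumptions for illustration only; they are not the paper's features.

```python
def score(features, weights):
    """Linear model: sum of lambda_i * h_i(q_c, q)."""
    return sum(w * h for w, h in zip(weights, features))


def decode(query, candidates, feature_fn, weights):
    """Return the candidate q_c with the highest weighted feature sum."""
    return max(candidates, key=lambda qc: score(feature_fn(qc, query), weights))


def toy_features(qc, q):
    """Two illustrative features: shared words and word-count difference."""
    a, b = set(qc.split()), set(q.split())
    return [len(a & b), abs(len(qc.split()) - len(q.split()))]


weights = [1.0, -0.5]  # hand-set lambdas, purely illustrative
best = decode(
    "cheap flights nyc",
    ["low cost flights new york", "cheap flights new york"],
    toy_features,
    weights,
)
```

With these toy weights the second candidate wins (score 1.5 versus 0.0) because it shares more words with the original and differs less in length.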

Feature functions:

For each pair (q_c, q), the paper uses three types of feature functions: query feature functions, rewrite query feature functions, and pair feature functions.

(Query feature functions)

h₁ - number of words in q; h₂ - number of stop words in q; h₃ - language model score of the query q; h₄ - query frequency of q; h₅ - average length of words in q.

(Rewrite query feature functions)

h₆ - number of words in q_c; h₇ - number of stop words in q_c; h₈ - language model score of the query q_c; h₉ - query frequency of q_c; h₁₀ - average length of words in q_c.

(Pair feature functions)

h₁₁ - Jaccard similarity of URLs shared by q and q_c in the query-URL graph;

h₁₂ - difference between the frequencies of q and q_c;

h₁₃ - word-level cosine similarity between q and q_c;

h₁₄ - difference in the number of words between q and q_c;

h₁₅ - number of common words in q and q_c;

h₁₆ - difference of language model scores between q and q_c;

h₁₇ - difference of the number of stop words between q and q_c;

h₁₈ - difference of the average length of words between q and q_c.

=> Experiments show that h₁₁, h₁₂, and h₁₃ are the three most important feature functions.
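
The three most important pair features are simple to compute. The sketch below is one possible reading: the notes do not say whether the frequency difference is signed or how words are weighted in the cosine, so the absolute difference and raw word counts used here are assumptions.

```python
import math
from collections import Counter


def jaccard_urls(urls_q, urls_qc):
    """h11: Jaccard similarity of the URL sets linked to q and q_c."""
    a, b = set(urls_q), set(urls_qc)
    return len(a & b) / len(a | b) if a | b else 0.0


def frequency_diff(freq_q, freq_qc):
    """h12: difference between query frequencies (absolute value assumed)."""
    return abs(freq_q - freq_qc)


def word_cosine(q, qc):
    """h13: cosine similarity over raw word-count vectors (weighting assumed)."""
    va, vb = Counter(q.split()), Counter(qc.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(
        sum(v * v for v in va.values()) * sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


h11 = jaccard_urls(["u1", "u2", "u3"], ["u2", "u3", "u4"])
h12 = frequency_diff(1200, 950)
h13 = word_cosine("cheap flights nyc", "cheap flights new york")
```

In this toy example the two queries share two of four distinct URLs, so h₁₁ is 0.5, and they share two words out of three and four, so h₁₃ is 2 / sqrt(12).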

5.2 Ranking Strategy

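Given the original query and the rewritten query, there are two ranking strategies:

Replace the original query with the rewritten query (not adopted) =>

Assessment: replacing the query outright is risky, because low-quality rewrites hurt relevance.

Blending mode (adopted) =>

Method: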

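1) Use the original query (q) and the rewritten query (q_c) separately to retrieve the top-N documents from the search engine, recording the ranked list and scores from each retrieval (O and R).

2) Join O and R: if a document D appears in both O and R, its final score is max(score_O(D), score_R(D)).

3) Rank the documents in the joined list by score and return the top-N as the final results for the original query.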
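
The blending steps can be sketched in a few lines. This assumes each retrieval returns a `{doc_id: score}` mapping, that scores from the two retrievals are directly comparable, and that a document found in only one list keeps its own score; the function name and the tie-break by document id are also choices made for this example.

```python
def blend(original, rewritten, n):
    """Join two {doc_id: score} result sets and return the top-n (doc, score) list."""
    merged = dict(original)
    for doc, s in rewritten.items():
        # A document present in both lists gets the larger of its two scores.
        merged[doc] = max(s, merged.get(doc, s))
    # Sort by descending score; break ties by doc id for a stable order.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


O = {"d1": 0.9, "d2": 0.4, "d3": 0.3}  # results for the original query
R = {"d2": 0.7, "d4": 0.6, "d5": 0.1}  # results for the rewritten query
top = blend(O, R, 3)
```

Here d2 is lifted from 0.4 to 0.7 by the rewrite and d4 enters the top three, while d1 keeps its place, so a poor rewrite cannot push out documents that already scored well for the original query.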

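Evaluation of the two ranking strategies:

Both strategies significantly improve search relevance for tail queries.

However, because a rewritten query may change the intent of the original query, the Replace strategy performs worse than Blending mode.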

Original post: https://www.cnblogs.com/tanfy/p/8378522.html