  • [Reading Notes] Ranking Relevance in Yahoo Search (Part 3): Query Rewriting

    5. QUERY REWRITING

    Purpose:

    • Query rewriting is the task of altering a given query so that it gets better results and, more importantly, helps solve the recall problem.
    • It can be treated as a machine translation problem: the language of user queries (S) <=> the language of web documents (T).

    5.1 Methodology

    Two phases:

    • learning phase: learns phrase-level translations from queries to documents;
    • decoding phase: generates candidates for a given query;

    Learning Phase =>

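    Difficulty in this phase: obtaining a large amount of training data that pairs each query with a rewritten query that improves relevance.

    Why this is hard: 1) a good translation model needs a very large bilingual (parallel) corpus; 2) human editors are not good at judging which rewritten queries would improve relevance.

    Solution:

    • Use click graphs (weighted undirected graphs: queries and documents are the nodes, an edge represents clicks between a query and a document, and the edge weight is the number of clicks).
    • Use the document title as the corresponding rewritten query (a title is much more similar to a query than the document body is).
    • From the resulting query-title pairs, follow the common steps of a typical phrase-based machine translation framework to learn phrase-level translations.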
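
    As an illustration of the first two bullets, here is a minimal sketch of turning a click log into query-title training pairs. The log layout, the `TITLES` lookup, the sample data, and the `min_clicks` cutoff are all assumptions made for this example; the notes do not describe how the paper filters or stores its click graph.

```python
from collections import defaultdict

# Hypothetical click log: (query, document id, clicks). All values are made up.
CLICK_LOG = [
    ("cheap flights nyc", "doc1", 40),
    ("cheap flights nyc", "doc2", 3),
    ("nyc airfare deals", "doc1", 25),
]

# Hypothetical document titles keyed by document id.
TITLES = {
    "doc1": "Cheap Flights to New York City",
    "doc2": "New York Travel Guide",
}


def build_click_graph(click_log):
    """Aggregate clicks into the edge weights of a query-document graph."""
    graph = defaultdict(int)
    for query, doc_id, clicks in click_log:
        graph[(query, doc_id)] += clicks
    return graph


def query_title_pairs(graph, titles, min_clicks=5):
    """Turn sufficiently clicked edges into (query, title, weight) pairs.

    The min_clicks cutoff is an assumption used here to drop noisy edges.
    """
    pairs = []
    for (query, doc_id), weight in graph.items():
        if weight >= min_clicks and doc_id in titles:
            pairs.append((query, titles[doc_id].lower(), weight))
    return sorted(pairs)


graph = build_click_graph(CLICK_LOG)
pairs = query_title_pairs(graph, TITLES)
```

    Each resulting pair plays the role of one "sentence pair" in a parallel corpus, with the click count available as a weight if the training pipeline supports one.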

    Decoding Phase =>

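    Purpose:

    Each query (q) can be segmented into phrases in many different ways, and each phrase has many translations, so there can be hundreds or even thousands of candidate rewritten queries.

    => The decoding phase picks the most reliable rewritten query (qw) from these candidates.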
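
    To see why the candidate set grows so quickly, the sketch below enumerates every segmentation of a query and every combination of phrase translations. The `PHRASE_TABLE` contents are hypothetical, and exhaustive enumeration is used only for illustration; it is not presented as the decoder used in the paper.

```python
from itertools import product

# Hypothetical phrase table: source phrase -> candidate translations.
PHRASE_TABLE = {
    "nyc": ["new york city", "new york"],
    "cheap flights": ["low cost flights", "cheap airfare"],
    "cheap": ["inexpensive"],
}


def segmentations(words):
    """Yield every way to split a word list into contiguous phrases."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in segmentations(words[i:]):
            yield [head] + rest


def candidates(query, phrase_table):
    """Enumerate rewrites by translating each phrase or keeping it unchanged."""
    results = set()
    for phrases in segmentations(query.split()):
        # Each phrase may stay as-is or be replaced by any known translation.
        options = [[p] + phrase_table.get(p, []) for p in phrases]
        for choice in product(*options):
            results.add(" ".join(choice))
    results.discard(query)  # the unchanged query is not a rewrite
    return sorted(results)


rewrites = candidates("cheap flights nyc", PHRASE_TABLE)
```

    Even this three-word query with a three-entry table produces 11 distinct rewrites, which is why a scoring step is needed to choose among them.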

    Formula: (to be added)

    hi(qc, q) is the i-th feature function; λi is the weight of that function, and it can be set manually or learned by optimizing a loss function.
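
    The formula itself is missing from these notes. Going only by the definitions just given (feature functions hi with weights λi, and a best rewrite qw chosen among candidates qc), a plausible form is a weighted linear combination maximized over the candidates. This is a reconstruction under that assumption, not a quotation from the paper, and m (the number of feature functions) is a symbol introduced here:

```latex
q_w = \arg\max_{q_c} \sum_{i=1}^{m} \lambda_i \, h_i(q_c, q)
```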

    Feature functions:

    For each pair (qc, q), the paper uses three types of feature functions: query feature functions, rewrite query feature functions, and pair feature functions.

    (Query feature functions)

    h1 - number of words in q

    h2 - number of stop words in q

    h3 - language model score of the query q

    h4 - query frequency of q

    h5 - average length of words in q

    (Rewrite query feature functions)

    h6 - number of words in qc

    h7 - number of stop words in qc

    h8 - language model score of the query qc

    h9 - query frequency of qc

    h10 - average length of words in qc

    (Pair feature functions)

    h11 - Jaccard similarity of URLs shared by q and qc in the query-URL graph;

    h12 - difference between the frequencies of q and qc

    h13 - word-level cosine similarity between q and qc

    h14 - difference in the number of words between q and qc

    h15 - number of common words in q and qc

    h16 - difference of language model scores between q and qc

    h17 - difference of the number of stop words between q and qc

    h18 - difference of the average length of words between q and qc

    => Experiments show that h11, h12, and h13 are the three most important feature functions.
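
    A minimal sketch of the three pair features h11, h12, and h13 follows. The input shapes (sets of clicked URLs, raw frequency counts), the whitespace tokenization, and the choice of a signed difference for h12 are assumptions; the notes do not specify them.

```python
import math
from collections import Counter


def jaccard_urls(urls_q, urls_qc):
    """h11: Jaccard similarity of the URL sets clicked for q and qc."""
    a, b = set(urls_q), set(urls_qc)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def frequency_difference(freq_q, freq_qc):
    """h12: difference between the query frequencies of q and qc (signed)."""
    return freq_q - freq_qc


def word_cosine(q, qc):
    """h13: cosine similarity between the word-count vectors of q and qc."""
    va, vb = Counter(q.split()), Counter(qc.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical inputs for one (q, qc) pair.
h11 = jaccard_urls({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
h12 = frequency_difference(1200, 300)
h13 = word_cosine("cheap flights nyc", "cheap flights new york")
```

    With these made-up inputs, the two queries share two of four distinct URLs and two words, so both similarity features land between 0 and 1 while h12 stays on the raw frequency scale; in practice the features would likely need normalization before being combined with weights.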

    5.2 Ranking Strategy

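    Given the original query and the rewritten query, there are two ranking strategies:

    Replace the original query with the rewritten query (not adopted) =>

    Assessment: replacing the query outright is risky, because some low-quality rewrites hurt relevance.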

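    Blending mode (adopted) =>

    Method:

    1) Retrieve the top-N documents from the search engine for the original query (q) and for the rewritten query (qc) separately, and record both ranked lists together with their scores (O, R).

    2) Combine O and R, handling the overlap as follows: if a document D appears in both O and R, its final score is the maximum of its score in O and its score in R.

    3) Rank the documents by these scores and take the top-N as the final results for the original query.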
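
    The three steps can be sketched as below. Two assumptions go beyond the notes: documents returned by only one of the two queries keep their single score, and scores from the two retrievals are treated as directly comparable. The function name `blend` and the sample scores are hypothetical.

```python
def blend(original, rewritten, top_n):
    """Merge two {doc_id: score} result sets and return the top_n documents.

    A document found by both queries keeps the larger of its two scores.
    Documents found by only one query keep their single score (an assumption).
    """
    merged = dict(original)
    for doc_id, score in rewritten.items():
        merged[doc_id] = max(score, merged.get(doc_id, score))
    # Sort by descending score; break ties by document id for determinism.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical scores for the top-3 results of q (O) and qc (R).
O = {"d1": 0.90, "d2": 0.60, "d3": 0.40}
R = {"d2": 0.80, "d4": 0.70, "d5": 0.30}

final = blend(O, R, top_n=3)
```

    Here d2 appears in both lists and is promoted to 0.80, while d4 enters the final top-3 only because the rewritten query retrieved it, which is the recall gain that blending is meant to capture without discarding the original results.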

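    Evaluation of the two ranking strategies:

    Both strategies significantly improve search relevance for tail queries.

    However, because a rewritten query may change the intent of the original query, the Replace strategy performs worse than Blending mode.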

  • Original article: https://www.cnblogs.com/tanfy/p/8378522.html