ABSTRACT:
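This paper introduces three key techniques for relevance: ranking functions, semantic matching features, and query rewriting;
The work is based on the Yahoo search engine, whose index holds on the order of ten billion URLs;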
Keywords:
learning to rank; query rewriting; semantic matching; deep learning;
1. INTRODUCTION
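1) Evolution of search relevance:
- Early work concentrated on text matching between queries and web documents, such as BM25, probabilistic models, and vector space models;
- Recent work improves search relevance using user behavior, such as click models;
2) The challenges search engines face today push us to look for solutions beyond text matching and click models:
- semantic gap - the semantic barrier between queries and web documents;
- tail query - most search queries are tail queries; each occurs with very low probability and is effectively brand new to the search engine;
- Q&A systems - users tend to treat the search engine as a Q&A system;
3) Beyond base relevance, relevance also has temporal and spatial dimensions:
- temporal: some queries need the most recent information;
- spatial: a growing number of queries have strong location intent (e.g., hotels);
4) The solutions proposed in this paper include: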
- Designing a novel learning to rank algorithm for core ranking and a framework of contextual reranking algorithms;
- Developing semantic matching features including click similarity, deep semantic matching, and translated text matching;
- Building an innovative framework to understand user queries with query rewriting and its ranking strategy;
- Proposing solutions to recency sensitive ranking and location sensitive ranking;
2. BACKGROUND
2.1 Overview of Architecture
Omitted (much the same as the architecture at 国搜)
2.2 Ranking Features
The ranking functions are built on top of these features (items in italics are already used at 国搜):
- Web graph : the quality or the popularity of a document (e.g., PageRank)
- Document statistics : some basic statistics of the document (such as the number of words in various fields)
- Document classifier : such as spam, adult, language, main topic...
- Query Features : which help in characterizing the query type (such as number of terms, frequency of the query and of its terms, click-through rate of the query)
- Text match : basic text matching features are computed from different sections of the document (title, body, abstract, keywords) as well as from the anchor text and the URL
- Topical matching : go beyond similarity at the word level and compute similarity at the topic level;
- Click : try to incorporate user feedback
- Time : the freshness of a page
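To make a few of the feature families above concrete, here is a minimal sketch of how query features, document statistics, and a basic text match feature might be computed. The function name `basic_features`, the feature names, the whitespace tokenization, and the match-ratio definition are all illustrative assumptions, not the features actually used in the paper.

```python
def basic_features(query, doc_fields):
    """Compute a few toy ranking features (hypothetical names and definitions).

    query: the raw query string.
    doc_fields: dict mapping a field name (e.g. "title", "body") to its text.
    """
    q_terms = query.lower().split()  # naive whitespace tokenization (assumption)
    feats = {"query_num_terms": len(q_terms)}  # a simple query feature

    for field, text in doc_fields.items():
        tokens = text.lower().split()
        token_set = set(tokens)
        # Document statistic: number of words in this field.
        feats[f"{field}_num_words"] = len(tokens)
        # Text match: fraction of query terms that appear in this field.
        matched = sum(1 for term in q_terms if term in token_set)
        feats[f"{field}_match_ratio"] = matched / len(q_terms) if q_terms else 0.0

    return feats


features = basic_features(
    "yahoo search",
    {"title": "Yahoo Search Engine", "body": "a web search portal"},
)
```

Computing the match feature per field mirrors the note that text match features come from different sections of the document; a production system would use weighted scores such as BM25 rather than a plain ratio.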
2.3 Evaluation of Search Relevance
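1) Search results can be evaluated in several ways, including human labeling (e.g., judgments by professional editors) and user behavior metrics (e.g., click-through rate, query reformulation rate, dwell time);
2) To evaluate base relevance, this paper uses the first method, professional editors' judgments:
Each query-url pair is assigned one of 5 grades: Perfect, Excellent, Good, Fair, Bad;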
Search relevance is measured with the DCG formula: $DCG_N = \sum_{i=1}^{N} \frac{G_i}{\log_2(i+1)}$
(for a ranked list of N documents, $G_i$ represents the weight assigned to the label of the document at position i)
Note: the DCG formula is used only when the editors' relevance judgments are quite reliable;
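The DCG computation described above can be sketched in a few lines, using the standard form in which the gain at position i is discounted by log2(i + 1). The gain weights in `GAIN` are an assumption (a common 2^grade - 1 scheme); these notes do not record the weights the paper assigns to each label.

```python
import math

# Hypothetical gain per editorial grade (2**grade - 1); not taken from the paper.
GAIN = {"Perfect": 15.0, "Excellent": 7.0, "Good": 3.0, "Fair": 1.0, "Bad": 0.0}


def dcg(labels, n=None):
    """Return DCG over the first n labels of a ranked list (all labels if n is None).

    Positions are 1-based, so the top result is discounted by log2(2) = 1.
    """
    top = labels if n is None else labels[:n]
    return sum(GAIN[label] / math.log2(i + 1) for i, label in enumerate(top, start=1))


ranking = ["Perfect", "Bad", "Good"]
score = dcg(ranking)  # 15/1 + 0/log2(3) + 3/2 = 16.5
```

Because the discount grows with position, moving a well-graded document toward the top of the list raises the score, which is the behavior a ranking metric needs.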
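3) This paper groups the queries to be evaluated into three classes by frequency:
top query - highly distinctive queries that are easy to retrieve results for;
torso query - limited information; such a query is issued only a few times a year;
tail query - a query issued less than once a year
=> This paper focuses on search for torso and tail queries;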
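As a rough illustration of the frequency classes above, the sketch below assigns a query to a class from its average number of issues per year. Only the tail boundary (less than once a year) comes from these notes; the function name `query_bucket` and the `top_threshold` value separating torso from top are hypothetical.

```python
def query_bucket(yearly_count, top_threshold=1000.0):
    """Classify a query as "top", "torso", or "tail" by average issues per year.

    top_threshold is a made-up cutoff; the notes only say that torso queries
    are issued a few times a year and tail queries less than once a year.
    """
    if yearly_count < 1:
        return "tail"
    if yearly_count < top_threshold:
        return "torso"
    return "top"


buckets = [query_bucket(c) for c in (0.5, 3, 1_000_000)]
```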