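To improve search accuracy and to offer users personalized recommendations, every search engine stores its users' search history. Each record includes the query, the time, the IP address, operating system and browser information, and so on. The log also records which search results the user clicked for that query.
For commercial reasons and to protect user privacy, these search logs are not made public.
Studies of search logs have already established the following: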
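- Users prefer short queries, averaging about 3.5 words
- Most searches are issued by a small fraction of users, while the majority of users rarely search
- A query's frequency follows a power law in its rank: Frequency(q) = K × Rank(q)^(-α), where K is a constant and Rank(q) is the popularity rank of q
- The top-ranked queries account for most of the total query volume, while low-ranked queries are issued only a handful of times
- A query's rank (popularity rank) changes over time, and the popular queries of different time periods overlap very little.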
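The power-law relationship in the list above can be checked against a log by fitting a straight line in log-log space. The following is a minimal sketch, assuming the log is available as a plain list of query strings; `fit_power_law` and the toy data are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter

def fit_power_law(queries):
    """Estimate K and alpha in Frequency(q) = K * Rank(q)^(-alpha).

    Taking logs gives log F = log K - alpha * log Rank, so a
    least-squares line through (log rank, log frequency) yields both.
    Needs at least two distinct queries.
    """
    counts = sorted(Counter(queries).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(freq) for freq in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    alpha = -slope
    k = math.exp(mean_y - slope * mean_x)
    return k, alpha

# Made-up log: frequencies 60, 30, 20 at ranks 1, 2, 3
toy_log = ["weather"] * 60 + ["news"] * 30 + ["maps"] * 20
k, alpha = fit_power_law(toy_log)
print(f"K = {k:.2f}, alpha = {alpha:.2f}")
```

On real logs the fit is usually noisy in the long tail, so the estimate of α depends on how many ranks are included.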
A search log can be split into sessions by time, by query similarity, or by common reformulation patterns.
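Splitting by time is the simplest of these. The sketch below assumes one user's events are given as `(timestamp, query)` tuples and uses a 30-minute gap as the cutoff; both the input format and the function name `split_sessions` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def split_sessions(events, max_gap=timedelta(minutes=30)):
    """Group (timestamp, query) events from one user into sessions.

    A new session starts whenever the gap since the previous event
    exceeds max_gap. The 30-minute default is an assumed cutoff.
    """
    sessions = []
    previous_time = None
    for timestamp, query in sorted(events):
        if previous_time is None or timestamp - previous_time > max_gap:
            sessions.append([])
        sessions[-1].append(query)
        previous_time = timestamp
    return sessions

# Made-up events for a single user
events = [
    (datetime(2024, 1, 1, 9, 0), "python csv"),
    (datetime(2024, 1, 1, 9, 5), "python csv reader"),
    (datetime(2024, 1, 1, 14, 0), "flights to tokyo"),
]
print(split_sessions(events))
```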
III. Extracting user needs from the search log
Goal: An information need is a single, well-defined goal.
Mission: A mission is a set of related information needs.
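There are many heuristic methods, for example Baseline, 30 minutes, Trained time, commonw, and so on.
Alternatively, the classification can be driven by features: [Temporal, Edit distance, Query log, Web search]. This falls under machine learning, and its accuracy is better than that of the heuristic methods.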
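As an illustration of the feature-based approach, the sketch below computes two of the four feature groups (Temporal and Edit distance) for a pair of consecutive queries. The feature names and the `pair_features` helper are hypothetical; the Query log and Web search features need external data and are left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def pair_features(query_a, time_a, query_b, time_b):
    """Features for two consecutive queries (times in seconds)."""
    distance = edit_distance(query_a, query_b)
    longest = max(len(query_a), len(query_b), 1)
    return {
        "time_gap_seconds": time_b - time_a,
        "edit_distance": distance,
        "normalized_edit_distance": distance / longest,
    }

print(pair_features("new york hotel", 0, "new york hotels", 40))
```

A classifier trained on labeled query pairs would take such feature dictionaries as input and predict whether the two queries belong to the same goal or mission.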
IV. Query suggestions
Search Trail(踪迹): a single information seeking session
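For a rare query, the existing search log may not contain enough information to recommend other queries. In that case the query can be split and recombined: break it into several subqueries, then match and recommend based on those.
Suppose we have already found a set of possibly related queries. How do we decide the order in which to recommend them? The queries have to be ranked. There are many ways to rank, and this is again machine learning territory: linear model, SVM, bag of 100 DTs (decision trees)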
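One way to picture the splitting step: enumerate the contiguous word spans of the rare query and look each one up in whatever suggestions the log already provides. This is only a sketch under assumptions; `known_suggestions` is a hypothetical dictionary from query to suggested queries, and the recombination step is not modeled.

```python
def subqueries(query, min_words=1):
    """Contiguous word spans of the query, longest first, excluding the full query."""
    words = query.split()
    spans = []
    for length in range(len(words) - 1, min_words - 1, -1):
        for start in range(len(words) - length + 1):
            spans.append(" ".join(words[start:start + length]))
    return spans

def suggest_for_rare_query(query, known_suggestions):
    """Collect suggestions from the subqueries that the log knows about."""
    results = []
    for sub in subqueries(query):
        for suggestion in known_suggestions.get(sub, []):
            if suggestion not in results:
                results.append(suggestion)
    return results

# Made-up suggestion table derived from a log
known_suggestions = {
    "cheap flights": ["cheap flights europe", "last minute flights"],
    "tokyo": ["tokyo hotels"],
}
print(suggest_for_rare_query("cheap flights tokyo", known_suggestions))
```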
V. Getting feedback from user clicks
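- Users almost never give feedback voluntarily
- Users mostly look at only the top 3-6 results and very rarely go to the next page
- Clicks are influenced by many factors: users assume that results ranked higher are of better quality, and the quality of the snippet also affects whether they click the link
- For these reasons, judging a document's relevance from a single click is unreliable
- For the same query (e.g. apple), one person's intent may tell us little about another person's, but the behavior of the group as a whole generally does reflect individual needs.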
VI. Inferring document relevance from implicit feedback
1. relevance(Doc[i])=Observed click rate - expected click rate
2. Click Deviation:
Deviation (Di, r) = Observed_Rate (Di, r) – Expected_Rate (r)
observed rate(Di, r): the click rate of Doc[i] in rank r.
Expected rate(r): expected rate of a doc in rank r
3. CDiff
Compute Deviation (Di, r) for each document Di at rank r
Prefer (Di, Dj) iff Deviation (Di, ri) – Deviation (Dj,rj) > m
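The click deviation and CDiff rules above can be sketched as follows. The input format, a list of `(doc, rank, was_clicked)` impressions, is an assumption, and so is the way the expected rate is estimated here (the average click rate of all documents shown at that rank); the function names are hypothetical.

```python
from collections import defaultdict

def expected_rates(impressions):
    """Expected click rate per rank: clicks at rank r / impressions at rank r."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for _doc, rank, was_clicked in impressions:
        shown[rank] += 1
        clicked[rank] += int(was_clicked)
    return {rank: clicked[rank] / shown[rank] for rank in shown}

def deviations(impressions):
    """Deviation(D, r) = Observed_Rate(D, r) - Expected_Rate(r)."""
    expected = expected_rates(impressions)
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for doc, rank, was_clicked in impressions:
        shown[(doc, rank)] += 1
        clicked[(doc, rank)] += int(was_clicked)
    return {key: clicked[key] / shown[key] - expected[key[1]] for key in shown}

def cdiff_preferences(impressions, margin):
    """Pairs (Di, Dj) with Deviation(Di, ri) - Deviation(Dj, rj) > margin."""
    dev = deviations(impressions)
    return [(di, dj)
            for (di, _ri), dev_i in dev.items()
            for (dj, _rj), dev_j in dev.items()
            if di != dj and dev_i - dev_j > margin]

def rows(doc, rank, shown, clicks):
    """Toy-data helper: `shown` impressions of which the first `clicks` were clicked."""
    return [(doc, rank, i < clicks) for i in range(shown)]

# Made-up impressions: A and C shown at rank 1, B and D at rank 2
impressions = (rows("A", 1, 4, 2) + rows("C", 1, 4, 4)
               + rows("B", 2, 4, 3) + rows("D", 2, 4, 1))
print(deviations(impressions))
print(cdiff_preferences(impressions, margin=0.4))
```

Because the deviation subtracts the rank's expected rate, a document clicked often at a low position can be preferred over one clicked equally often at the top.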
VII. A-B Testing
What is A-B testing? Wikipedia defines it as follows:
A/B testing is a methodology in advertising of using randomized experiments with two variants, A and B, which are the control and treatment in the controlled experiment. Such experiments are commonly used in web development and marketing, as well as in more traditional forms of advertising.
A / B Testing
– Suppose we have a search engine that people use already
– Mix results from baseline algorithm and new algorithm
– Monitor which results people click
• Abandonment rate: % of queries that had no clicks
• Reformulation rate: % of queries followed by another query
• Queries per session
• Clicks per query
• Maximum Reciprocal Rank: 1 / highest rank of a clicked doc
• Mean Reciprocal Rank (MRR): mean of 1 / rank of clicked doc
• Time to first click
• Time to last click
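Several of the click metrics listed above are straightforward to compute once the log is reduced to the clicked ranks per query. The sketch below assumes a non-empty log represented as a list with one entry per query, each entry being the list of 1-based ranks that were clicked (empty if nothing was clicked); the representation and function names are illustrative only.

```python
def abandonment_rate(clicked_ranks_per_query):
    """Fraction of queries that had no clicks."""
    abandoned = sum(1 for ranks in clicked_ranks_per_query if not ranks)
    return abandoned / len(clicked_ranks_per_query)

def clicks_per_query(clicked_ranks_per_query):
    """Average number of clicks per query."""
    total = sum(len(ranks) for ranks in clicked_ranks_per_query)
    return total / len(clicked_ranks_per_query)

def max_reciprocal_rank(ranks):
    """1 / highest (best, i.e. numerically smallest) rank of a clicked doc."""
    return 1 / min(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1 / rank over the clicked docs of one query."""
    return sum(1 / rank for rank in ranks) / len(ranks)

# Made-up log: four queries, two of them abandoned
log = [[1], [], [2, 4], []]
print(abandonment_rate(log))            # share of queries without clicks
print(clicks_per_query(log))            # average clicks per query
print(max_reciprocal_rank([2, 4]))      # best clicked rank is 2
print(mean_reciprocal_rank([2, 4]))     # average of 1/2 and 1/4
```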
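In A-B testing, every incoming query is run through both method A and method B to produce two result lists. The two lists are then mixed and returned to the user, and we measure which method's results users prefer.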
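The text does not say how the two lists are mixed. One common choice is team-draft interleaving, used here purely as an assumed example: the two methods take turns picking their best not-yet-shown document, and each click is credited to the method that contributed the clicked document. Function names are hypothetical.

```python
import random

def team_draft_interleave(results_a, results_b, rng=None):
    """Mix two ranked lists; return the mixed list and each doc's team."""
    rng = rng or random.Random()
    interleaved, team = [], {}
    count_a = count_b = 0
    while True:
        # Best document from each list that has not been placed yet
        next_a = next((d for d in results_a if d not in team), None)
        next_b = next((d for d in results_b if d not in team), None)
        if next_a is None and next_b is None:
            break
        # The team with fewer picks goes next; ties are broken by a coin flip
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        if next_b is None or (a_turn and next_a is not None):
            doc, label = next_a, "A"
            count_a += 1
        else:
            doc, label = next_b, "B"
            count_b += 1
        interleaved.append(doc)
        team[doc] = label
    return interleaved, team

def credit_clicks(team, clicked_docs):
    """Count clicks per method based on which team supplied each doc."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins

mixed, team = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"],
                                    rng=random.Random(0))
print(mixed, team)
print(credit_clicks(team, clicked_docs=[mixed[0]]))
```

Aggregated over many queries, the method with more credited clicks is the one users appear to prefer.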
VIII. Determining user intent
1. Expand
2. Filter
3. Cluster
4. Estimate the popularity: compute a score for each intent group
5. Name the intent group: Use its highest-scoring query
IX. Personalization
Training data
P (qi, tj) = % of documents for qi that were clicked & about topic tj
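A possible reading of this formula is: among the documents clicked for query qi, the share that are about topic tj. The sketch below follows that interpretation, which is an assumption; the click log format, the `doc_topics` mapping, and the function name are hypothetical.

```python
from collections import defaultdict

def topic_click_distribution(clicks, doc_topics):
    """P(q, t): share of clicked documents for query q that are about topic t.

    clicks is a list of (query, doc) pairs; doc_topics maps a doc to
    the list of topics it is about.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for query, doc in clicks:
        totals[query] += 1
        for topic in doc_topics.get(doc, []):
            counts[query][topic] += 1
    return {query: {topic: count / totals[query]
                    for topic, count in topics.items()}
            for query, topics in counts.items()}

# Made-up training data for the ambiguous query "apple"
clicks = [("apple", "d1"), ("apple", "d2"), ("apple", "d3"), ("apple", "d1")]
doc_topics = {"d1": ["computers"], "d2": ["fruit"], "d3": ["computers"]}
print(topic_click_distribution(clicks, doc_topics))
```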