MIT自然语言处理第三讲：概率语言模型（第四部分）

自然语言处理：概率语言模型
Natural Language Processing: Probabilistic Language Modeling
作者：Regina Barzilay（MIT,EECS Department, November 15, 2004)
译者：我爱自然语言处理（www.52nlp.cn ，2009年1月20日）

四、平滑算法
a) 最大似然估计（Maximum Likelihood Estimate）
　i. MLE使训练数据尽可能的“大”（MLE makes training data as probable as possible）：
　　　P_{ML}(w_{i}/{w_{i-1},w_{i-2}}) = {Count(w_{i-2},w_{i-1},w_{i})}/{Count(w_{i-2},w_{i-1})}
　　1. 对于词汇规模为N的语料库，我们将要在模型中得到N^{3}的参数（For vocabulary of size N, we will have N3 parameters in the model）
　　2. 对于N=1000，我们将要估计1000^{3}=10^{9}个参数（For N =1, 000, we have to estimate1, 000^{3}=10^{9} parameters）
　　3. 问题（Problem）: 如何处理未登录词（how to deal with unseen words）?
　ii. 数据稀疏问题（Sparsity）
　　1. 未知事件的总计概率构成了测试数据的很大一部分（The aggregate probability of unseen events constitutes a large fraction of the test data）
　　2. Brown et al (1992): 考虑一个3.5亿词的英语语料库，14%的三元词是未知的（considered a 350 million word corpus of English, 14% of trigrams are unseen）
　iii. 注：关于MLE的简要补充
　　1. 最大似然估计是一种统计方法，它用来求一个样本集的相关概率密度函数的参数。这个方法最早是遗传学家以及统计学家罗纳德•费雪爵士在1912年至1922年间开始使用的。
　　2. “似然”是对likelihood 的一种较为贴近文言文的翻译，“似然”用现代的中文来说即“可能性”。故而，若称之为“最大可能性估计”则更加通俗易懂。
　　3.MLE选择的参数使训练语料具有最高的概率，它没有浪费任何概率在训练语料中没有出现的事件中
　　4.但是MLE概率模型通常不适合做NLP的统计语言模型，会出现0概率，这是不允许的。
b) 如何估计未知元素的概率（How to estimate probability of unseen elements）?
　i. 打折（Discounting）
　　1. Laplace加1平滑（Laplace）
　　2. Good-Turing打折法（Good-Turing）
　ii. 线性插值法（Linear Interpolation）
　iii. Katz回退（Katz Back-Off）
c) 加一(Laplace)平滑（Add-One (Laplace) Smoothing）
　i. 最简单的打折方法（Simplest discounting technique）:
　　　{P(w_{i}/w_{i-1})} = {C(w_{i-1},w_{i})+1}/{C(w_{i-1})+V}
　　这里Ｖ是词汇表的数目——语料库的“型”（where |ν| is a vocabulary size）
　　注：MIT课件这里似乎有误，我已修改
　ii. 贝叶斯估计假设事件发生前是一个均匀分布（Bayesian estimator assuming a uniform unit prior on events）
　iii. 问题（Problem）: 对于未知事件占去的概率太多了（Too much probability mass to unseen events）
　iv. 例子（Example）：
　　假设V=10000(词型)，S=1000000(词例)（Assume |ν| =10, 000, and S=1, 000, 000）：
　　　P_{MLE}(ball/{kike~a}) = {{Count(kike~a~ball)}/{Count(kick~a)}} = 9/10 = 0.9
　　　P_{+1}(ball/{kike~a}) = {{Count(kike~a~ball)+1}/{Count(kick~a)+V}} = {9+1}/{10+10000} = 9*10^{-4}
　v. Laplace的缺点（Weaknesses of Laplace）
　　1. 对于稀疏分布，Laplace法则赋予未知事件太多的概率空间（For Sparse distribution, Laplace’s Law gives too much of the probability space to unseen events）
　　2. 在预测二元语法的实际概率时与其他平滑方法相比显得非常差（Worst at predicting the actual probabilities of bigrams than other methods）
　　3. 使用加epsilon平滑更合理一些（More reasonable to use add-epsilonsmoothing (Lidstone’s Law)）

附：课程及课件pdf下载MIT英文网页地址：
　　　http://people.csail.mit.edu/regina/6881/