MIT自然语言处理第三讲：概率语言模型（第一部分）

自然语言处理：概率语言模型
Natural Language Processing: Probabilistic Language Modeling
作者：Regina Barzilay（MIT,EECS Department, November 15, 2004)
译者：我爱自然语言处理（www.52nlp.cn ，2009年1月16日）

上一讲主要内容回顾（Last time）
　　语料库处理（Corpora processing）
　　齐夫定律（Zipf’s law）
　　数据稀疏问题（Data sparseness）
本讲主要内容（Today）：
　　概率语言模型（Probabilistic language Modeling）

一、简单介绍
a) 预测字符串概率（Predicting String Probabilities）
　i. 那一个字符串更有可能或者更符合语法Which string is more likely? (Which string is more grammatical?)
　　1. Grill doctoral candidates.
　　2. Grill doctoral updates.
　　(example from Lee 1997)
　ii. 向字符串赋予概率的方法被称之为语言模型（Methods for assigning probabilities to strings are called Language Models.）
b) 动机（Motivation）
　i. 语音识别，拼写检查，光学字符识别和其他应用领域（Speech recognition, spelling correction, optical character recognition and other applications）
　ii. 让E作为物证（？不确定翻译），我们需要决定字符串W是否是有E编码而得到的消息（Let E be physical evidence, and we need to determine whether the string W is the message encoded by E）
　iii. 使用贝叶斯规则（Use Bayes Rule）：
　　　　P(W/E)={P_{LM}(W)P(E/W)}/{P(E)}　
　其中P_{LM}(W)是语言模型概率(where P_{LM}(W)is language model probability)
　iv. P_{LM}(W)提供了必要的消歧信息(P_{LM}(W)provides the information necessary for isambiguation (esp. when the physical evidence is not sufficient for disambiguation))
c) 如何计算（How to Compute it）?
　i. 朴素方法（Naive approach）:
　　1. 使用最大似然估计（Use the maximum likelihood estimates (MLE)）——字符串在语料库S中存在次数的值由语料库规模归一化（the number of times that the string occurs in the corpus S, normalized by the corpus size）：
P_{MLE}(Grill~doctorate~candidates)={count(Grill~doctorate~candidates)}/delim{|}{S}{|}
　　2. 对于未知事件，最大似然估计P_{MLE}=0（For unseen events, P_{MLE}=0）
　　——数据稀疏问题比较“可怕”（Dreadful behavior in the presence of Data Sparseness）
d) 两个著名的句子（Two Famous Sentences）
　i. “It is fair to assume that neither sentence
　　“Colorless green ideas sleep furiously”
　　nor
　　“Furiously sleep ideas green colorless”
　　… has ever occurred … Hence, in any statistical model … these　sentences will be ruled out on identical grounds as equally “remote” from English. Yet (1), though nonsensical, is grammatical, while (2) is not.” [Chomsky 1957]
　ii. 注：这是乔姆斯基《句法结构》第9页上的：下面的两句话从来没有在一段英语谈话中出现过，从统计角度看离英语同样的“遥远”，但只有句1是合乎语法的：
　　1) Colorless green ideas sleep furiously.
　　2) Furiously sleep ideas sleep green colorless .
　　“从来没有在一段英语谈话中出现过”、“从统计角度看离英语同样的‘遥远’”要看从哪个角度去看了，如果抛开具体的词汇、从形类角度看，恐怕句1的统计频率要高于句2而且在英语中出现过。

附：课程及课件pdf下载MIT英文网页地址：
　　　http://people.csail.mit.edu/regina/6881/