The ARPA n-gram format looks like this:
\data\
ngram 1=64000
ngram 2=522530
ngram 3=173445
\1-grams:
-5.24036 'cause -0.2084827
-4.675221 'em -0.221857
-4.989297 'n -0.05809768
-5.365303 'til -0.1855581
-2.111539 </s> 0.0
-99 <s> -0.7736475
-1.128404 <unk> -0.8049794
-2.271447 a -0.6163939
-5.174762 a's -0.03869072
-3.384722 a. -0.1877073
-5.789208 a.'s 0.0
-6.000091 aachen 0.0
-4.707208 aaron -0.2046838
-5.580914 aaron's -0.06230035
-5.789208 aarons -0.07077657
-5.881973 aaronson -0.2173971
For a detailed description, see: ARPA n-gram language model format
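As an illustration of the line layout in the sample above (log probability, then the n words, then an optional back-off weight), here is a minimal sketch of a line parser. The names `ParsedNgram` and `parse_ngram_line` are hypothetical and are not taken from any ARPA reader; the sketch keeps words as strings and assumes whitespace-separated fields.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for one parsed line. It keeps the words as strings,
// not as vocabulary indices.
struct ParsedNgram {
    double log_prob = 0.0;           // first column of the line
    std::vector<std::string> words;  // the n words of the n-gram
    double log_bo = 0.0;             // optional last column, 0.0 when absent
};

// Parse one line of an "n-grams:" section, e.g. "-4.707208 aaron -0.2046838"
// with n = 1. Returns false if the line has fewer than 1 + n fields.
bool parse_ngram_line(const std::string& line, int n, ParsedNgram& out) {
    std::istringstream in(line);
    ParsedNgram tmp;
    if (!(in >> tmp.log_prob)) return false;
    for (int i = 0; i < n; ++i) {
        std::string w;
        if (!(in >> w)) return false;
        tmp.words.push_back(w);
    }
    // The back-off weight may be missing (highest-order n-grams omit it).
    if (!(in >> tmp.log_bo)) tmp.log_bo = 0.0;
    out = tmp;
    return true;
}
```

Calling `parse_ngram_line("-4.707208 aaron -0.2046838", 1, e)` would fill `e.words` with the single word `aaron`; turning that word into an index is left to the vocabulary.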
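An ARPA LM as a whole is made up of many n-gram entries. The data structures for a single entry and for the whole model are described in turn below.
1. The n-gram data structure
The data structure of a single n-gram entry is: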
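typedef struct
{
    real log_prob ; // log probability of the n-gram
    real log_bo ;   // log back-off weight of the n-gram
    int *words ;    // indices of the words that make up the n-gram
} ARPALMEntry ;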
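words: the words that make up this n-gram. A 1-gram has only one word; for a 2-gram, words holds the indices of its two words.
log_bo: the log back-off weight of the n-gram (the last column of an ARPA line).
log_prob: the log probability of the n-gram (the first column of an ARPA line), that is, the probability of its last word given the preceding words.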
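To show how log_prob and log_bo work together, here is a minimal sketch of the standard back-off rule for a bigram: use the bigram's log_prob if the bigram is listed, otherwise add the back-off weight of the first word to the unigram log_prob of the second. Everything beyond the `ARPALMEntry` layout is an assumption made for illustration: `real` is taken to be `float`, the helpers `find_entry` and `bigram_log_prob` are hypothetical, lookup is a plain linear search, and -99 is used as a floor for words with no unigram entry.

```cpp
typedef float real;  // assumption: the text does not show how `real` is defined

typedef struct {
    real log_prob;
    real log_bo;
    int *words;
} ARPALMEntry;

// Linear search for an n-gram of length n; returns nullptr if it is absent.
const ARPALMEntry* find_entry(const ARPALMEntry* entries, int n_entries,
                              const int* words, int n) {
    for (int i = 0; i < n_entries; ++i) {
        bool same = true;
        for (int j = 0; j < n; ++j) {
            if (entries[i].words[j] != words[j]) { same = false; break; }
        }
        if (same) return &entries[i];
    }
    return nullptr;
}

// log P(w2 | w1): the bigram's own log_prob when it is listed,
// otherwise log_bo(w1) + log_prob(w2).
real bigram_log_prob(const ARPALMEntry* unigrams, int n_uni,
                     const ARPALMEntry* bigrams, int n_bi,
                     int w1, int w2) {
    int pair[2] = {w1, w2};
    if (const ARPALMEntry* e = find_entry(bigrams, n_bi, pair, 2)) {
        return e->log_prob;
    }
    const ARPALMEntry* u1 = find_entry(unigrams, n_uni, &w1, 1);
    const ARPALMEntry* u2 = find_entry(unigrams, n_uni, &w2, 1);
    if (u2 == nullptr) return -99.0f;   // assumed floor for an unseen word
    real bo = u1 ? u1->log_bo : 0.0f;   // no back-off weight means log weight 0
    return bo + u2->log_prob;
}
```

A real decoder would index the entries (for example by sorting and binary search) instead of scanning them, but the arithmetic on the two fields stays the same.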
2. The ARPA-LM data structure
The data structure of the whole n-gram language model, which is built from many such entries, is:
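class ARPALM
{
public:
    Vocabulary *vocab ;
    int order ;
    ARPALMEntry **entries ; // all entries of the language model; entries[i] is the array of (i+1)-grams
    int *n_ngrams ; // n_ngrams[i] is the number of (i+1)-gram entries
    char *unk_wrd ; // word for vocabulary entries that are not in the language model
    int unk_id ; // ID of that word; set to the last index of the vocabulary
    int n_unk_words ;
    int *unk_words ;
private:
    bool *words_in_lm ; // boolean array marking whether each vocabulary word is in the language model
} ;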
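vocab: pointer to the vocabulary used to build the language model. For the vocabulary definition, see: In-memory storage model of the vocabulary
entries: all n-gram entries of the language model, as a two-dimensional array of ARPALMEntry. entries[0] stores the 1-grams, entries[1] stores the 2-grams, and so on.
n_ngrams: an integer array holding, in order, the number of 1-gram, 2-gram, 3-gram, ... entries.
unk_wrd: the word that stands for vocabulary words which are allowed to be absent from the language model.
unk_id: the ID of that word; it is set to the last word index of the vocabulary.
n_unk_words: counted after the language model has been read; the number of words that are in the vocabulary but were not used to build the language model. If unk_wrd is not specified, such words are not allowed, meaning every word in the vocabulary must appear in the language model.
unk_words: stores the indices of the words counted by n_unk_words.
words_in_lm: marks, for each word in the vocabulary, whether it appears in the language model.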
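The bookkeeping behind words_in_lm, unk_words and n_unk_words can be sketched as follows. This is only an illustration under stated assumptions: `UnkInfo`, `find_unk_words` and `unk_words_allowed` are hypothetical names, `std::vector` replaces the raw arrays of the class, and the 1-gram word indices are passed in as a plain list instead of being read from entries[0].

```cpp
#include <vector>

// Hypothetical result type mirroring three fields of the class above.
struct UnkInfo {
    std::vector<bool> words_in_lm;  // one flag per vocabulary word
    std::vector<int> unk_words;     // indices of vocabulary words missing from the LM
    int n_unk_words = 0;            // always equal to unk_words.size()
};

// n_vocab_words: size of the vocabulary.
// unigram_word_ids: words[0] of every 1-gram entry (that is, of entries[0]).
UnkInfo find_unk_words(int n_vocab_words,
                       const std::vector<int>& unigram_word_ids) {
    UnkInfo info;
    info.words_in_lm.assign(n_vocab_words, false);
    // Mark every vocabulary word that has a 1-gram entry.
    for (int id : unigram_word_ids) {
        if (id >= 0 && id < n_vocab_words) info.words_in_lm[id] = true;
    }
    // Collect the vocabulary words that were never marked.
    for (int w = 0; w < n_vocab_words; ++w) {
        if (!info.words_in_lm[w]) info.unk_words.push_back(w);
    }
    info.n_unk_words = static_cast<int>(info.unk_words.size());
    return info;
}

// Mirrors the rule in the text: missing words are acceptable only when
// an unk_wrd was specified.
bool unk_words_allowed(const UnkInfo& info, const char* unk_wrd) {
    return info.n_unk_words == 0 || unk_wrd != nullptr;
}
```

With a vocabulary of 4 words and 1-grams for words 0 and 2 only, `find_unk_words(4, {0, 2})` reports words 1 and 3 as missing, which `unk_words_allowed` accepts only if an unk_wrd was given.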