  • N-Gram Data Structures

    The ARPA n-gram file format looks like this:

    \data\
    ngram 1=64000
    ngram 2=522530
    ngram 3=173445

    \1-grams:
    -5.24036        'cause  -0.2084827  
    -4.675221       'em     -0.221857  
    -4.989297       'n      -0.05809768  
    -5.365303       'til    -0.1855581  
    -2.111539       </s>    0.0  
    -99     <s>     -0.7736475  
    -1.128404       <unk>   -0.8049794  
    -2.271447       a       -0.6163939  
    -5.174762       a's     -0.03869072  
    -3.384722       a.      -0.1877073  
    -5.789208       a.'s    0.0  
    -6.000091       aachen  0.0  
    -4.707208       aaron   -0.2046838  
    -5.580914       aaron's -0.06230035  
    -5.789208       aarons  -0.07077657  
    -5.881973       aaronson        -0.2173971  
    

    For details, see: ARPA n-gram language model format
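
As the sample shows, each entry line holds a log10 probability, the n words, and an optional log10 back-off weight. The following is a minimal sketch of reading one such line, not the original author's parser. `ParsedNGram` and `parse_ngram_line` are hypothetical names introduced for illustration, and section headers such as `\data\` or `\1-grams:` are assumed to be handled by the caller.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for one parsed line; not part of the original code.
struct ParsedNGram {
    double log_prob = 0.0;           // first column: log10 probability
    double log_bo = 0.0;             // last column: log10 back-off weight (0 if absent)
    std::vector<std::string> words;  // the n words between the two numbers
};

// Parse one entry line of an n-gram section, e.g. "-2.271447  a  -0.6163939".
// `order` is n (1 for 1-grams, 2 for 2-grams, ...).
// Returns false if the line is not a probability followed by n words.
bool parse_ngram_line(const std::string &line, int order, ParsedNGram &out)
{
    assert(order >= 1);
    std::istringstream in(line);
    ParsedNGram tmp;
    if (!(in >> tmp.log_prob)) return false;
    for (int i = 0; i < order; ++i) {
        std::string w;
        if (!(in >> w)) return false;
        tmp.words.push_back(w);
    }
    // The back-off weight is optional (the highest order normally omits it).
    if (!(in >> tmp.log_bo)) tmp.log_bo = 0.0;
    out = tmp;
    return true;
}
```

A caller would first read the `ngram N=count` lines to learn how many entries to expect per order, then call `parse_ngram_line` with the order of the current section.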

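    A complete ARPA LM is made up of many n-gram entries. The data structures for a single entry and for the whole model are described in turn below.

    1. The n-gram data structure

    A single n-gram entry is defined as follows: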

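    typedef struct
    {
        real        log_prob ;  // log probability of the n-gram
        real        log_bo ;    // log back-off weight of the n-gram
        int         *words ;    // vocabulary indices of the n words
    } ARPALMEntry ;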
    

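    words: the words that make up the current n-gram. A 1-gram has a single word; for a 2-gram, words holds the indices of both words.
    log_bo: the back-off weight of the n-gram.
    log_prob: the probability of the n-gram. In an ARPA file this is the base-10 log of the conditional probability of the last word given the preceding words.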
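
To show how log_prob and log_bo work together, here is a hedged sketch of the standard back-off lookup used with ARPA models: if the full n-gram is missing, add the back-off weight of its context and retry with a shorter n-gram. `Entry`, `NGramTable` and `backoff_log_prob` are hypothetical names, the `std::map` is a stand-in for whatever lookup the real code uses, and the -99 floor for unseen unigrams is an assumption.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Simplified stand-in for ARPALMEntry: values only, keyed by word ids below.
struct Entry {
    double log_prob;  // log10 probability of the n-gram
    double log_bo;    // log10 back-off weight
};

// Hypothetical lookup table: word-id sequence -> entry, all orders mixed.
typedef std::map<std::vector<int>, Entry> NGramTable;

// log10 P(last word | preceding words) with standard back-off.
// `ngram` holds the word ids, context first, predicted word last.
double backoff_log_prob(const NGramTable &lm, const std::vector<int> &ngram)
{
    assert(!ngram.empty());
    NGramTable::const_iterator hit = lm.find(ngram);
    if (hit != lm.end()) return hit->second.log_prob;
    if (ngram.size() == 1) return -99.0;  // assumed floor for unseen unigrams

    // Back-off weight of the context (all words but the last); 0 if unseen.
    std::vector<int> context(ngram.begin(), ngram.end() - 1);
    NGramTable::const_iterator ctx = lm.find(context);
    double bo = (ctx != lm.end()) ? ctx->second.log_bo : 0.0;

    // Drop the oldest context word and retry with the shorter n-gram.
    std::vector<int> shorter(ngram.begin() + 1, ngram.end());
    return bo + backoff_log_prob(lm, shorter);
}
```

Because the values are logarithms, the back-off weight is added rather than multiplied.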

    2. The ARPA-LM data structure

    The whole n-gram language model, made up of many entries, is defined as follows:

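    class ARPALM
    {
        public:
            Vocabulary *vocab ;

            int            order ;       // highest n-gram order in the model
            ARPALMEntry    **entries ;   // all entries of the LM: one array per order
            int            *n_ngrams ;   // n_ngrams[i] is the number of (i+1)-gram entries

            char           *unk_wrd ;    // the vocabulary word that is not in the LM
            int            unk_id ;      // ID of that word, set to the last index in the vocabulary

            int            n_unk_words ; // number of vocabulary words not found in the LM
            int            *unk_words ;  // vocabulary indices of those words
        private:
            bool           *words_in_lm ; // boolean array: whether each word is in the LM
    } ;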
    

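    vocab: pointer to the vocabulary used to build the language model. For the vocabulary definition, see: In-memory storage model of the vocabulary
    entries: all n-gram entries of the language model, as a two-dimensional array of ARPALMEntry. entries[0] stores the 1-grams, entries[1] stores the 2-grams, and so on.
    n_ngrams: an integer array holding, in order, the number of 1-gram, 2-gram, 3-gram, ... entries.
    unk_wrd: the vocabulary word that is allowed to be absent from the language model.
    unk_id: the ID of that word, set to the last word index in the vocabulary.
    n_unk_words: counted after the language model has been read; the number of words that are in the vocabulary but were not used to build the language model. If unk_wrd is not specified, such words are not allowed, meaning every word in the vocabulary must have been used to build the language model.
    unk_words: the indices of the words counted by n_unk_words.
    words_in_lm: marks, for each word in the vocabulary, whether it appears in the language model.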
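
The relationship between words_in_lm, unk_words and n_unk_words can be sketched roughly as follows. This is an illustration under assumptions, not the original implementation: `UnkInfo` and `find_unk_words` are hypothetical names, `std::vector` replaces the raw arrays, and the 1-gram word ids are assumed to have been collected already (in ARPALM terms, `entries[0][i].words[0]`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type mirroring the unk_words / n_unk_words /
// words_in_lm members described above, with std::vector instead of raw arrays.
struct UnkInfo {
    std::vector<bool> words_in_lm;  // words_in_lm[id] is true if id has a 1-gram
    std::vector<int>  unk_words;    // ids of vocabulary words missing from the LM
    int n_unk_words() const { return static_cast<int>(unk_words.size()); }
};

// `n_vocab` is the vocabulary size; `unigram_ids` lists the word id of every
// 1-gram entry that was read from the LM file.
UnkInfo find_unk_words(int n_vocab, const std::vector<int> &unigram_ids)
{
    assert(n_vocab >= 0);
    UnkInfo info;
    info.words_in_lm.assign(n_vocab, false);
    // Mark every vocabulary word that has a 1-gram entry.
    for (std::size_t i = 0; i < unigram_ids.size(); ++i) {
        int id = unigram_ids[i];
        if (id >= 0 && id < n_vocab) info.words_in_lm[id] = true;
    }
    // Everything left unmarked is in the vocabulary but not in the LM.
    for (int id = 0; id < n_vocab; ++id)
        if (!info.words_in_lm[id]) info.unk_words.push_back(id);
    return info;
}
```

Following the description above, a loader could treat `n_unk_words() > 0` as an error whenever no unk_wrd was specified.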

  • Original article: https://www.cnblogs.com/jonky/p/10154115.html