  • Faster-rnnlm code analysis 1

    https://github.com/yandex/faster-rnnlm

       

    gdb ./rnnlm

    r -rnnlm model-good.faster -train thread.title.good.train.txt -valid thread.title.good.valid.txt -hidden 5 -direct-order 3 -direct 200 -bptt 4 -bptt-block 10 -threads 1

    [root@cq01-forum-rstree01.cq01.baidu.com faster-rnnlm]# more thread.title.good.train.txt

    唉        稳        凉菜        干货        批发        稳        左成        个        月        都

    咦        丢        图        跑

    毕竟        新人

    我        想去旅行

    昨天        玩        个        满        深渊        人马                 才        踩        了        55

    这        状态        还        不如        温网

    新型        投资项目

    晒        早饭        就        酱

    渣土        哥        真是        太        放肆        了

    推荐        就是        有        这样        的

    白素贞        水        漫        文水        城

    我知道        那些夏天        就像        你        一样        回        不

    渑池        至        洛阳        最早        的        车        几        点        哪里        坐        到        洛阳        几点

    宏观        方面        大        的        流动性        格局        虽无        明显        变化        但        眼下        地方        政府        债务        限

    电工        行业        竞争        大        锦力        电器        有        优势

    兄弟        啊                 影技        1        班        q        群        是        多少

    你们        家乡        话        叫        什么

    深深        的        孤独感        与        挫败        感        感觉        个人

    一起去        旅游        吧

    谁知道        四会        那里        有        修        打火机        的

    [root@cq01-forum-rstree01.cq01.baidu.com faster-rnnlm]# pwd

    /home/users/chenghuige/other/faster-rnnlm.debug/faster-rnnlm

    1. Count word frequencies and build the vocabulary

    void Vocabulary::BuildFromCorpus(const std::string& fpath, bool show_progress)

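    First, a </s> token is added:

    AddWord(kEOSTag); // it simply gets index 0

       

    Then every line of the corpus is added in turn.

    Each line is split into words with IsSpace: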

    inline bool IsSpace(char c) {

    return c == ' ' || c == '\t' || c == '\r' || c == '\n';

    }
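As a rough illustration of splitting a line on IsSpace, here is a minimal sketch. The whitespace set (space, tab, carriage return, newline) and the helper SplitLine are assumptions made for this example; they are not copied from the faster-rnnlm source.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumed whitespace set; the real IsSpace in faster-rnnlm may differ.
inline bool IsSpace(char c) {
  return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Hypothetical helper: split one line into words on runs of whitespace.
std::vector<std::string> SplitLine(const std::string& line) {
  std::vector<std::string> words;
  std::string current;
  for (char c : line) {
    if (IsSpace(c)) {
      // A whitespace character ends the current word, if there is one.
      if (!current.empty()) {
        words.push_back(current);
        current.clear();
      }
    } else {
      current.push_back(c);
    }
  }
  if (!current.empty()) words.push_back(current);  // last word on the line
  return words;
}
```

Because the split works on single bytes, multi-byte encoded Chinese words such as those in the training file pass through untouched as long as none of their bytes equals one of the four whitespace characters.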

       

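    After that, each word is simply numbered in order of first appearance, much as Identifer.h does; a word that never appeared is called OOV and gets index -1.

       

    Besides the index, the frequency of each word is counted as well.

    Finally, the words are sorted by frequency in descending order and renumbered, so the most frequent entry, here </s>, ends up with index 0: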

    (gdb) p words_

    $20 = std::vector of length 176788, capacity 262144 = {{freq = 900000, word = 0x6ae1c0 "</s>"}, {

    freq = 258246, word = 0x6aef20 "\265\304"}, {freq = 126910, word = 0x6aeff0 "\301\313"}, {

    freq = 101904, word = 0x6aedc0 "\316\322"}, {freq = 67328, word = 0x6aeee0 "\323\320"}, {

    freq = 62290, word = 0x6aec10 "\270\366"}, {freq = 60866, word = 0x6afb20 "\322\273"}, {

       

    [root@cq01-forum-rstree01.cq01.baidu.com faster-rnnlm]# wc -l thread.title.good.train.txt

    900000 thread.title.good.train.txt
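The vocabulary steps above (add </s> first, number words as they appear, count frequencies, sort in descending order, renumber, return -1 for OOV) can be sketched as follows. Note that the freq of </s> in the gdb dump equals the line count reported by wc -l, which is what one would expect if one </s> is counted per line. Every name in the sketch (ToyVocab, Build, Lookup, kToyOOV) is hypothetical, and std::istringstream stands in for the IsSpace-based splitting; this is not the Vocabulary class from faster-rnnlm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

const int kToyOOV = -1;  // index returned for unknown words (as described above)

// Toy vocabulary; hypothetical names, not the faster-rnnlm API.
struct ToyVocab {
  struct Entry { int64_t freq; std::string word; };

  std::vector<Entry> words;                    // index -> (freq, word)
  std::unordered_map<std::string, int> index;  // word -> index

  // Return the index of `word`, adding it with frequency 0 if it is new.
  int AddWord(const std::string& word) {
    auto it = index.find(word);
    if (it != index.end()) return it->second;
    int id = static_cast<int>(words.size());
    words.push_back({0, word});
    index[word] = id;
    return id;
  }

  // Count every word of every line, plus one "</s>" per line.
  void Build(const std::vector<std::string>& lines) {
    int eos = AddWord("</s>");  // added first, so it gets index 0
    assert(eos == 0);
    for (const std::string& line : lines) {
      std::istringstream in(line);  // operator>> splits on whitespace
      std::string w;
      while (in >> w) {
        int id = AddWord(w);
        ++words[id].freq;
      }
      ++words[eos].freq;
    }
    // Sort by descending frequency (stable, so ties keep insertion order),
    // then rebuild the word -> index map to reflect the new numbering.
    std::stable_sort(words.begin(), words.end(),
                     [](const Entry& a, const Entry& b) { return a.freq > b.freq; });
    index.clear();
    for (std::size_t i = 0; i < words.size(); ++i) {
      index[words[i].word] = static_cast<int>(i);
    }
  }

  int Lookup(const std::string& word) const {
    auto it = index.find(word);
    return it == index.end() ? kToyOOV : it->second;
  }
};
```

For example, building from the three lines "a b a", "b a" and "c" gives </s> a frequency of 3 and index 0, and Lookup on a word that never occurred returns -1.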

       

    (gdb) p cfg

    $2 = {layer_size = 5, layer_count = 1, maxent_hash_size = 199947228, maxent_order = 3, use_nce = false, nce_lnz = 9, reverse_sentence = false, layer_type = "sigmoid"}
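The cfg dump lines up with the command line: the -hidden flag appears as layer_size = 5 and -direct-order 3 as maxent_order = 3. The value maxent_hash_size = 199947228 is numerically equal to 200 million rounded down to a multiple of the vocabulary size 176788 (the length of words_ in the earlier dump). The sketch below only checks that arithmetic. That -direct is given in millions, and that the program rounds in this way, are inferences from the numbers shown here and are not confirmed from the source; RoundDownToMultiple is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round `requested` down to a multiple of `vocab_size`.
uint64_t RoundDownToMultiple(uint64_t requested, uint64_t vocab_size) {
  assert(vocab_size > 0);
  return requested - requested % vocab_size;
}
```

RoundDownToMultiple(200ULL * 1000 * 1000, 176788) evaluates to 199947228, which is 1131 * 176788.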

       

    2. Build the network structure

    main_nnet = new NNet(vocab, cfg, use_cuda, use_cuda_memory_efficient);

    The constructor calls Init, which does the following:

       

    embeddings.resize(vocab.size(), cfg.layer_size);

    // 2-D array of shape (word_num, hidden_size)
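To make the (word_num, hidden_size) layout concrete, here is a minimal sketch of a row-major matrix with one row per word. ToyEmbeddings and its methods are hypothetical and only mimic the resize(rows, cols) call shown above; the actual matrix type used by faster-rnnlm is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix: one row of `cols` floats per word.
class ToyEmbeddings {
 public:
  void resize(std::size_t rows, std::size_t cols) {
    rows_ = rows;
    cols_ = cols;
    data_.assign(rows * cols, 0.0f);  // zero-initialised in this sketch
  }

  // Pointer to the embedding vector of the word with the given index.
  float* row(std::size_t word_index) {
    assert(word_index < rows_);
    return data_.data() + word_index * cols_;
  }

  std::size_t rows() const { return rows_; }
  std::size_t cols() const { return cols_; }

 private:
  std::size_t rows_ = 0;
  std::size_t cols_ = 0;
  std::vector<float> data_;
};
```

With the values from this run (176788 words, layer_size = 5) such a matrix would hold 176788 * 5 = 883940 floats, and looking up a word's vector is a single pointer offset.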

       

    rec_layer = CreateLayer(cfg.layer_type, cfg.layer_size, cfg.layer_count);

    // hidden layer: creates one layer; the default layer_type is sigmoid
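For readers unfamiliar with a sigmoid recurrent layer, the sketch below shows one textbook simple-recurrent step, h_new = sigmoid(x + W * h_prev), where x is the embedding of the current word. This formulation, and the names Sigmoid and RecurrentStep, are assumptions chosen for illustration; the layer that CreateLayer actually builds may be organised differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

inline float Sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Textbook simple-recurrent step: h_new = sigmoid(x + W * h_prev).
// W is a row-major (n x n) matrix. Assumed formulation, not faster-rnnlm code.
std::vector<float> RecurrentStep(const std::vector<float>& x,
                                 const std::vector<float>& W,
                                 const std::vector<float>& h_prev) {
  const std::size_t n = x.size();
  assert(h_prev.size() == n && W.size() == n * n);
  std::vector<float> h_new(n);
  for (std::size_t i = 0; i < n; ++i) {
    float sum = x[i];  // input contribution
    for (std::size_t j = 0; j < n; ++j) {
      sum += W[i * n + j] * h_prev[j];  // recurrent contribution
    }
    h_new[i] = Sigmoid(sum);
  }
  return h_new;
}
```

With layer_size = 5 as in this run, x, h_prev and h_new would each have 5 elements and W would be 5 x 5.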

       

    maxent_layer.Init(cfg.maxent_hash_size);

    // maximum entropy @TODO

       

    softmax_layer = HSTree::CreateHuffmanTree(vocab, cfg.layer_size);

    // output layer: softmax, implemented with a Huffman tree
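The reason a Huffman tree suits the output layer is that frequent words end up close to the root, so scoring them takes fewer binary decisions. The sketch below is the generic textbook Huffman construction and returns only the code length (tree depth) of each word; HuffmanCodeLengths is a hypothetical helper and says nothing about how HSTree::CreateHuffmanTree is implemented internally.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Return the Huffman code length (depth in the tree) of each word,
// given the word frequencies. Generic textbook construction.
std::vector<int> HuffmanCodeLengths(const std::vector<int64_t>& freqs) {
  const std::size_t n = freqs.size();
  assert(n >= 2);
  // Nodes 0..n-1 are leaves (words); nodes n..2n-2 are internal nodes.
  std::vector<int> parent(2 * n - 1, -1);
  using Item = std::pair<int64_t, int>;  // (subtree frequency, node id)
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
  for (std::size_t i = 0; i < n; ++i) {
    heap.push({freqs[i], static_cast<int>(i)});
  }
  int next_id = static_cast<int>(n);
  while (heap.size() > 1) {
    Item a = heap.top(); heap.pop();  // the two least frequent subtrees
    Item b = heap.top(); heap.pop();
    parent[a.second] = next_id;
    parent[b.second] = next_id;
    heap.push({a.first + b.first, next_id});
    ++next_id;
  }
  // Code length of a leaf = number of edges from the leaf up to the root.
  std::vector<int> lengths(n, 0);
  for (std::size_t i = 0; i < n; ++i) {
    for (int node = static_cast<int>(i); parent[node] != -1; node = parent[node]) {
      ++lengths[i];
    }
  }
  return lengths;
}
```

Taking only the five largest frequencies from the words_ dump (900000, 258246, 126910, 101904, 67328), the code lengths come out as 1, 2, 3, 4, 4: the more frequent the word, the shorter its path from the root.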

       

       

  • Original article: https://www.cnblogs.com/rocketfan/p/4947311.html