  • 01-NLP-04-02 Text Generation with an RNN

    We don't represent the input x with one-hot vectors, because we want to use word2vec instead.

    Each word is mapped to a vector, and the vectors are strung together into a sequence: [[w1], [w2], [w3]].
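    As a quick sketch of that idea (assuming a trained gensim Word2Vec model named w2v_model with 128-dimensional vectors, like the one built later in this notebook), looking up each word and stacking the vectors gives exactly such a sequence:

    import numpy as np

    words = ['hello', 'from', 'the']
    seq = np.array([w2v_model[w] for w in words])  # old gensim API: model[word] returns that word's vector
    print(seq.shape)  # (3, 128): 3 time steps, one 128-d vector per word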

    Text generation with an RNN

    Let's walk through a small example to see how to play with an LSTM.

    This time we don't work at the char level; we work at the word level.

    Step one, as before: import the various libraries.

    In [118]:
    import os
    import numpy as np
    import nltk
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import Dropout
    from keras.layers import LSTM
    from keras.callbacks import ModelCheckpoint
    from keras.utils import np_utils
    from gensim.models.word2vec import Word2Vec

    Next, read in the text.

    In [119]:
    raw_text = ''
    # Read in the data. Context pairs are built within sentences, so the text must be split into sentences first.
    for file in os.listdir("../input/"):
        if file.endswith(".txt"):
            raw_text += open("../input/"+file, errors='ignore').read() + ' '
    # raw_text = open('../input/Winston_Churchil.txt').read()
    raw_text = raw_text.lower()
    sentensor = nltk.data.load('tokenizers/punkt/english.pickle')
    # Split into sentences: turn the raw text into a collection of sentences
    sents = sentensor.tokenize(raw_text)
    corpus = []
    for sen in sents:
        corpus.append(nltk.word_tokenize(sen))  # turn each sentence into a list of words
    print(len(corpus))   # corpus is a 2D list: one token list per sentence
    print(corpus[:3])
    Output:
    91007
    [['ufeffthe', 'project', 'gutenberg', 'ebook', 'of', 'great', 'expectations', ',', 'by', 'charles', 'dickens', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.'], ['you', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www.gutenberg.org', 'title', ':', 'great', 'expectations', 'author', ':', 'charles', 'dickens', 'posting', 'date', ':', 'august', '20', ',', '2008', '[', 'ebook', '#', '1400', ']', 'release', 'date', ':', 'july', ',', '1998', 'last', 'updated', ':', 'september', '25', ',', '2016', 'language', ':', 'english', 'character', 'set', 'encoding', ':', 'utf-8', '***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'great', 'expectations', '***', 'produced', 'by', 'an', 'anonymous', 'volunteer', 'great', 'expectations', '[', '1867', 'edition', ']', 'by', 'charles', 'dickens', '[', 'project', 'gutenberg', 'editor’s', 'note', ':', 'there', 'is', 'also', 'another', 'version', 'of', 'this', 'work', 'etext98/grexp10.txt', 'scanned', 'from', 'a', 'different', 'edition', ']', 'chapter', 'i', 'my', 'father’s', 'family', 'name', 'being', 'pirrip', ',', 'and', 'my', 'christian', 'name', 'philip', ',', 'my', 'infant', 'tongue', 'could', 'make', 'of', 'both', 'names', 'nothing', 'longer', 'or', 'more', 'explicit', 'than', 'pip', '.'], ['so', ',', 'i', 'called', 'myself', 'pip', ',', 'and', 'came', 'to', 'be', 'called', 'pip', '.']]
    
     

    Now, toss everything into word2vec:

    In [120]:
    w2v_model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)  # the word embeddings could also come from transfer learning: just grab a decent pretrained word2vec model
    # Even if the embedding model were trained on other corpora, what we care about here is the LSTM learning the writing style, so an already-trained word model works fine.
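    If you do want the transfer-learning route mentioned in the comment, a pretrained embedding file could be loaded instead of training on this corpus. This is only a sketch: the file name is a placeholder, and on older gensim versions the loader lives on Word2Vec rather than KeyedVectors.

    from gensim.models import KeyedVectors

    # hypothetical path: substitute whatever pretrained word2vec file you actually have
    pretrained = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)
    print(pretrained['office'].shape)  # whatever its dimensionality is must then replace 128 below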
    In [121]:
    w2v_model['office']
    
    Out[121]:
    array([-0.01398709,  0.15975526,  0.03589381, -0.4449192 ,  0.365403  ,
            0.13376504,  0.78731823,  0.01640314, -0.29723561, -0.21117583,
            0.13451998, -0.65348488,  0.06038611, -0.02000343,  0.05698346,
            0.68013376,  0.19010596,  0.56921762,  0.66904438, -0.08069923,
           -0.30662233,  0.26082459, -0.74816126, -0.41383636, -0.56303871,
           -0.10834043, -0.10635001, -0.7193433 ,  0.29722607, -0.83104628,
            1.11914253, -0.34119046, -0.39490014, -0.34709939, -0.00583572,
            0.17824887,  0.43295503,  0.11827419, -0.28707108, -0.02838829,
            0.02565269,  0.10328653, -0.19100265, -0.24102989,  0.23023468,
            0.51493132,  0.34759828,  0.05510307,  0.20583512, -0.17160387,
           -0.10351282,  0.19884749, -0.03935663, -0.04055062,  0.38888735,
           -0.02003323, -0.16577065, -0.15858875,  0.45083243, -0.09268586,
           -0.91098118,  0.16775337,  0.3432925 ,  0.2103184 , -0.42439541,
            0.26097715, -0.10714807,  0.2415273 ,  0.2352251 , -0.21662289,
           -0.13343927,  0.11787982, -0.31010333,  0.21146733, -0.11726214,
           -0.65574747,  0.04007725, -0.12032496, -0.03468512,  0.11063002,
            0.33530036, -0.64098376,  0.34013858, -0.08341357, -0.54826909,
            0.0723564 , -0.05169795, -0.19633259,  0.08620321,  0.05993884,
           -0.14693044, -0.40531522, -0.07695422,  0.2279872 , -0.12342903,
           -0.1919964 , -0.09589464,  0.4433476 ,  0.38304719,  1.0319351 ,
            0.82628119,  0.3677327 ,  0.07600326,  0.08538571, -0.44261214,
           -0.10997667, -0.03823839,  0.40593523,  0.32665277, -0.67680383,
            0.32504487,  0.4009226 ,  0.23463745, -0.21442334,  0.42727917,
            0.19593567, -0.10731711, -0.01080817, -0.14738144,  0.15710345,
           -0.01099576,  0.35833639,  0.16394758, -0.10431164, -0.28202233,
            0.24488974,  0.69327635, -0.29230621], dtype=float32)
     

    Next, we process our training data the same way as before: flatten the source data into one long stream x, so the LSTM can learn to predict the next word:

    In [122]:
    raw_input = [item for sublist in corpus for item in sublist]  # flatten the 2D corpus into a 1D list of tokens
    len(raw_input)
    
    Out[122]:
    2115170
    In [123]:
    raw_input[12]
    
    Out[123]:
    'ebook'
    Drop the words that are not in the word2vec vocabulary and join the rest into one stream:

    In [124]:
    text_stream = []
    vocab = w2v_model.vocab
    for word in raw_input:
        if word in vocab:
            text_stream.append(word)
    len(text_stream)
    
    Out[124]:
    2058753
     

    Our text-prediction task here is: given the preceding words, which word comes next?

    For example: given 'hello from the other', produce 'side'.

    Constructing the training data

    We need to turn our raw text into x and y that we can train on:

    x is the preceding words, y is the next word. A toy illustration of the sliding window follows, then the real code:
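    Here is a toy sketch of the sliding window, using seq_length=3 and plain strings instead of word vectors, just to show which (given, predict) pairs get generated:

    toy_stream = ['hello', 'from', 'the', 'other', 'side']
    toy_seq_length = 3
    for i in range(len(toy_stream) - toy_seq_length):
        given = toy_stream[i:i + toy_seq_length]    # the preceding words
        predict = toy_stream[i + toy_seq_length]    # the word to predict
        print(given, '->', predict)
    # ['hello', 'from', 'the'] -> other
    # ['from', 'the', 'other'] -> side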

    In [125]:
    seq_length = 10
    x = []
    y = []
    for i in range(0, len(text_stream) - seq_length):
    
        given = text_stream[i:i + seq_length]
        predict = text_stream[i + seq_length]
        x.append(np.array([w2v_model[word] for word in given]))
        y.append(w2v_model[predict])
    
     

    Let's take a look at what the finished dataset looks like:

    In [126]:
    print(x[10])
    print(y[10])
    
     
    [[-0.02218935  0.04861801 -0.03001036 ...,  0.07096259  0.16345282
      -0.18007144]
     [ 0.1663752   0.67981642  0.36581406 ...,  1.03355932  0.94110376
      -1.02763569]
     [-0.12611888  0.75773817  0.00454156 ...,  0.80544478  2.77890372
      -1.00110698]
     ..., 
     [ 0.34167829 -0.28152692 -0.12020591 ...,  0.19967555  1.65415502
      -1.97690392]
     [-0.66742641  0.82389861 -1.22558379 ...,  0.12269551  0.30856156
       0.29964617]
     [-0.17075984  0.0066567  -0.3894183  ...,  0.23729582  0.41993639
      -0.12582727]]
    [ 0.18125793 -1.72401989 -0.13503326 -0.42429626  1.40763748 -2.16775346
      2.26685596 -2.03301549  0.42729807 -0.84830129  0.56945151  0.87243706
      3.01571465 -0.38155749 -0.99618471  1.1960727   1.93537641  0.81187075
     -0.83017075 -3.18952608  0.48388934 -0.03766865 -1.68608069 -1.84907544
     -0.95259917  0.49039507 -0.40943271  0.12804921  1.35876858  0.72395176
      1.43591952 -0.41952157  0.38778016 -0.75301784 -2.5016799  -0.85931653
     -1.39363682  0.42932403  1.77297652  0.41443667 -1.30974782 -0.08950856
     -0.15183811 -1.59824061 -1.58920395  1.03765178  2.07559252  2.79692245
      1.11855054 -0.25542653 -1.04980111 -0.86929852 -1.26279402 -1.14124119
     -1.04608357  1.97869778 -2.23650813 -2.18115139 -0.26534671  0.39432198
     -0.06398458 -1.02308178  1.43372631 -0.02581184 -0.96472031 -3.08931994
     -0.67289352  1.06766248 -1.95796657  1.40857184  0.61604798 -0.50270212
     -2.33530831  0.45953822  0.37867084 -0.56957626 -1.90680516 -0.57678169
      0.50550407 -0.30320352  0.19682285  1.88185465 -1.40448165 -0.43952951
      1.95433044  2.07346153  0.22390689 -0.95107335 -0.24579825 -0.21493609
      0.66570002 -0.59126669 -1.4761591   0.86431485  0.36701021  0.12569368
      1.65063572  2.048352    1.81440067 -1.36734581  2.41072559  1.30975604
     -0.36556485 -0.89859813  1.28804696 -2.75488496  1.5667206  -1.75327337
      0.60426879  1.77851915 -0.32698369  0.55594021  2.01069188 -0.52870172
     -0.39022744 -1.1704396   1.28902853 -0.89315164  1.41299319  0.43392688
     -2.52578211 -1.13480854 -1.05396986 -0.85470092  0.6618616   1.23047733
     -0.28597715 -2.35096407]
    
    In [127]:
    print(len(x))
    print(len(y))
    print(len(x[12]))
    print(len(x[12][0]))
    print(len(y[12]))
    
     
    2058743
    2058743
    10
    128
    128
    
    In [128]:
    x = np.reshape(x, (-1, seq_length, 128))
    y = np.reshape(y, (-1,128))
    
     

    Here we do two things:

    1. We already have a numeric (w2v) representation of the input; we reshape it into the array format the LSTM expects: [samples, time steps, features].

    2. For the output, we directly use the 128-dimensional vector (the word-vector dimensionality we chose); see the quick shape check below.
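
    A quick shape check (a small sketch over the x and y reshaped above) confirms the [samples, time steps, features] layout:

    print(x.shape)  # (2058743, 10, 128): samples, time steps, features
    print(y.shape)  # (2058743, 128): one 128-d target vector per sample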

     

    Building the model

    LSTM model construction

    In [129]:
    model = Sequential()
    model.add(LSTM(256, dropout_W=0.2, dropout_U=0.2, input_shape=(seq_length, 128)))  # add dropout on the input and recurrent connections
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='sigmoid'))
    model.compile(loss='mse', optimizer='adam')
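
    Note that dropout_W / dropout_U (and nb_epoch used below) belong to the old Keras 1.x API. If you run this on Keras 2.x, an equivalent construction would look roughly like this:

    model = Sequential()
    model.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2, input_shape=(seq_length, 128)))  # Keras 2 names for dropout_W / dropout_U
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='sigmoid'))
    model.compile(loss='mse', optimizer='adam')
    # and later: model.fit(x, y, epochs=50, batch_size=4096)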
    
     

    Training the model

    In [130]:
    model.fit(x, y, nb_epoch=50, batch_size=4096)
    
     
    Epoch 1/50
    2058743/2058743 [==============================] - 150s - loss: 0.6839   
    Epoch 2/50
    2058743/2058743 [==============================] - 150s - loss: 0.6670   
    Epoch 3/50
    2058743/2058743 [==============================] - 150s - loss: 0.6625   
    Epoch 4/50
    2058743/2058743 [==============================] - 150s - loss: 0.6598   
    Epoch 5/50
    2058743/2058743 [==============================] - 150s - loss: 0.6577   
    Epoch 6/50
    2058743/2058743 [==============================] - 150s - loss: 0.6562   
    Epoch 7/50
    2058743/2058743 [==============================] - 150s - loss: 0.6549   
    Epoch 8/50
    2058743/2058743 [==============================] - 150s - loss: 0.6537   
    Epoch 9/50
    2058743/2058743 [==============================] - 150s - loss: 0.6527   
    Epoch 10/50
    2058743/2058743 [==============================] - 150s - loss: 0.6519   
    Epoch 11/50
    2058743/2058743 [==============================] - 150s - loss: 0.6512   
    Epoch 12/50
    2058743/2058743 [==============================] - 150s - loss: 0.6506   
    Epoch 13/50
    2058743/2058743 [==============================] - 150s - loss: 0.6500   
    Epoch 14/50
    2058743/2058743 [==============================] - 150s - loss: 0.6496   
    Epoch 15/50
    2058743/2058743 [==============================] - 150s - loss: 0.6492   
    Epoch 16/50
    2058743/2058743 [==============================] - 150s - loss: 0.6488   
    Epoch 17/50
    2058743/2058743 [==============================] - 151s - loss: 0.6485   
    Epoch 18/50
    2058743/2058743 [==============================] - 150s - loss: 0.6482   
    Epoch 19/50
    2058743/2058743 [==============================] - 150s - loss: 0.6480   
    Epoch 20/50
    2058743/2058743 [==============================] - 150s - loss: 0.6477   
    Epoch 21/50
    2058743/2058743 [==============================] - 150s - loss: 0.6475   
    Epoch 22/50
    2058743/2058743 [==============================] - 150s - loss: 0.6473   
    Epoch 23/50
    2058743/2058743 [==============================] - 150s - loss: 0.6471   
    Epoch 24/50
    2058743/2058743 [==============================] - 150s - loss: 0.6470   
    Epoch 25/50
    2058743/2058743 [==============================] - 150s - loss: 0.6468   
    Epoch 26/50
    2058743/2058743 [==============================] - 150s - loss: 0.6466   
    Epoch 27/50
    2058743/2058743 [==============================] - 150s - loss: 0.6464   
    Epoch 28/50
    2058743/2058743 [==============================] - 150s - loss: 0.6463   
    Epoch 29/50
    2058743/2058743 [==============================] - 150s - loss: 0.6462   
    Epoch 30/50
    2058743/2058743 [==============================] - 150s - loss: 0.6461   
    Epoch 31/50
    2058743/2058743 [==============================] - 150s - loss: 0.6460   
    Epoch 32/50
    2058743/2058743 [==============================] - 150s - loss: 0.6458   
    Epoch 33/50
    2058743/2058743 [==============================] - 150s - loss: 0.6458   
    Epoch 34/50
    2058743/2058743 [==============================] - 150s - loss: 0.6456   
    Epoch 35/50
    2058743/2058743 [==============================] - 150s - loss: 0.6456   
    Epoch 36/50
    2058743/2058743 [==============================] - 150s - loss: 0.6455   
    Epoch 37/50
    2058743/2058743 [==============================] - 150s - loss: 0.6454   
    Epoch 38/50
    2058743/2058743 [==============================] - 150s - loss: 0.6453   
    Epoch 39/50
    2058743/2058743 [==============================] - 150s - loss: 0.6452   
    Epoch 40/50
    2058743/2058743 [==============================] - 150s - loss: 0.6452   
    Epoch 41/50
    2058743/2058743 [==============================] - 150s - loss: 0.6451   
    Epoch 42/50
    2058743/2058743 [==============================] - 150s - loss: 0.6450   
    Epoch 43/50
    2058743/2058743 [==============================] - 150s - loss: 0.6450   
    Epoch 44/50
    2058743/2058743 [==============================] - 150s - loss: 0.6449   
    Epoch 45/50
    2058743/2058743 [==============================] - 150s - loss: 0.6448   
    Epoch 46/50
    2058743/2058743 [==============================] - 150s - loss: 0.6447   
    Epoch 47/50
    2058743/2058743 [==============================] - 150s - loss: 0.6447   
    Epoch 48/50
    2058743/2058743 [==============================] - 150s - loss: 0.6446   
    Epoch 49/50
    2058743/2058743 [==============================] - 150s - loss: 0.6446   
    Epoch 50/50
    2058743/2058743 [==============================] - 150s - loss: 0.6445   
    
    Out[130]:
    <keras.callbacks.History at 0x7f6ed8816a58>
     

    Let's write a small test routine to see how well the trained LSTM works:

    In [131]:
    def predict_next(input_array):
        x = np.reshape(input_array, (-1,seq_length,128))
        y = model.predict(x)
        return y
    
    def string_to_index(raw_input):
        raw_input = raw_input.lower()
        input_stream = nltk.word_tokenize(raw_input)
        res = []
        for word in input_stream[(len(input_stream)-seq_length):]:
            res.append(w2v_model[word])
        return res
    
    def y_to_word(y):
        word = w2v_model.most_similar(positive=y, topn=1)
        return word
    
     

    Good. Now wrap it all into one big function:

    In [137]:
    def generate_article(init, rounds=30):
        in_string = init.lower()
        for i in range(rounds):
            n = y_to_word(predict_next(string_to_index(in_string)))
            in_string += ' ' + n[0][0]
        return in_string
    
    In [138]:
    init = 'Language Models allow us to measure how likely a sentence is, which is an important for Machine'
    article = generate_article(init)
    print(article)
    
     
    language models allow us to measure how likely a sentence is, which is an important for machine engagement . to-day good-for-nothing fit job job job job job . i feel thing job job job ; thing really done certainly job job ; but i need not say
    
  • Original post: https://www.cnblogs.com/Josie-chen/p/9099085.html