  • NLP (21) Generating Text Automatically from Existing Text with an LSTM

    Original link: http://www.one2know.cn/nlp21/

    Generating text automatically from existing text with an LSTM

    from __future__ import print_function
    import numpy as np
    import random
    import sys
    
    path = r'shakespeare_final.txt'
    text = open(path).read().lower() # open the file, read it as one string, convert everything to lowercase
    characters = sorted(list(set(text))) # deduplicate the characters so they can be indexed below
    print('corpus length:',len(text))
    print('total chars:',len(characters))
    
    char2indices = dict((c,i) for i,c in enumerate(characters)) # character (letter, etc.) => index (number)
    indices2char = dict((i,c) for i,c in enumerate(characters)) # index (number) => character (letter, etc.)
    
    maxlen = 40 # use 40 characters of context to predict the next character
    step = 3 # slide the window forward 3 characters at a time
    sentences = []
    next_chars = []
    for i in range(0,len(text)-maxlen,step):
        sentences.append(text[i:i+maxlen])
        next_chars.append(text[i+maxlen])
    print('nb sentences:',len(sentences)) # number of 40-character training sequences, i.e. the size of the training set
    
    ## Build the dataset with one-hot-style encoding
    X = np.zeros((len(sentences),maxlen,len(characters)),dtype=bool)
    y = np.zeros((len(sentences),len(characters)),dtype=bool)
    for i,sentence in enumerate(sentences):
        for t,char in enumerate(sentence):
            X[i,t,char2indices[char]] = 1
        y[i,char2indices[next_chars[i]]] = 1
    
    # Build the neural network
    from keras.models import Sequential
    from keras.layers import Dense,LSTM,Activation,Dropout
    from keras.optimizers import RMSprop
    model = Sequential()
    model.add(LSTM(128,input_shape=(maxlen,len(characters))))
    model.add(Dense(len(characters)))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',optimizer=RMSprop(lr=0.01))
    print(model.summary())
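    # Parameter-count check (illustrative note, not part of the original script):
    #   LSTM layer : 4 * (units * (units + input_dim) + units)
    #              = 4 * (128 * (128 + 61) + 128) = 97,280
    #   Dense layer: 128 * 61 + 61 = 7,869
    # These match the model.summary() output shown below.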
    
    # Sample a character index from the predicted distribution,
    # rescaled by the temperature `metric` (lower = more conservative choices)
    def pred_indices(preds,metric=1.0):
        preds = np.asarray(preds).astype('float64')
        preds = np.log(preds) / metric
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        probs = np.random.multinomial(1,preds,1)
        return np.argmax(probs)
    
    for iteration in range(1,30): # train one epoch at a time so the generated text can be inspected after each iteration
        print('-' * 40)
        print('Iteration',iteration)
        model.fit(X,y,batch_size=128,epochs=1)
        start_index = random.randint(0,len(text)-maxlen-1)
        for diversity in [0.2,0.7,1.2]:
            print('\n----- diversity:',diversity)
            generated = ''
            sentence = text[start_index:start_index+maxlen]
            generated += sentence
            print('----- Generating with seed: "'+sentence+'"')
            sys.stdout.write(generated)
            for i in range(400):
                x = np.zeros((1,maxlen,len(characters)))
                for t,char in enumerate(sentence): # one-hot encode the current seed window
                    x[0,t,char2indices[char]] = 1
                preds = model.predict(x,verbose=0)[0]
                next_index = pred_indices(preds,diversity)
                pred_char = indices2char[next_index]
                generated += pred_char
                sentence = sentence[1:] + pred_char
                sys.stdout.write(pred_char)
                sys.stdout.flush()
            print('\nOne combination completed\n')
    
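    To make the sliding-window sampling in the script concrete, here is a minimal standalone sketch (the toy string and numbers are for illustration only): each sample is a maxlen-character slice of the text, its label is the single character that immediately follows it, and the window then moves forward by step characters.

    # Toy illustration of the sliding-window construction (hypothetical data)
    text = 'to be or not to be'
    maxlen, step = 5, 3
    sentences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sentences.append(text[i:i + maxlen])  # input: maxlen consecutive characters
        next_chars.append(text[i + maxlen])   # label: the character that follows
    for s, c in zip(sentences, next_chars):
        print(repr(s), '->', repr(c))
    # 'to be' -> ' ', 'be or' -> ' ', 'or no' -> 't', ...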

    Output:

    corpus length: 581432
    total chars: 61
    nb sentences: 193798
    Using TensorFlow backend.
    WARNING:tensorflow:From D:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Colocations handled automatically by placer.
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    lstm_1 (LSTM)                (None, 128)               97280     
    _________________________________________________________________
    dense_1 (Dense)              (None, 61)                7869      
    _________________________________________________________________
    activation_1 (Activation)    (None, 61)                0         
    =================================================================
    Total params: 105,149
    Trainable params: 105,149
    Non-trainable params: 0
    _________________________________________________________________
    None
    ----------------------------------------
    Iteration 1
    WARNING:tensorflow:From D:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use tf.cast instead.
    Epoch 1/1
    2019-07-15 17:04:03.721908: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    2019-07-15 17:04:04.438003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
    name: GeForce GTX 950M major: 5 minor: 0 memoryClockRate(GHz): 1.124
    pciBusID: 0000:01:00.0
    totalMemory: 2.00GiB freeMemory: 1.64GiB
    2019-07-15 17:04:04.438676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
    2019-07-15 17:04:07.352274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-07-15 17:04:07.352543: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
    2019-07-15 17:04:07.352701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
    2019-07-15 17:04:07.357455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1386 MB memory) -> physical GPU (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0, compute capability: 5.0)
    2019-07-15 17:04:08.415227: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
    
       128/193798 [..............................] - ETA: 2:16:56 - loss: 4.1095
       256/193798 [..............................] - ETA: 1:09:23 - loss: 3.6938
       384/193798 [..............................] - ETA: 46:52 - loss: 3.8312 
    ...
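    The diversity value passed to pred_indices acts as a sampling temperature: dividing the log-probabilities by a small value sharpens the distribution toward the most likely next character (safer, more repetitive text), while a larger value flattens it (more varied, noisier text). A minimal sketch with toy numbers, reusing the same rescaling as pred_indices but without the sampling step:

    import numpy as np

    def rescale(preds, temperature):
        # identical renormalisation to pred_indices, minus the multinomial draw
        preds = np.log(np.asarray(preds, dtype='float64')) / temperature
        exp_preds = np.exp(preds)
        return exp_preds / np.sum(exp_preds)

    probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical softmax output over 4 characters
    for t in [0.2, 0.7, 1.2]:
        print(t, np.round(rescale(probs, t), 3))
    # 0.2 -> nearly all probability mass on the first character
    # 1.2 -> close to the original distribution, slightly flatter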
    
  • Original article: https://www.cnblogs.com/peng8098/p/nlp_21.html