  • Generating training data for RNN natural language processing: a worked example

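    When you first work with RNNs, the data preparation is easy to get lost in. This post summarizes it and prints the result of every processing step.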

    Downloading the data

    import numpy as np
    import tensorflow as tf
    from tensorflow import keras
    
    shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
    filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
    with open(filepath) as f:
        shakespeare_text = f.read()
    "".join(sorted(set(shakespeare_text.lower())))  # the distinct characters in the corpus
    # Character-level tokenizer: every character is mapped to an integer ID, starting at 1
    tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
    tokenizer.fit_on_texts(shakespeare_text)
    tokenizer.texts_to_sequences(["First"])  # text -> IDs
    tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])  # IDs -> text
    max_id = len(tokenizer.word_index) # number of distinct characters
    dataset_size = tokenizer.document_count # total number of characters
    # Subtract 1 so that the IDs run from 0 to max_id - 1
    [encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
    train_size = dataset_size * 90 // 100  # use the first 90% for training
    dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
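To see what the character-level encoding amounts to without downloading anything, here is a minimal pure-Python sketch. It assumes, in line with how the Keras `Tokenizer` behaves as far as I know, that text is lowercased, IDs start at 1, and more frequent characters get smaller IDs. The helper names `build_char_index` and `encode`, the sample string, and the tie-breaking by first appearance are illustrative assumptions, not part of the original post.

```python
from collections import Counter

def build_char_index(text):
    """Map each character to an ID starting at 1, most frequent first.
    Ties keep first-seen order (an assumption for this sketch)."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index):
    """Turn text into IDs shifted down by 1, so they run from 0 to max_id - 1."""
    return [char_index[ch] - 1 for ch in text.lower()]

sample = "hello world"                 # toy stand-in for the Shakespeare text
char_index = build_char_index(sample)
max_id = len(char_index)               # number of distinct characters
encoded = encode(sample, char_index)

print(char_index)
print(encoded)
```

The shift by 1 matters later: a model with `max_id` output classes expects class IDs in the range 0 to `max_id - 1`.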
    

    Processing the data

    To make each step easier to follow, its result is printed right after it.

    for _ in dataset.take(1):
        print(_)
    # tf.Tensor(19, shape=(), dtype=int64)
    
    n_steps = 100
    window_length = n_steps + 1 # target = input shifted 1 character ahead
    dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)
    
    for _ in dataset.take(1):
        print(_)
    # <_VariantDataset shapes: (), types: tf.int64>
    
    dataset = dataset.flat_map(lambda window: window.batch(window_length))
    
    for _ in dataset.take(1):
        print(_)
    
    tf.Tensor(
    [19  5  8  7  2  0 18  5  2  5 35  1  9 23 10 21  1 19  3  8  1  0 16  1
      0 22  8  3 18  1  1 12  0  4  9 15  0 19 13  8  2  6  1  8 17  0  6  1
      4  8  0 14  1  0  7 22  1  4 24 26 10 10  4 11 11 23 10  7 22  1  4 24
     17  0  7 22  1  4 24 26 10 10 19  5  8  7  2  0 18  5  2  5 35  1  9 23
     10 15  3 13  0], shape=(101,), dtype=int64)
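The `window()` and `flat_map()` steps above can be mimicked in plain Python, which may make the shape of the result easier to picture. This is only a sketch on a toy list: the function name `sliding_windows` is hypothetical, and unlike the original pipeline it does not call `repeat()`, so the stream of windows is finite.

```python
def sliding_windows(seq, window_length, shift=1):
    """Yield each full window of seq; incomplete tail windows are dropped,
    mirroring drop_remainder=True."""
    last_start = len(seq) - window_length
    for start in range(0, last_start + 1, shift):
        yield seq[start:start + window_length]

encoded = [19, 5, 8, 7, 2, 0, 18]   # toy stand-in for the encoded text
n_steps = 4
window_length = n_steps + 1          # one extra element for the shifted target

windows = list(sliding_windows(encoded, window_length))
for w in windows:
    print(w)
```

With `shift=1`, each window overlaps the previous one in all but one element, so consecutive windows in the real dataset differ by a single character.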
    
    batch_size = 32
    # dataset = dataset.shuffle(10000).batch(batch_size)  # for real training, use this line in place of the next one
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
    
    for _ in dataset.take(1):
        print(_[0][0]) # X
    
    tf.Tensor(
    [19  5  8  7  2  0 18  5  2  5 35  1  9 23 10 21  1 19  3  8  1  0 16  1
      0 22  8  3 18  1  1 12  0  4  9 15  0 19 13  8  2  6  1  8 17  0  6  1
      4  8  0 14  1  0  7 22  1  4 24 26 10 10  4 11 11 23 10  7 22  1  4 24
     17  0  7 22  1  4 24 26 10 10 19  5  8  7  2  0 18  5  2  5 35  1  9 23
     10 15  3 13], shape=(100,), dtype=int64)
    
    for _ in dataset.take(1):
        print(_[1][0]) # Y
    
    tf.Tensor(
    [ 5  8  7  2  0 18  5  2  5 35  1  9 23 10 21  1 19  3  8  1  0 16  1  0
     22  8  3 18  1  1 12  0  4  9 15  0 19 13  8  2  6  1  8 17  0  6  1  4
      8  0 14  1  0  7 22  1  4 24 26 10 10  4 11 11 23 10  7 22  1  4 24 17
      0  7 22  1  4 24 26 10 10 19  5  8  7  2  0 18  5  2  5 35  1  9 23 10
     15  3 13  0], shape=(100,), dtype=int64)
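The batching and the `windows[:, :-1]` / `windows[:, 1:]` slicing can also be sketched without TensorFlow. The helper names `batched` and `split_inputs_targets` and the toy windows are assumptions for illustration; nested lists stand in for the 2-D tensor.

```python
def batched(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def split_inputs_targets(batch):
    """List-of-lists counterpart of (windows[:, :-1], windows[:, 1:])."""
    inputs = [w[:-1] for w in batch]   # every element except the last
    targets = [w[1:] for w in batch]   # every element except the first
    return inputs, targets

windows = [[19, 5, 8, 7, 2], [5, 8, 7, 2, 0], [8, 7, 2, 0, 18]]
batches = batched(windows, batch_size=2)
X, Y = split_inputs_targets(batches[0])

print(X)
print(Y)
```

Each row of `Y` is the matching row of `X` moved one position ahead, which is what the printed X and Y tensors above show at full length.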
    

    Summary

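    1. Slice the encoded character sequence into sliding windows of window_length = 101 characters.
    2. In each window, positions 0-99 (100 characters) form X and positions 1-100 (100 characters) form Y, so Y is X shifted one character ahead.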
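The two steps are perhaps easiest to read on raw text rather than on IDs. The sketch below uses a short string and `n_steps = 5` instead of 100; both values are chosen only to keep the output small and are not from the original post.

```python
text = "First Citizen"
n_steps = 5
window_length = n_steps + 1

# One (X, Y) pair per window: X is the first n_steps characters,
# Y is the same window shifted one character ahead.
pairs = [
    (text[i:i + n_steps], text[i + 1:i + window_length])
    for i in range(len(text) - window_length + 1)
]

for x, y in pairs:
    print(repr(x), "->", repr(y))
```

At every position the target is simply the next character of the input, which is the relationship the model is trained to predict.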
  • Original article: https://www.cnblogs.com/yaos/p/14014143.html