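When you first work with RNNs, it is easy to get confused by the data preprocessing. This post walks through the pipeline and prints the result of every step.
Downloading the data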
import numpy as np
import tensorflow as tf
from tensorflow import keras
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()
"".join(sorted(set(shakespeare_text.lower())))
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)
tokenizer.texts_to_sequences(["First"])
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
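To make the encoding step less opaque, here is a minimal plain-Python sketch of what a character-level tokenizer roughly does: lowercase the text, give each distinct character a 1-based ID with the most frequent character first, then subtract 1 so the IDs start at 0. The helper names `build_char_index`, `encode`, and `decode` are hypothetical and not part of Keras, and the ordering of characters with equal counts is an assumption that may differ from the real `Tokenizer`.

```python
from collections import Counter


def build_char_index(text):
    """Map each lowercase character to a 1-based ID, most frequent first."""
    counts = Counter(text.lower())
    # most_common() sorts by descending count; ties keep first-seen order
    # (an assumption here, the real Tokenizer may order ties differently).
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


def encode(text, char_index):
    """Encode text as 0-based IDs (the 1-based ID minus 1)."""
    return [char_index[ch] - 1 for ch in text.lower()]


def decode(ids, char_index):
    """Invert encode(): turn 0-based IDs back into lowercase text."""
    index_char = {i - 1: ch for ch, i in char_index.items()}
    return "".join(index_char[i] for i in ids)


text = "hello world"                # toy stand-in for shakespeare_text
char_index = build_char_index(text)
max_id = len(char_index)            # number of distinct characters
encoded = encode(text, char_index)

print(max_id)                       # 8
print(encoded)                      # [2, 3, 0, 0, 1, 4, 5, 1, 6, 0, 7]
print(decode(encoded, char_index))  # hello world
```

The most frequent character (`l` in this toy text) gets ID 0 after the subtraction, which is why small numbers dominate the encoded output.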
Processing the data
To make the pipeline easier to follow, the result of each step is printed.
for item in dataset.take(1):
    print(item)
# tf.Tensor(19, shape=(), dtype=int64)
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)
for window in dataset.take(1):
    print(window)  # each window is itself a (nested) dataset
# <_VariantDataset shapes: (), types: tf.int64>
dataset = dataset.flat_map(lambda window: window.batch(window_length))
for window in dataset.take(1):
    print(window)  # now a single flat tensor of 101 character IDs
# tf.Tensor(
# [19 5 8 7 2 0 18 5 2 5 35 1 9 23 10 21 1 19 3 8 1 0 16 1
# 0 22 8 3 18 1 1 12 0 4 9 15 0 19 13 8 2 6 1 8 17 0 6 1
# 4 8 0 14 1 0 7 22 1 4 24 26 10 10 4 11 11 23 10 7 22 1 4 24
# 17 0 7 22 1 4 24 26 10 10 19 5 8 7 2 0 18 5 2 5 35 1 9 23
# 10 15 3 13 0], shape=(101,), dtype=int64)
batch_size = 32
# dataset = dataset.shuffle(10000).batch(batch_size)  # for real training, use this instead of the next line
dataset = dataset.batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
for X_batch, Y_batch in dataset.take(1):
    print(X_batch[0])  # X: first input sequence in the batch
# tf.Tensor(
# [19 5 8 7 2 0 18 5 2 5 35 1 9 23 10 21 1 19 3 8 1 0 16 1
# 0 22 8 3 18 1 1 12 0 4 9 15 0 19 13 8 2 6 1 8 17 0 6 1
# 4 8 0 14 1 0 7 22 1 4 24 26 10 10 4 11 11 23 10 7 22 1 4 24
# 17 0 7 22 1 4 24 26 10 10 19 5 8 7 2 0 18 5 2 5 35 1 9 23
# 10 15 3 13], shape=(100,), dtype=int64)
for X_batch, Y_batch in dataset.take(1):
    print(Y_batch[0])  # Y: first target sequence in the batch
# tf.Tensor(
# [ 5 8 7 2 0 18 5 2 5 35 1 9 23 10 21 1 19 3 8 1 0 16 1 0
# 22 8 3 18 1 1 12 0 4 9 15 0 19 13 8 2 6 1 8 17 0 6 1 4
# 8 0 14 1 0 7 22 1 4 24 26 10 10 4 11 11 23 10 7 22 1 4 24 17
# 0 7 22 1 4 24 26 10 10 19 5 8 7 2 0 18 5 2 5 35 1 9 23 10
# 15 3 13 0], shape=(100,), dtype=int64)
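Summary
- First, slide a window of window_length=101 over the encoded characters, moving one character at a time
- Within each window, positions 0-99 (100 values) become X and positions 1-100 (100 values) become Y, so Y is X shifted one character ahead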
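The same windowing and X/Y split can be checked by hand on a toy sequence. The sketch below is plain Python rather than `tf.data`, uses a window of 5 instead of 101, and leaves out `repeat()` and shuffling. The helpers `make_windows` and `split_batch` are hypothetical names introduced only for illustration.

```python
def make_windows(seq, window_length):
    """All full-length windows with a shift of 1 (like drop_remainder=True)."""
    return [seq[i:i + window_length]
            for i in range(len(seq) - window_length + 1)]


def split_batch(windows):
    """Split each window into inputs and targets shifted one step ahead."""
    X = [w[:-1] for w in windows]   # all but the last element
    Y = [w[1:] for w in windows]    # all but the first element
    return X, Y


n_steps = 4
window_length = n_steps + 1         # target = input shifted 1 step ahead
encoded = list(range(10))           # toy stand-in for encoded character IDs
batch_size = 3

windows = make_windows(encoded, window_length)
# Group consecutive windows into batches of batch_size.
batches = [windows[i:i + batch_size]
           for i in range(0, len(windows), batch_size)]

X, Y = split_batch(batches[0])
print(X)  # [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]
print(Y)  # [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
```

Each row of Y equals the matching row of X moved one position to the right, which is the relationship visible in the printed X and Y tensors from the real pipeline.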