  • cs20_9-1

    1. Theory review

    1.1 RNN

    1. Agenda

      • From feed-forward to recurrent / Tricks & treats / Presidential tweets
    2. BPTT

      • Use the sum of the gradients across all timesteps to update the parameters
      • Computationally expensive for a lot of timesteps
    3. Truncated BPTT: BPTT over a fixed number of timesteps (a minimal sketch follows at the end of this list)

      • Won’t be able to capture the full sequential dependencies
      • In non-eager TensorFlow, have to make sure all inputs are of the same length
    4. Exploding/vanishing Gradients

    5. The problem with RNNs: in practice, they aren't very good at capturing long-term dependencies
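
    A minimal TF 1.x sketch of truncated BPTT (my own illustration, not the lecture code: the shapes are made up and the "chunks" are random data standing in for consecutive windows of one long sequence). Backpropagation happens only inside each num_steps window, while the hidden state is carried across windows numerically:

      import numpy as np
      import tensorflow as tf
      
      batch_size, num_steps, input_dim, hidden_size = 32, 20, 50, 128
      
      inputs = tf.placeholder(tf.float32, [batch_size, num_steps, input_dim])
      init_state = tf.placeholder(tf.float32, [batch_size, hidden_size])
      
      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      # gradients only flow within this num_steps window
      output, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=init_state)
      
      # dummy windows of one long sequence
      chunks = np.random.rand(5, batch_size, num_steps, input_dim).astype(np.float32)
      
      with tf.Session() as sess:
          sess.run(tf.global_variables_initializer())
          state = np.zeros([batch_size, hidden_size], dtype=np.float32)
          for chunk in chunks:
              # a train_op would normally be run here too; only the state hand-off is shown
              state = sess.run(final_state, {inputs: chunk, init_state: state})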

    1.2 LSTMs

    1. LSTMs use a gating mechanism to control what information enters and leaves the cell at each timestep
      • input gate: how much of the current input to let through.
      • forget gate: how much of the previous state to take into account.
      • output gate: how much of the hidden state to expose to the next step.
      • candidate gate: computes a candidate cell state from the previous hidden state and the current input.
      • final memory cell: combines the candidate state with the input/forget gates to form the new cell state.
      • Refer to the cs224d (Stanford) notes for details when needed; the gates are spelled out in 1.3 below.

    1.3 GRUs (Gated Recurrent Units)

    1. Simplified LSTMs (comparable performance, but lower computational cost)

    2. Understanding LSTMs and GRUs, from cs224d

      • https://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf

      • Intuitively, we can think of the gate as controlling what information to enter and emit from the cell at each timestep. All the gates have the same dimensions.

        • input gate: decides how much of the current input to let through.
        • forget gate: defines how much of the previous state(cell state) to take into account.
        • output gate: defines how much of the hidden state to expose to the next timestep.
        • candidate gate: similar to the original RNN, this gate computes the candidate hidden state based on the previous hidden state and the current input.
        • final memory cell: the internal memory of the unit; it combines the candidate hidden state with the input/forget gate information. The output gate then decides how much of this final memory cell is exposed as the hidden state for the current step.
      • LSTM vs GRU: a GRU merges the cell and hidden states and uses only update and reset gates, so it has fewer parameters and is cheaper to compute; the update equations for both are sketched below.
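
        For reference, a sketch of the standard update equations (written from the cs224d notes' notation; treat any deviation as my own paraphrase, with \sigma the sigmoid and \circ elementwise multiplication):

          % LSTM
          \begin{aligned}
          i_t &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1}) && \text{(input gate)} \\
          f_t &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1}) && \text{(forget gate)} \\
          o_t &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1}) && \text{(output gate)} \\
          \tilde{c}_t &= \tanh(W^{(c)} x_t + U^{(c)} h_{t-1}) && \text{(candidate / new memory)} \\
          c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t && \text{(final memory cell)} \\
          h_t &= o_t \circ \tanh(c_t) && \text{(hidden state)}
          \end{aligned}

          % GRU
          \begin{aligned}
          z_t &= \sigma(W^{(z)} x_t + U^{(z)} h_{t-1}) && \text{(update gate)} \\
          r_t &= \sigma(W^{(r)} x_t + U^{(r)} h_{t-1}) && \text{(reset gate)} \\
          \tilde{h}_t &= \tanh(W x_t + r_t \circ U h_{t-1}) && \text{(candidate hidden state)} \\
          h_t &= z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t && \text{(new hidden state)}
          \end{aligned}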

    1.4 What can RNNs do? (See cs224d.)

    2. RNNs in TensorFlow

    2.1 Basics

    1. BasicRNNCell / RNNCell / BasicLSTMCell / LSTMCell / GRUCell

    2. Sample code

      # cell
      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      
      # layers
      layers = [tf.nn.rnn_cell.GRUCell(size) for size in hidden_sizes]
      cells = tf.nn.rnn_cell.MultiRNNCell(layers)  # stacks the cells above into one multi-layer cell
      
      # Constructing the recurrent neural network
      # 1) tf.nn.dynamic_rnn:
      #    uses a tf.while_loop to dynamically construct the graph when it is executed;
      #    graph creation is faster and you can feed batches of variable size
      # 2) tf.nn.bidirectional_dynamic_rnn:
      #    dynamic_rnn with bidirectional cells
      
      # Stack multiple cells
      layers = [tf.nn.rnn_cell.GRUCell(size) for size in hidden_sizes]
      cells = tf.nn.rnn_cell.MultiRNNCell(layers)
      output, out_state = tf.nn.dynamic_rnn(cells, seq, length, initial_state)
      # Most sequences are not of the same length
      
      # Dealing with variable sequence length
      # 1) Pad all sequences with zero vectors and all labels with a zero label
      #    (to make them the same length)
      # 2) Most current models can't deal with sequences longer than about 120 tokens,
      #    so there is usually a fixed max_length and sequences are truncated to it
      
      # The problem with padded/truncated sequences:
      # the padded labels change the total loss, which affects the gradients
      
      # Approaches to the problem above
      # Approach 1:
      # 1-1) Maintain a mask (True for real tokens, False for padded tokens)
      # 1-2) Run the model on both real and padded tokens
      #      (it will predict labels for the padded tokens as well)
      # 1-3) Only take into account the loss caused by the real elements
      full_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=preds)
      loss = tf.reduce_mean(tf.boolean_mask(full_loss, mask))
      
      # Approach 2:
      # Let your model know the real sequence length so it only predicts labels
      # for the real tokens
      # (create a fresh GRUCell per layer; reusing one cell object via [cell] * num_layers
      #  would make the layers share weights)
      rnn_cells = tf.nn.rnn_cell.MultiRNNCell(
          [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)])
      # real length of each one-hot sequence (seq: [batch, max_len, vocab_size])
      length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
      output, out_state = tf.nn.dynamic_rnn(rnn_cells, seq, length, initial_state)
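      
      # --------------------------------------------------------------------------
      # A fuller, self-contained sketch combining the two approaches above.
      # This is my own illustration (the sizes, placeholder names and the embedding
      # layer are made up), not the lecture's code.
      import tensorflow as tf
      
      batch_size, max_len, vocab_size, num_classes = 32, 120, 10000, 5
      hidden_size = 128
      
      seq = tf.placeholder(tf.int32, [batch_size, max_len])      # padded token ids, 0 = pad
      labels = tf.placeholder(tf.int32, [batch_size, max_len])   # one label per token
      
      # real length = number of non-zero tokens (id 0 is reserved for padding)
      length = tf.reduce_sum(tf.cast(tf.sign(seq), tf.int32), axis=1)
      
      embed = tf.get_variable('embed', [vocab_size, hidden_size])
      inputs = tf.nn.embedding_lookup(embed, seq)
      
      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      # Approach 2: sequence_length stops state updates past each real length
      output, out_state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=length,
                                            dtype=tf.float32)
      
      logits = tf.layers.dense(output, num_classes)
      full_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                                 logits=logits)
      # Approach 1: mask out the loss on the padded positions
      mask = tf.sequence_mask(length, max_len)
      loss = tf.reduce_mean(tf.boolean_mask(full_loss, mask))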
      
    3. Tips and Tricks

      • Vanishing Gradients

        • Use different activation units:
          • tf.nn.relu / tf.nn.relu6 / tf.nn.crelu / tf.nn.elu
        • Also available: tf.nn.softplus / tf.nn.softsign / tf.nn.bias_add / tf.sigmoid / tf.tanh
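
        A quick sketch of selecting a different activation (my own example, not from the lecture; tf.layers.dense and GRUCell's activation argument are standard TF 1.x APIs):

          import tensorflow as tf
          
          hidden_size = 128                                  # made-up size
          x = tf.placeholder(tf.float32, [None, 50])         # made-up input
          
          # ReLU in a feed-forward projection layer
          hidden = tf.layers.dense(x, hidden_size, activation=tf.nn.relu)
          
          # GRUCell's inner activation can also be swapped (default is tf.tanh)
          cell = tf.nn.rnn_cell.GRUCell(hidden_size, activation=tf.nn.relu)
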
      • Exploding Gradients

        • Clip gradients with tf.clip_by_global_norm

          trainables = tf.trainable_variables()
          gradients = tf.gradients(cost, trainables)
          # take gradients of cost w.r.t. all trainable variables
          
          clipped_gradients, _ = tf.clip_by_global_norm(gradients, max_grad_norm)
          # clip the gradients by a pre-defined max norm
          
          optimizer = tf.train.AdamOptimizer(learning_rate)
          train_op = optimizer.apply_gradients(zip(clipped_gradients, trainables))
          # apply the clipped (not the raw) gradients through the optimizer
          
      • Anneal the learning rate

        # Optimizers accept both scalars and tensors as the learning rate
        
        global_step = tf.Variable(0, trainable=False)
        learning_rate = tf.train.exponential_decay(init_lr,
                                                   global_step,
                                                   decay_steps,
                                                   decay_rate,
                                                   staircase=True)
        optimizer = tf.train.AdamOptimizer(learning_rate)
        # pass global_step so each update increments it and the rate actually decays
        train_op = optimizer.minimize(loss, global_step=global_step)
        
      • Overfitting

        • Use dropout through tf.nn.dropout or DropoutWrapper for cells

          # tf.nn.dropout
          hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)
          
          # DropoutWrapper
          cell = tf.nn.rnn_cell.GRUCell(hidden_size)
          cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
          

    2.2 Language Modeling

    1. Language Modeling: Main approaches

      • Word-level: n-grams
      • Character-level
      • Subword-level: somewhere in between the two above
    2. Language Modeling: N-grams

      • The traditional approach up until very recently. Train a model to predict the next word based on previous n-grams
      • Problems with n-gram models:
        • Huge vocabulary
        • Can’t generalize to OOV (out of vocabulary)
        • Requires a lot of memory
    3. Language Modeling: Character-level

      • Introduced in the early 2010s. Both input and output are characters

      • Pros & cons of character-level models:

      • Pros:

        • Very small vocabulary
        • Doesn’t require word embeddings
        • Faster to train

      • Cons:

        • Low fluency (many words can be gibberish)
    4. Language Modeling: Hybrid

      • Word-level by default, switching to character-level for unknown tokens
    5. Language Modeling: Subword-Level (in between character-level and word-level)

      • Input and output are subwords

      • Keep W most frequent words

      • Keep S most frequent syllables

      • Split the rest into characters

      • Seems to perform better than both word-level and character-level models

      • Example (a toy segmentation sketch follows the segmented text below):

        new company dreamworks interactive 
        new company dre+ am+ wo+ rks: in+ te+ ra+ cti+ ve:
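
      A toy sketch of the idea (my own simplification, not the lecture's model: it keeps only the W most frequent words and splits everything else into single characters, skipping the syllable step; '+' marks a non-final piece and ':' the final piece, as in the example above):

        from collections import Counter
        
        def make_segmenter(corpus_tokens, W=1000):
            """Keep the W most frequent words as units; split the rest into characters."""
            keep = {w for w, _ in Counter(corpus_tokens).most_common(W)}
        
            def segment(token):
                if token in keep:
                    return [token]
                pieces = list(token)
                return [p + '+' for p in pieces[:-1]] + [pieces[-1] + ':']
        
            return segment
        
        tokens = "new company dreamworks interactive new company".split()
        segment = make_segmenter(tokens, W=2)
        print([segment(t) for t in "new company dreamworks".split()])
        # [['new'], ['company'], ['d+', 'r+', 'e+', 'a+', 'm+', 'w+', 'o+', 'r+', 'k+', 's:']]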
        

    2.3 demo code: Char-RNN Language Modeling

    1. Goal: Presidential Tweet Bot (train a bot on a tweet dataset to generate tweets, i.e., character-level text generation)

    2. Dataset: 19,469 of Donald Trump's tweets

    3. References:

      [1] https://docs.google.com/presentation/d/1QydMhsGFeUzDYZr7dV0tt9WJryleOoKUyJth4ftRDbA/edit#slide=id.g1ce324a1fd_0_496

      [2] https://docs.google.com/document/d/1_ZqzBqFMV8YmdC2PmaTXOB9O1BZ14yLNTQd2_Bkff9Y/edit#

      (focus on the details of the tweet bot)

      [3] http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (some of Andrej Karpathy's striking findings)

    4. More experiment details: TODO (a minimal sketch of a character-level model follows below)
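
    Since the demo code itself is not reproduced in these notes, here is a minimal graph-mode sketch of a character-level language model in the spirit of the tweet bot (all hyperparameters, placeholder names and the loss wiring are my own illustration, not the course's example code):

      import tensorflow as tf
      
      vocab_size = 128        # distinct characters in the tweet corpus (made up)
      hidden_size = 256
      num_steps = 50          # characters per training window (made up)
      learning_rate = 0.001
      
      # batches of character indices, padded/truncated to num_steps
      seq = tf.placeholder(tf.int32, [None, num_steps])
      
      # one-hot encode; inputs are characters 0..T-2, targets are characters 1..T-1
      one_hot = tf.one_hot(seq, vocab_size)
      inputs, targets = one_hot[:, :-1], one_hot[:, 1:]
      
      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      output, out_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
      
      # project the hidden state to character logits at every timestep
      logits = tf.layers.dense(output, vocab_size)
      
      loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits_v2(labels=targets, logits=logits))
      train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
      
      # At generation time: feed a seed character, sample the next character from
      # tf.nn.softmax(logits) at the last real step, append it, and repeat.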
