  • cs20_9-1

    1. Theory review

    1.1 RNN

    1. Agenda

      • From feed-forward to recurrent / Tricks & treats / Presidential tweets
    2. BPTT

      • Use the sum of the gradients across all timesteps to update the parameters
      • Computationally expensive for a lot of timesteps
    3. Truncated BPTT: BPTT over a fixed number of timesteps (a minimal sketch follows at the end of this list)

      • Won’t be able to capture the full sequential dependencies
      • In non-eager TensorFlow, have to make sure all inputs are of the same length
    4. Exploding/vanishing Gradients

    5. The problem with RNNs: in practice, they aren't very good at capturing long-term dependencies
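
    A minimal TF 1.x sketch of truncated BPTT (my own illustration, not the lecture code: the shapes are made up and the "chunks" are random data standing in for consecutive windows of one long sequence). Backpropagation happens only inside each num_steps window, while the hidden state is carried across windows numerically:

      import numpy as np
      import tensorflow as tf
      
      batch_size, num_steps, input_dim, hidden_size = 32, 20, 50, 128
      
      inputs = tf.placeholder(tf.float32, [batch_size, num_steps, input_dim])
      init_state = tf.placeholder(tf.float32, [batch_size, hidden_size])
      
      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      # gradients only flow within this num_steps window
      output, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=init_state)
      
      # dummy windows of one long sequence
      chunks = np.random.rand(5, batch_size, num_steps, input_dim).astype(np.float32)
      
      with tf.Session() as sess:
          sess.run(tf.global_variables_initializer())
          state = np.zeros([batch_size, hidden_size], dtype=np.float32)
          for chunk in chunks:
              # a train_op would normally be run here too; only the state hand-off is shown
              state = sess.run(final_state, {inputs: chunk, init_state: state})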

    1.2 LSTMs

    1. LSTMs use a gating mechanism to control what information enters and leaves the cell at each timestep
      • input gate: how much of the current input to let through.
      • forget gate: how much of the previous state to take into account.
      • output gate: how much of the hidden state to expose to the next step.
      • candidate gate: computes a candidate cell state from the previous hidden state and the current input.
      • final memory cell: combines the candidate state with the input/forget gates to form the new cell state.
      • Refer to the cs224d (Stanford) notes for details when needed; the gates are spelled out in 1.3 below.

    1.3 GRUs (Gated Recurrent Units)

    1. Simplified LSTMs (comparable performance, but lower computational cost)

    2. Understanding LSTMs and GRUs, from cs224d

      • https://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf

      • Intuitively, we can think of the gate as controlling what information to enter and emit from the cell at each timestep. All the gates have the same dimensions.

        • input gate: decides how much of the current input to let through.
        • forget gate: defines how much of the previous state(cell state) to take into account.
        • output gate: defines how much of the hidden state to expose to the next timestep.
        • candidate gate: similar to the original RNN, this gate computes the candidate hidden state based on the previous hidden state and the current input.
        • final memory cell: the internal memory of the unit; it combines the candidate hidden state with the input/forget gate information. The output gate then decides how much of this final memory cell is exposed as the hidden state for the current step.
      • LSTM vs GRU: a GRU merges the cell and hidden states and uses only update and reset gates, so it has fewer parameters and is cheaper to compute; the update equations for both are sketched below.
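
        For reference, a sketch of the standard update equations (written from the cs224d notes' notation; treat any deviation as my own paraphrase, with \sigma the sigmoid and \circ elementwise multiplication):

          % LSTM
          \begin{aligned}
          i_t &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1}) && \text{(input gate)} \\
          f_t &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1}) && \text{(forget gate)} \\
          o_t &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1}) && \text{(output gate)} \\
          \tilde{c}_t &= \tanh(W^{(c)} x_t + U^{(c)} h_{t-1}) && \text{(candidate / new memory)} \\
          c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t && \text{(final memory cell)} \\
          h_t &= o_t \circ \tanh(c_t) && \text{(hidden state)}
          \end{aligned}

          % GRU
          \begin{aligned}
          z_t &= \sigma(W^{(z)} x_t + U^{(z)} h_{t-1}) && \text{(update gate)} \\
          r_t &= \sigma(W^{(r)} x_t + U^{(r)} h_{t-1}) && \text{(reset gate)} \\
          \tilde{h}_t &= \tanh(W x_t + r_t \circ U h_{t-1}) && \text{(candidate hidden state)} \\
          h_t &= z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t && \text{(new hidden state)}
          \end{aligned}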

    1.4 What can RNNs do? (See cs224d.)

    2. RNNs in TensorFlow

    2.1 Basics

    1. BasicRNNCell / RNNCell / BasicLSTMCell / LSTMCell / GRUCell

    2. Sample code

      # cell
      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      
      # layers
      layers = [tf.nn.rnn_cell.GRUCell(size) for size in hidden_sizes]
      cells = tf.nn.rnn_cell.MultiRNNCell(layers)  # stacks the cells above into one multi-layer cell
      
      # Constructing the recurrent neural network
      # 1) tf.nn.dynamic_rnn:
      #    uses a tf.while_loop to dynamically construct the graph when it is executed;
      #    graph creation is faster and you can feed batches of variable size
      # 2) tf.nn.bidirectional_dynamic_rnn:
      #    dynamic_rnn with bidirectional cells
      
      # Stack multiple cells
      layers = [tf.nn.rnn_cell.GRUCell(size) for size in hidden_sizes]
      cells = tf.nn.rnn_cell.MultiRNNCell(layers)
      output, out_state = tf.nn.dynamic_rnn(cells, seq, length, initial_state)
      # Most sequences are not of the same length
      
      # Dealing with variable sequence length
      # 1) Pad all sequences with zero vectors and all labels with a zero label
      #    (to make them the same length)
      # 2) Most current models can't deal with sequences longer than about 120 tokens,
      #    so there is usually a fixed max_length and sequences are truncated to it
      
      # The problem with padded/truncated sequences:
      # the padded labels change the total loss, which affects the gradients
      
      # Approaches to the problem above
      # Approach 1:
      # 1-1) Maintain a mask (True for real tokens, False for padded tokens)
      # 1-2) Run the model on both real and padded tokens
      #      (it will predict labels for the padded tokens as well)
      # 1-3) Only take into account the loss caused by the real elements
      full_loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=preds)
      loss = tf.reduce_mean(tf.boolean_mask(full_loss, mask))
      
      # Approach 2:
      # Let your model know the real sequence length so it only predicts labels
      # for the real tokens
      # (create a fresh GRUCell per layer; reusing one cell object via [cell] * num_layers
      #  would make the layers share weights)
      rnn_cells = tf.nn.rnn_cell.MultiRNNCell(
          [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)])
      # real length of each one-hot sequence (seq: [batch, max_len, vocab_size])
      length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
      output, out_state = tf.nn.dynamic_rnn(rnn_cells, seq, length, initial_state)
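      
      # --------------------------------------------------------------------------
      # A fuller, self-contained sketch combining the two approaches above.
      # This is my own illustration (the sizes, placeholder names and the embedding
      # layer are made up), not the lecture's code.
      import tensorflow as tf
      
      batch_size, max_len, vocab_size, num_classes = 32, 120, 10000, 5
      hidden_size = 128
      
      seq = tf.placeholder(tf.int32, [batch_size, max_len])      # padded token ids, 0 = pad
      labels = tf.placeholder(tf.int32, [batch_size, max_len])   # one label per token
      
      # real length = number of non-zero tokens (id 0 is reserved for padding)
      length = tf.reduce_sum(tf.cast(tf.sign(seq), tf.int32), axis=1)
      
      embed = tf.get_variable('embed', [vocab_size, hidden_size])
      inputs = tf.nn.embedding_lookup(embed, seq)
      
      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      # Approach 2: sequence_length stops state updates past each real length
      output, out_state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=length,
                                            dtype=tf.float32)
      
      logits = tf.layers.dense(output, num_classes)
      full_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
                                                                 logits=logits)
      # Approach 1: mask out the loss on the padded positions
      mask = tf.sequence_mask(length, max_len)
      loss = tf.reduce_mean(tf.boolean_mask(full_loss, mask))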
      
    3. Tips and Tricks

      • Vanishing Gradients

        • Use different activation units:
          • tf.nn.relu / tf.nn.relu6 / tf.nn.crelu / tf.nn.elu
        • Also available: tf.nn.softplus / tf.nn.softsign / tf.nn.bias_add / tf.sigmoid / tf.tanh
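
        A quick sketch of selecting a different activation (my own example, not from the lecture; tf.layers.dense and GRUCell's activation argument are standard TF 1.x APIs):

          import tensorflow as tf
          
          hidden_size = 128                                  # made-up size
          x = tf.placeholder(tf.float32, [None, 50])         # made-up input
          
          # ReLU in a feed-forward projection layer
          hidden = tf.layers.dense(x, hidden_size, activation=tf.nn.relu)
          
          # GRUCell's inner activation can also be swapped (default is tf.tanh)
          cell = tf.nn.rnn_cell.GRUCell(hidden_size, activation=tf.nn.relu)
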
      • Exploding Gradients

        • Clip gradients with tf.clip_by_global_norm

          trainables = tf.trainable_variables()
          gradients = tf.gradients(cost, trainables)
          # take gradients of cost w.r.t. all trainable variables
          
          clipped_gradients, _ = tf.clip_by_global_norm(gradients, max_grad_norm)
          # clip the gradients by a pre-defined max norm
          
          optimizer = tf.train.AdamOptimizer(learning_rate)
          train_op = optimizer.apply_gradients(zip(clipped_gradients, trainables))
          # apply the clipped (not the raw) gradients through the optimizer
          
      • Anneal the learning rate

        # Optimizers accept both scalars and tensors as the learning rate
        
        global_step = tf.Variable(0, trainable=False)
        learning_rate = tf.train.exponential_decay(init_lr,
                                                   global_step,
                                                   decay_steps,
                                                   decay_rate,
                                                   staircase=True)
        optimizer = tf.train.AdamOptimizer(learning_rate)
        # pass global_step so each update increments it and the rate actually decays
        train_op = optimizer.minimize(loss, global_step=global_step)
        
      • Overfitting

        • Use dropout through tf.nn.dropout or DropoutWrapper for cells

          # tf.nn.dropout
          hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)
          
          # DropoutWrapper
          cell = tf.nn.rnn_cell.GRUCell(hidden_size)
          cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=keep_prob)
          

    2.2 Language Modeling

    1. Language Modeling: Main approaches

      • Word-level: n-grams
      • Character-level
      • Subword-level: somewhere in between the two above
    2. Language Modeling: N-grams

      • The traditional approach up until very recently. Train a model to predict the next word based on previous n-grams
      • Problems with n-gram models:
        • Huge vocabulary
        • Can’t generalize to OOV (out of vocabulary)
        • Requires a lot of memory
    3. Language Modeling: Character-level

      • Introduced in the early 2010s. Both input and output are characters

      • Pros & cons of character-level models:

      • Pros:

        • Very small vocabulary
        • Doesn’t require word embeddings
        • Faster to train

      • Cons:

        • Low fluency (many words can be gibberish)
    4. Language Modeling: Hybrid

      • Word-level by default, switching to character-level for unknown tokens
    5. Language Modeling: Subword-Level (in between character-level and word-level)

      • Input and output are subwords

      • Keep W most frequent words

      • Keep S most frequent syllables

      • Split the rest into characters

      • Seems to perform better than both word-level and character-level models

      • Example (a toy segmentation sketch follows the segmented text below):

        new company dreamworks interactive 
        new company dre+ am+ wo+ rks: in+ te+ ra+ cti+ ve:
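
      A toy sketch of the idea (my own simplification, not the lecture's model: it keeps only the W most frequent words and splits everything else into single characters, skipping the syllable step; '+' marks a non-final piece and ':' the final piece, as in the example above):

        from collections import Counter
        
        def make_segmenter(corpus_tokens, W=1000):
            """Keep the W most frequent words as units; split the rest into characters."""
            keep = {w for w, _ in Counter(corpus_tokens).most_common(W)}
        
            def segment(token):
                if token in keep:
                    return [token]
                pieces = list(token)
                return [p + '+' for p in pieces[:-1]] + [pieces[-1] + ':']
        
            return segment
        
        tokens = "new company dreamworks interactive new company".split()
        segment = make_segmenter(tokens, W=2)
        print([segment(t) for t in "new company dreamworks".split()])
        # [['new'], ['company'], ['d+', 'r+', 'e+', 'a+', 'm+', 'w+', 'o+', 'r+', 'k+', 's:']]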
        

    2.3 demo code: Char-RNN Language Modeling

    1. Goal: Presidential Tweet Bot (train a bot on a tweet dataset to generate tweets, i.e., character-level text generation)

    2. Dataset: 19,469 of Donald Trump's tweets

    3. References:

      [1] https://docs.google.com/presentation/d/1QydMhsGFeUzDYZr7dV0tt9WJryleOoKUyJth4ftRDbA/edit#slide=id.g1ce324a1fd_0_496

      [2] https://docs.google.com/document/d/1_ZqzBqFMV8YmdC2PmaTXOB9O1BZ14yLNTQd2_Bkff9Y/edit#

      (focus on the details of the tweet bot)

      [3] http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (some of Andrej Karpathy's striking findings)

    4. More experiment details: TODO (a minimal sketch of a character-level model follows below)
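
    Since the demo code itself is not reproduced in these notes, here is a minimal graph-mode sketch of a character-level language model in the spirit of the tweet bot (all hyperparameters, placeholder names and the loss wiring are my own illustration, not the course's example code):

      import tensorflow as tf
      
      vocab_size = 128        # distinct characters in the tweet corpus (made up)
      hidden_size = 256
      num_steps = 50          # characters per training window (made up)
      learning_rate = 0.001
      
      # batches of character indices, padded/truncated to num_steps
      seq = tf.placeholder(tf.int32, [None, num_steps])
      
      # one-hot encode; inputs are characters 0..T-2, targets are characters 1..T-1
      one_hot = tf.one_hot(seq, vocab_size)
      inputs, targets = one_hot[:, :-1], one_hot[:, 1:]
      
      cell = tf.nn.rnn_cell.GRUCell(hidden_size)
      output, out_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
      
      # project the hidden state to character logits at every timestep
      logits = tf.layers.dense(output, vocab_size)
      
      loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits_v2(labels=targets, logits=logits))
      train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
      
      # At generation time: feed a seed character, sample the next character from
      # tf.nn.softmax(logits) at the last real step, append it, and repeat.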
