  • Notes on "Compression of Neural Machine Translation Models via Pruning"

    Problems of NMT Models

    [Figure: NMT model diagram with source language input, target language input, and target language output]
    1. Over-parameterization
    2. Long running time
    3. Overfitting
    4. Large storage size

    Redundancies in NMT Models

    Most important: higher layers, attention weights, and softmax weights.

    Most redundant: lower layers and embedding weights.

    Traditional Solutions

    Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS)

    Recent Approaches

    Magnitude-based pruning with iterative retraining yielded strong results for Convolutional Neural Networks (CNNs) performing visual tasks.
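
    A minimal sketch of magnitude-based pruning with iterative retraining, assuming a toy linear model trained by stochastic gradient descent rather than a CNN or an NMT system. The helper names `magnitude_mask` and `train`, the data, and the pruning schedule (25%, then 50%) are illustrative assumptions, not taken from the paper.

```python
def magnitude_mask(weights, fraction):
    """Return a 0/1 mask that drops the `fraction` of weights with smallest |w|."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])
    return [0.0 if i in pruned else 1.0 for i in range(len(weights))]


def train(weights, mask, data, lr=0.05, epochs=200):
    """Fit y = w . x by SGD on squared error; masked weights are held at zero."""
    w = [wi * mi for wi, mi in zip(weights, mask)]
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [(wi - lr * err * xi) * mi for wi, xi, mi in zip(w, x, mask)]
    return w


# Toy data generated by the weights [2.0, 0.05, -3.0, 0.1] (hypothetical values).
data = [
    ((1.0, 0.0, 0.0, 0.0), 2.0),
    ((0.0, 1.0, 0.0, 0.0), 0.05),
    ((0.0, 0.0, 1.0, 0.0), -3.0),
    ((0.0, 0.0, 0.0, 1.0), 0.1),
    ((1.0, 1.0, 1.0, 1.0), -0.85),
]

# Train densely first, then alternate pruning and retraining.
w = train([0.0] * 4, [1.0] * 4, data)
for fraction in (0.25, 0.5):
    mask = magnitude_mask(w, fraction)  # already-pruned weights are zero, so they stay pruned
    w = train(w, mask, data)
```

    In this toy run the two small weights are pruned, and retraining lets the remaining two adjust to compensate.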

    Other approaches prune whole neurons, either through sparsity-inducing regularizers or by 'wiring together' pairs of neurons with similar input weights.

    These approaches are much more constrained than weight-pruning schemes: they need to find entire zero rows of weight matrices, or near-identical pairs of rows, in order to prune a single neuron.

    Weight-pruning approaches

    Weight-pruning approaches allow weights to be pruned freely and independently of each other.
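
    The contrast between the two families can be sketched on a small matrix. This is an illustration only, assuming that each row holds the input weights of one neuron; the matrix values, the threshold, and the helper names are hypothetical.

```python
def prune_weights(matrix, threshold):
    """Zero every individual weight whose magnitude is below `threshold`."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in matrix]


def removable_neurons(matrix):
    """Indices of rows that are entirely zero, i.e. neurons that could be dropped."""
    return [i for i, row in enumerate(matrix) if all(w == 0.0 for w in row)]


W = [
    [0.9, -0.02, 0.4],
    [0.01, 0.03, -0.02],
    [-0.05, 0.7, 0.04],
]

pruned = prune_weights(W, threshold=0.1)
zero_count = sum(w == 0.0 for row in pruned for w in row)
neurons = removable_neurons(pruned)
```

    Here weight pruning zeroes 6 of the 9 weights, while only one row is entirely zero, so a neuron-level scheme could remove just that one neuron.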

    Many other compression techniques exist for neural networks, including:

    1. approaches based on low-rank approximations of weight matrices;
    2. weight sharing via hash functions.
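
    For the first item, the storage saving of a low-rank factorization is simple arithmetic: a matrix with `rows x cols` entries is replaced by two factors with `rank * (rows + cols)` entries in total. The sketch below only illustrates that count; the matrix sizes, the rank, and the helper names are assumptions, and it does not show how the factors would be found.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def dense_params(rows, cols):
    """Number of entries in a full rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Number of entries in factors U (rows x rank) and V (rank x cols)."""
    return rank * (rows + cols)


# A rank-1 matrix is fully described by its two factors.
U = [[1.0], [2.0], [3.0]]
V = [[4.0, 5.0]]
W = matmul(U, V)

# Hypothetical layer size: a 4000 x 2000 matrix approximated at rank 100.
dense = dense_params(4000, 2000)
factored = low_rank_params(4000, 2000, rank=100)
```

    In this hypothetical setting the factored form stores 600,000 values instead of 8,000,000.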

    Understanding NMT Weights

    Weight Subgroups in LSTM

    Details of the LSTM:

    \[\left(\begin{array}{c} i \\ f \\ o \\ \hat{h} \end{array}\right)=\left(\begin{array}{c} \operatorname{sigm} \\ \operatorname{sigm} \\ \operatorname{sigm} \\ \tanh \end{array}\right) T_{4n,2n}\left(\begin{array}{c} h_{t}^{l-1} \\ h_{t-1}^{l} \end{array}\right)\]

    From the LSTM inputs \(h_{t}^{l-1}\) and \(\left(h_{t-1}^{l}, c_{t-1}^{l}\right)\) we get \(\left(h_{t}^{l}, c_{t}^{l}\right)\):

    \[\begin{array}{l} c_{t}^{l}=f \circ c_{t-1}^{l}+i \circ \hat{h} \\ h_{t}^{l}=o \circ \tanh\left(c_{t}^{l}\right) \end{array}\]

    \(T_{4n,2n}\) is the \(4n \times 2n\) matrix that holds the LSTM's parameters.
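
    The LSTM equations above can be traced in a few lines of plain Python. This is a sketch under stated assumptions: vectors are Python lists, there is no bias term (none appears in the equations), and the function and argument names are illustrative.

```python
import math


def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(T, h_below, h_prev, c_prev):
    """One LSTM step; T has 4n rows and 2n columns, all vectors have length n."""
    n = len(h_prev)
    x = h_below + h_prev  # concatenate h_t^{l-1} and h_{t-1}^{l}, length 2n
    z = [sum(t * v for t, v in zip(row, x)) for row in T]  # T x, length 4n
    i = [sigm(v) for v in z[0:n]]               # input gate
    f = [sigm(v) for v in z[n:2 * n]]           # forget gate
    o = [sigm(v) for v in z[2 * n:3 * n]]       # output gate
    h_hat = [math.tanh(v) for v in z[3 * n:4 * n]]  # candidate values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, h_hat)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# With an all-zero T (n = 2), every gate is 0.5 and the candidate is 0,
# so the new cell state is half of the previous one.
T_zero = [[0.0] * 4 for _ in range(8)]
h, c = lstm_step(T_zero, [0.3, -0.1], [0.2, 0.4], [1.0, -2.0])
```

    Splitting `z` into four slices of length n corresponds to the four row blocks of \(T_{4n,2n}\), which is why pruning can treat the gates' weights separately.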

    [Figure: NMT architecture. The source and target language inputs are one-hot vectors of length V, mapped to word embeddings of length n, then passed through hidden layer 1 and hidden layer 2 (each of length n) and an attention hidden layer (length n) to a layer of length V; the target language output is one-hot vectors of length V. The initial state is zero, and there is one context vector of length n for each target word.]

    Pruning Schemes

    Suppose we wish to prune x% of the total parameters in the model. How do we distribute the pruning over the different weight classes?

    1. Class-blind: Take all parameters, sort them by magnitude and prune the \(x\%\) with smallest magnitude, regardless of weight class.
    2. Class-uniform: Within each class, sort the weights by magnitude and prune the \(x\%\) with smallest magnitude.
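
    The two schemes above can be compared on a toy example. This sketch assumes each weight class is a flat list of weights stored in a dict; the class names follow the notes, while the values and function names are hypothetical.

```python
def class_uniform(classes, fraction):
    """Prune the smallest-magnitude `fraction` of weights within each class."""
    result = {}
    for name, weights in classes.items():
        k = int(len(weights) * fraction)
        order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
        cut = set(order[:k])
        result[name] = [0.0 if i in cut else w for i, w in enumerate(weights)]
    return result


def class_blind(classes, fraction):
    """Pool all weights and prune the globally smallest-magnitude `fraction`."""
    pool = [(abs(w), name, i) for name, ws in classes.items() for i, w in enumerate(ws)]
    k = int(len(pool) * fraction)
    cut = {(name, i) for _, name, i in sorted(pool)[:k]}
    return {
        name: [0.0 if (name, i) in cut else w for i, w in enumerate(ws)]
        for name, ws in classes.items()
    }


classes = {
    "embedding": [0.01, -0.02, 0.03, 0.5],
    "softmax": [0.9, -0.8, 0.7, -0.6],
}

uniform = class_uniform(classes, 0.5)
blind = class_blind(classes, 0.5)
```

    At 50% pruning, class-uniform removes two weights from each class, including two large softmax weights, whereas class-blind removes all four embedding weights and leaves the softmax weights untouched.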

    With class-uniform pruning, the overall performance loss is caused disproportionately by a few classes: target layer 4, attention, and softmax weights. It seems that higher layers are more important than lower layers, and that attention and softmax weights are crucial.

  • Original article: https://www.cnblogs.com/wevolf/p/12105538.html