  • Notes on "Compression of Neural Machine Translation Models via Pruning"

    Problems with NMT models

    [Figure: example translation. Source language input: "I am a student". Target language input: "- Je suis étudiant". Target language output: "Je suis étudiant -".]

    1. Over-Parameterization
    2. Long running time
    3. Overfitting
    4. Large storage size

    Redundancy in NMT models

    Most important: higher layers, attention weights, and softmax weights.

    Most redundant: lower layers and embedding weights.

    Traditional Solutions

    Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS)

    Recent approaches

    Magnitude-based pruning with iterative retraining yielded strong results for convolutional neural networks (CNNs) performing visual tasks.
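
    The prune-then-retrain loop described above can be sketched roughly as follows. This is a toy illustration on a flat list of weights, not the procedure from the paper: `magnitude_mask`, `prune_with_retraining`, and the `no_op_retrain` stand-in are hypothetical names, and a real setup would call an actual training step where the stand-in is used.

```python
def magnitude_mask(weights, fraction):
    """Return a 0/1 mask that drops the `fraction` of weights smallest in magnitude."""
    k = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in order[:k]:
        mask[i] = 0
    return mask


def prune_with_retraining(weights, schedule, retrain):
    """Alternate pruning and retraining over an increasing sparsity schedule."""
    mask = [1] * len(weights)
    for fraction in schedule:
        mask = magnitude_mask(weights, fraction)
        weights = [w * m for w, m in zip(weights, mask)]
        # `retrain` is a hypothetical callback; it must keep masked weights at zero.
        weights = retrain(weights, mask)
    return weights, mask


def no_op_retrain(weights, mask):
    """Stand-in for real training: leaves weights unchanged and re-applies the mask."""
    return [w * m for w, m in zip(weights, mask)]


toy_weights = [0.5, -0.1, 0.03, -0.8, 0.2, 0.05, -0.3, 0.6]
pruned, final_mask = prune_with_retraining(toy_weights, [0.25, 0.5], no_op_retrain)
print(final_mask)
```

    Because pruned weights stay at zero, they are always among the smallest in the next round, so each step of an increasing schedule only adds to the set of pruned weights.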

    Neuron-pruning approaches

    Other approaches prune whole neurons rather than individual weights, using sparsity-inducing regularizers or by "wiring together" pairs of neurons with similar input weights.

    These approaches are much more constrained than weight-pruning schemes: to prune a single neuron, they must find an entire zero row of a weight matrix, or a pair of near-identical rows.

    Weight-pruning approaches

    Weight-pruning approaches allow weights to be pruned freely and independently of each other.
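
    To make the contrast concrete, here is a small sketch that counts how much each style could remove from the same matrix. It assumes each row holds one neuron's input weights and uses an arbitrary threshold of 0.05; the function names and the toy matrix `W` are made up for illustration.

```python
def small_weight_count(matrix, threshold):
    """Count individual weights whose magnitude is below the threshold."""
    return sum(1 for row in matrix for w in row if abs(w) < threshold)


def removable_neurons(matrix, threshold):
    """Return indices of rows in which every weight is below the threshold."""
    return [r for r, row in enumerate(matrix)
            if all(abs(w) < threshold for w in row)]


# Toy 3x3 weight matrix: one row per neuron (an assumption of this sketch).
W = [
    [0.90, 0.01, -0.40],
    [0.02, -0.03, 0.01],
    [0.50, 0.60, 0.02],
]

print(small_weight_count(W, 0.05))  # weights a weight-pruning scheme could drop
print(removable_neurons(W, 0.05))   # neurons a neuron-pruning scheme could drop
```

    In this toy matrix, weight pruning can drop five small weights, while neuron pruning can only drop the one row that is small everywhere, which covers three of them.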

    Other compression techniques for neural networks

    1. approaches based on low-rank approximations of weight matrices;
    2. weight sharing via hash functions.
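
    As a rough illustration of why low-rank approximation saves space: replacing a dense rows × cols matrix with the product of a rows × rank and a rank × cols factor changes the parameter count from rows · cols to rank · (rows + cols). The sizes below (n = 1000, rank 100) are assumed values picked for the arithmetic, not numbers from the paper.

```python
def dense_params(rows, cols):
    """Parameter count of a full rows x cols weight matrix."""
    return rows * cols


def low_rank_params(rows, cols, rank):
    """Parameter count of a (rows x rank) times (rank x cols) factorization."""
    return rank * (rows + cols)


n = 1000      # assumed hidden size
rank = 100    # assumed approximation rank

dense = dense_params(4 * n, 2 * n)
factored = low_rank_params(4 * n, 2 * n, rank)
print(dense, factored, dense / factored)
```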

    Understanding NMT Weights

    Weight Subgroups in LSTM

    Details of the LSTM:

    \[\left(\begin{array}{c} {i} \\ {f} \\ {o} \\ {\hat{h}} \end{array}\right)=\left(\begin{array}{c} {\operatorname{sigm}} \\ {\operatorname{sigm}} \\ {\operatorname{sigm}} \\ {\tanh } \end{array}\right) T_{4 n, 2 n}\left(\begin{array}{c} {h_{t}^{l-1}} \\ {h_{t-1}^{l}} \end{array}\right) \]

    We get \(\left(h_{t}^{l}, c_{t}^{l}\right)\) from the LSTM inputs \(\left(h_{t-1}^{l}, c_{t-1}^{l}\right)\):

    \[\begin{array}{l} {c_{t}^{l}=f \circ c_{t-1}^{l}+i \circ \hat{h}} \\ {h_{t}^{l}=o \circ \tanh \left(c_{t}^{l}\right)} \end{array} \]

    \(T_{4 n, 2 n}\) is the \(4n \times 2n\) weight matrix that holds the parameters of the LSTM layer, where \(n\) is the hidden size.
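
    The LSTM equations above can be traced in a few lines of plain Python. This is a minimal sketch that follows the formulas as written (no bias terms, one time step, one layer); the function names, the list-of-lists layout for the matrix, and the toy inputs are assumptions made for illustration.

```python
import math


def sigm(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(T, h_below, h_prev, c_prev):
    """One LSTM step: T is a 4n x 2n matrix given as a list of rows."""
    n = len(h_prev)
    x = h_below + h_prev                       # stacked input, length 2n
    z = [sum(t * v for t, v in zip(row, x)) for row in T]  # length 4n
    i = [sigm(v) for v in z[0:n]]              # input gate
    f = [sigm(v) for v in z[n:2 * n]]          # forget gate
    o = [sigm(v) for v in z[2 * n:3 * n]]      # output gate
    h_hat = [math.tanh(v) for v in z[3 * n:4 * n]]  # candidate values
    c = [f[k] * c_prev[k] + i[k] * h_hat[k] for k in range(n)]
    h = [o[k] * math.tanh(c[k]) for k in range(n)]
    return h, c


# Toy check with n = 2 and an all-zero T: every gate is 0.5 and h_hat is 0.
n = 2
T_zero = [[0.0] * (2 * n) for _ in range(4 * n)]
h, c = lstm_step(T_zero, [1.0, 2.0], [0.5, -0.5], [1.0, -2.0])
print(h, c)
```

    With an all-zero matrix the new cell state is simply half the previous one, which makes it easy to check the gate wiring by hand.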

    [Figure: NMT architecture for the sentence pair "I am a student" (source language input) and "Je suis étudiant" (target language input and output). Layer labels, in order: one-hot vectors (length V), word embeddings (length n), hidden layer 1 (length n), hidden layer 2 (length n), attention hidden layer (length n), scores (length V), one-hot vectors (length V). Also labeled: initial (zero) states; context vector (one for each target word), length n.]

    Pruning Schemes

    Suppose we wish to prune x% of the total parameters in the model. How should the pruning be distributed over the different weight classes?

    1. Class-blind: Take all parameters, sort them by magnitude and prune the \(x\%\) with smallest magnitude, regardless of weight class.
    2. Class-uniform: Within each class, sort the weights by magnitude and prune the \(x\%\) with smallest magnitude.
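
    The two schemes differ only in where the magnitude ranking is taken. The sketch below applies both to a toy model with two weight classes; the class names, the weight values, and the function names are invented for illustration and do not come from the paper.

```python
def class_uniform_prune(classes, fraction):
    """Zero the smallest `fraction` of weights separately within each class."""
    pruned = {}
    for name, weights in classes.items():
        k = int(len(weights) * fraction)
        order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
        drop = set(order[:k])
        pruned[name] = [0.0 if i in drop else w for i, w in enumerate(weights)]
    return pruned


def class_blind_prune(classes, fraction):
    """Zero the smallest `fraction` of all weights, ignoring class boundaries."""
    entries = [(abs(w), name, i)
               for name, weights in classes.items()
               for i, w in enumerate(weights)]
    k = int(len(entries) * fraction)
    drop = {(name, i) for _, name, i in sorted(entries)[:k]}
    return {name: [0.0 if (name, i) in drop else w
                   for i, w in enumerate(weights)]
            for name, weights in classes.items()}


# Toy model: one class with mostly small weights, one with large weights.
toy_classes = {
    "embedding": [0.01, 0.02, 0.03, 0.5],
    "softmax": [0.4, 0.6, -0.7, 0.9],
}

print(class_blind_prune(toy_classes, 0.5))
print(class_uniform_prune(toy_classes, 0.5))
```

    At 50% overall sparsity, class-blind pruning removes three of the four toy embedding weights but only one softmax weight, while class-uniform pruning removes exactly two from each class.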

    With class-uniform pruning, the overall performance loss is caused disproportionately by a few classes: target layer 4, the attention weights, and the softmax weights. This suggests that higher layers are more important than lower layers, and that attention and softmax weights are crucial.

  • Original article: https://www.cnblogs.com/wevolf/p/12105538.html