
A strategy to quantize the embedding layer

    Basic idea

Embedding is mainly used in word pre-training; word2vec and GloVe are two commonly used embedding methods. Generally speaking, the embedding matrix has size \(V \times h\), where \(V\) is the size of the one-hot vocabulary vector and \(h\) is the dimension of the embedded vector. For even a moderately large corpus, the number of parameters in this step is very large, mainly because \(V\) is large. The main idea is therefore not to represent a word with a one-hot vector, but with a code \(C_w\):

\[ C_{w}=\left(C_{w}^{1}, C_{w}^{2}, \ldots, C_{w}^{M}\right) \]

That is, each word is now represented in \(M\) dimensions, where \(C_w^i \in [1, K]\). Each \(C_w^i\) can therefore be regarded as a \(K\)-dimensional one-hot vector, and \(C_w\) is a collection of such one-hot vectors. To embed a word from its code \(C_w\), we need a set of matrices \(E_1, E_2, \dots, E_M\).

    For example

If we have \(C_{dog} = (3, 2, 4, 1)\) and \(C_{dogs} = (3, 2, 4, 2)\), then \(K = 4\) and \(M = 4\), with \(E_1 = \{e_{11}, e_{12}, e_{13}, e_{14}\}\), \(E_2 = \{e_{21}, e_{22}, e_{23}, e_{24}\}\), \(\dots\), up to \(E_4\). Note that each \(e_{ij}\) has dimension \(1 \times H\), and the embedding process is:

\[ E\left(C_{dog}\right)=\sum_{i=1}^{M} E_{i}\left(C_{dog}^{i}\right)=E_{1}(3)+E_{2}(2)+E_{3}(4)+E_{4}(1)=e_{13}+e_{22}+e_{34}+e_{41} \tag{1} \]

So the parameter matrix of this embedding process has size \(M \times K \times H\).
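
As a minimal sketch of formula (1) (with made-up codebook values), the composed embedding is just a sum of selected codebook rows:

```python
import numpy as np

M, K, H = 4, 4, 8                       # code length, codebook size, embedding dim
rng = np.random.default_rng(0)
codebooks = rng.normal(size=(M, K, H))  # E_1 ... E_M stacked: M x K x H parameters

def embed(code):
    """Sum the selected codeword from each codebook, as in formula (1)."""
    # codes are 1-indexed in the text, so subtract 1 for array indexing
    return sum(codebooks[i, c - 1] for i, c in enumerate(code))

C_dog  = (3, 2, 4, 1)
C_dogs = (3, 2, 4, 2)
print(embed(C_dog).shape)   # (8,) -- an H-dimensional composed embedding
```

Note that \(C_{dog}\) and \(C_{dogs}\) share three of their four codewords, so related words can share parameters.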

    Unknown parameters

To embed a word this way, the parameters we need to know are \(C\) and \(E_1, E_2, \dots, E_M\); we usually call \(E\) the codebooks. We wish to find a set of codes \(\hat C\) and a combined codebook \(\hat E\) that produce embeddings with the same effectiveness as \(\tilde E(w)\), where \(\tilde E(w)\) denotes the original embedding method, such as GloVe. Clearly we can obtain the parameter matrices \(C\) and \(E\) by minimizing a loss function, which gives the following formula:

\[ \begin{aligned} (\hat{C}, \hat{E}) &=\underset{C, E}{\operatorname{argmin}} \frac{1}{|V|} \sum_{w \in V}\left\|E\left(C_{w}\right)-\tilde{E}(w)\right\|^{2} \\ &=\underset{C, E}{\operatorname{argmin}} \frac{1}{|V|} \sum_{w \in V}\left\|\sum_{i=1}^{M} E_{i}\left(C_{w}^{i}\right)-\tilde{E}(w)\right\|^{2} \end{aligned} \tag{2} \]
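
As a rough sketch, for a fixed code assignment this objective could be computed as follows (the array shapes are my own assumptions):

```python
import numpy as np

def reconstruction_loss(codes, codebooks, original_emb):
    """Mean squared reconstruction error of formula (2).

    codes:        (|V|, M) integer array of code assignments (0-indexed)
    codebooks:    (M, K, H) array holding E_1 ... E_M
    original_emb: (|V|, H) array of the original (e.g. GloVe) embeddings
    """
    M = codebooks.shape[0]
    composed = sum(codebooks[i, codes[:, i]] for i in range(M))   # (|V|, H)
    return np.mean(np.sum((composed - original_emb) ** 2, axis=1))
```

The real difficulty is that the codes are discrete, so they cannot be optimized by plain gradient descent; this is what the next sections address.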

To learn the codebooks \(E\) and code assignments \(C\), the author proposes a straightforward method: learn the code assignments and codebooks simultaneously in an end-to-end neural network, encouraging discreteness with the Gumbel-Softmax trick to produce compositional codes.

    Why Gumbel-Softmax

As mentioned above, we obtain the codebooks and code assignments with a neural network, but the code assignments are discrete variables. Back propagation in a neural network works with continuous variables, so to back-propagate and run gradient descent through discrete variables we need to introduce Gumbel-Softmax.

Generally speaking, suppose the discrete variable \(z\) is a one-hot vector with distribution \(\boldsymbol{\pi}=\left[\pi_{1}; \pi_{2}; \ldots; \pi_{k}\right]\). The forward pass goes \(\pi \longrightarrow z\) to get a one-hot \(z\), but we cannot directly go back from \(z\) to \(\pi\) during back propagation, because the mapping \(\pi \longrightarrow z\) usually uses the argmax function, which is not differentiable.

Intuitive explanation

1. To sample the discrete variable, use a reparameterized sampling step: add Gumbel noise to the log-probabilities, which amounts to shifting the inputs of the softmax.
2. Replace argmax with a differentiable function (a temperature-controlled softmax) to represent the probability distribution.

    The corresponding two formulas are:

\[ z=\operatorname{one\_hot}\left(\operatorname{argmax}_{i}\left[g_{i}+\log \pi_{i}\right]\right) \\ y_{i}=\frac{\exp \left(\left(\log \left(\pi_{i}\right)+g_{i}\right) / \tau\right)}{\sum_{j=1}^{k} \exp \left(\left(\log \left(\pi_{j}\right)+g_{j}\right) / \tau\right)} \quad \text { for } \quad i=1, \ldots, k \]
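
A minimal NumPy sketch of these two steps (the function name and probabilities are illustrative):

```python
import numpy as np

def gumbel_softmax_sample(log_pi, tau=1.0, rng=np.random.default_rng(0)):
    """Relaxed one-hot sample: softmax((log_pi + Gumbel noise) / tau)."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))  # Gumbel(0, 1) noise
    y = np.exp((log_pi + g) / tau)
    return y / y.sum()

pi = np.array([0.1, 0.6, 0.3])
y = gumbel_softmax_sample(np.log(pi), tau=0.5)
print(y, y.argmax())  # y is a soft one-hot vector; argmax gives the hard sample z
```

As \(\tau \rightarrow 0\) the soft sample \(y\) approaches the hard one-hot \(z\), while remaining differentiable with respect to \(\pi\).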

To be honest, my own understanding of this part is still not very clear; I will explain it in more detail in a later blog post.

Learning with a Neural Network

First, note that the input of the neural network is a \(V\)-dimensional one-hot vector. What we need to do is minimize the loss function (2); in the process, we obtain the optimized parameters \(C\) and \(E\).

Let’s first look at the structure of the network, then explain what each step does and how the network parameters relate to the unknowns \(C\) and \(E\).

In this network, the first layer is \(\tilde E(w)\), a traditional embedding layer such as GloVe with \(\tilde{\mathbf{E}} \in \mathbb{R}^{|V| \times h}\); a hidden layer then produces \(h_w \in \mathbb{R}^{MK/2 \times 1}\), and we compute \(h_w\) and \(\alpha_w^i\) with:

\[ \begin{array}{l} \boldsymbol{h}_{\boldsymbol{w}}=\tanh \left(\theta^{\top} \tilde{E}(w)+b\right) \\ \boldsymbol{\alpha}_{\boldsymbol{w}}^{i}=\operatorname{softplus}\left(\theta_{i}^{\prime \top} h_{w}+b_{i}^{\prime}\right) \end{array} \tag{3} \]

and \(\alpha_w \in \mathbb{R}^{M \times K}\); the next step is Gumbel-Softmax, and we get:

\[ \begin{aligned} \left(d_{w}^{i}\right)_{k} &=\operatorname{softmax}_{\tau}\left(\log \boldsymbol{\alpha}_{\boldsymbol{w}}^{\boldsymbol{i}}+G\right)_{k} \\ &=\frac{\exp \left(\left(\log \left(\boldsymbol{\alpha}_{\boldsymbol{w}}^{\boldsymbol{i}}\right)_{k}+G_{k}\right) / \tau\right)}{\sum_{k^{\prime}=1}^{K} \exp \left(\left(\log \left(\boldsymbol{\alpha}_{\boldsymbol{w}}^{i}\right)_{k^{\prime}}+G_{k^{\prime}}\right) / \tau\right)} \end{aligned} \tag{4} \]

In the above process, we convert the problem of learning the discrete codes \(C_w\) into the problem of finding a set of optimal one-hot vectors \(d_w^1, \dots, d_w^M\), where \(d_w^i \in \mathbb{R}^{K \times 1}\).

In the next step, we combine the matrices \(E\) with \(C_w\) to get \(E(C_w)\). In the neural network, we use \(A_1, \dots, A_M\) to denote the matrices \(E_1, E_2, \dots, E_M\). Therefore:

\[ E\left(C_{w}\right)=\sum_{i=1}^{M} A_{i}^{\top} d_{w}^{i} \tag{5} \]

where \(A_i \in \mathbb{R}^{K \times H}\) and the result lies in \(\mathbb{R}^{H \times 1}\). So we can use formula (2) to calculate the loss and back-propagate.
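
Putting equations (3)–(5) together, here is a minimal PyTorch sketch of the whole encoder; the module name, layer sizes, and the use of `F.gumbel_softmax` are my own assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodeLearner(nn.Module):
    def __init__(self, M, K, H):
        super().__init__()
        self.M, self.K = M, K
        self.hidden = nn.Linear(H, M * K // 2)      # theta, b   -> h_w,     eq (3)
        self.logits = nn.Linear(M * K // 2, M * K)  # theta', b' -> alpha_w, eq (3)
        self.codebooks = nn.Parameter(torch.randn(M, K, H))  # A_1 ... A_M

    def forward(self, pretrained_emb, tau=1.0):
        h = torch.tanh(self.hidden(pretrained_emb))                  # eq (3)
        alpha = F.softplus(self.logits(h)).view(-1, self.M, self.K)  # eq (3)
        # Gumbel-Softmax over each of the M groups of K logits, eq (4)
        d = F.gumbel_softmax(torch.log(alpha + 1e-10), tau=tau, dim=-1)
        # Compose the embedding: sum_i A_i^T d_w^i, eq (5)
        recon = torch.einsum('bmk,mkh->bh', d, self.codebooks)
        return recon, d

# Training-step sketch: minimize formula (2) against pre-trained embeddings.
model = CodeLearner(M=32, K=16, H=300)
glove_batch = torch.randn(64, 300)        # stand-in for a batch of GloVe vectors
recon, _ = model(glove_batch)
loss = F.mse_loss(recon, glove_batch)
loss.backward()
```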

Getting the Codes from the Neural Network

In the above neural network, the main parameters trained are \(\left(\theta, b, \theta^{\prime}, b^{\prime}, A\right)\). Once the code-learning model is trained, the code \(C_w\) for each word can be obtained by applying argmax to the one-hot vectors \(d_w^1, \dots, d_w^M\). The basis vectors (codewords) for composing the embeddings are the row vectors of the weight matrices \(A\), which are exactly \(E_1, E_2, \dots, E_M\).
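
Continuing the PyTorch sketch above (same hypothetical `CodeLearner` and `model`), the codes can be read off after training by dropping the Gumbel noise and taking the argmax of each \(d_w^i\):

```python
# After training: compute the M groups of scores without noise and take argmax.
with torch.no_grad():
    h = torch.tanh(model.hidden(glove_batch))
    alpha = F.softplus(model.logits(h)).view(-1, model.M, model.K)
    codes = alpha.argmax(dim=-1)       # shape (batch, M), entries in [0, K-1]
# model.codebooks then plays the role of the codebooks E_1 ... E_M at deployment.
```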

So this process is equivalent to simplifying the embedding layer. The training method is similar to knowledge distillation, and in terms of the result it essentially quantizes the one-hot representation used by the original embedding model. Therefore, using this embedding layer for downstream tasks speeds up the model's computation and compresses its size.

    Where we compress the model

Finally, let's explain where this model compresses the original embedding layer. Consider a code \(C_{w}=\left(C_{w}^{1}, C_{w}^{2}, \ldots, C_{w}^{M}\right)\). In the original model a word is represented by a one-hot vector in \(\mathbb{R}^{|V|}\), whereas now \(C_w \in \mathbb{R}^{M \times K}\). If we binary-encode each \(C_w^i\), storing \(C_w\) takes only \(M \log _{2} K\) bits per word, compared with the \(|V|\)-dimensional one-hot vector of the original model, and we can choose appropriate \(M\) and \(K\) to increase the compression ratio. On the other hand, the embedding matrix itself also shrinks, from \(|V| \times H\) down to \(M \times K \times H\).
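
A quick back-of-the-envelope comparison with illustrative, GloVe-like numbers:

```python
import math

V, H = 100_000, 300            # vocabulary size, embedding dimension
M, K = 32, 16                  # code length, codebook size
BYTES_PER_FLOAT = 4

original   = V * H * BYTES_PER_FLOAT              # |V| x H embedding matrix
codes      = V * M * math.log2(K) / 8             # M * log2(K) bits per word
codebooks  = M * K * H * BYTES_PER_FLOAT          # M x K x H codebooks
compressed = codes + codebooks

print(f"original:   {original / 1e6:.1f} MB")     # 120.0 MB
print(f"compressed: {compressed / 1e6:.1f} MB")   # ~2.2 MB
print(f"ratio:      {original / compressed:.1f}x")
```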

Original post: https://www.cnblogs.com/wevolf/p/13091540.html