  • Vectorized implementation

    Vectorization

    Vectorization refers to a powerful way to speed up your algorithms. Numerical computing and parallel computing researchers have put decades of work into making certain numerical operations (such as matrix-matrix multiplication, matrix-matrix addition, matrix-vector multiplication) fast. The idea of vectorization is that we would like to express our learning algorithms in terms of these highly optimized operations.

    More generally, a good rule-of-thumb for coding Matlab/Octave is:

    Whenever possible, avoid using explicit for-loops in your code.

    A large part of vectorizing our Matlab/Octave code will focus on getting rid of for loops, since this lets Matlab/Octave extract more parallelism from your code, while also incurring less computational overhead from the interpreter.

    Use vector operations wherever possible; do not split a vector into scalars and then loop over them.
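    As a minimal illustration of this rule (a sketch; the vectors a and b and their length n are made up for the example), compare a scalar loop with the equivalent single vectorized call for an inner product:

    % Hypothetical example: inner product of two column vectors a and b of length n
    % Loop version: pays interpreter overhead on every iteration
    s = 0;
    for i=1:n,
      s = s + a(i) * b(i);
    end;

    % Vectorized version: one call into an optimized linear-algebra routine
    s = a' * b;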

     

    Logistic Regression Vectorization Example

    Consider training a logistic regression model using batch gradient ascent. Suppose our hypothesis is

    \begin{align}
    h_\theta(x) = \frac{1}{1+\exp(-\theta^T x)},
    \end{align}

    where we let x_0 = 1, so that x \in \Re^{n+1} and \theta \in \Re^{n+1}, and \theta_0 is our intercept term. We have a training set \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\} of m examples, and the batch gradient ascent update rule is \theta := \theta + \alpha \nabla_\theta \ell(\theta), where \ell(\theta) is the log likelihood and \nabla_\theta \ell(\theta) is its derivative.

    We thus need to compute the gradient:

    \begin{align}
    \nabla_\theta \ell(\theta) = \sum_{i=1}^m \left( y^{(i)} - h_\theta(x^{(i)}) \right) x^{(i)}.
    \end{align}

    Further, suppose the Matlab/Octave variable y is a row vector of the labels in the training set, so that the variable y(i) is y^{(i)} \in \{0,1\}, and suppose x is an (n+1)-by-m matrix whose i-th column x(:,i) holds the i-th training example x^{(i)}.

    Here's a truly horrible, extremely slow implementation of the gradient computation:

    % Implementation 1
    grad = zeros(n+1,1);
    for i=1:m,
      h = sigmoid(theta'*x(:,i));
      temp = y(i) - h; 
      for j=1:n+1,
        grad(j) = grad(j) + temp * x(j,i); 
      end;
    end;

    These two nested for-loops make this very slow. Here's a more typical implementation that partially vectorizes the algorithm and gets better performance:

    % Implementation 2 
    grad = zeros(n+1,1);
    for i=1:m,
      grad = grad + (y(i) - sigmoid(theta'*x(:,i)))* x(:,i);
    end;
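    The remaining loop over the m examples can be removed as well. The following is only a sketch of one fully vectorized version, under the assumptions that x is stored as an (n+1)-by-m matrix whose i-th column is x^{(i)}, y is the 1-by-m row vector of labels, and sigmoid is implemented element-wise (as with the vectorized activation function further below):

    % Fully vectorized sketch (assumes x is (n+1)-by-m, y is 1-by-m,
    % and sigmoid operates element-wise on its input)
    h = sigmoid(theta'*x);      % 1-by-m row vector of predictions
    grad = x * (y - h)';        % (n+1)-by-1 gradient: sums (y(i)-h(i))*x(:,i) over all i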

    Neural Network Vectorization

    Forward propagation

    Consider a 3-layer neural network (with one input, one hidden, and one output layer), and suppose x is a column vector containing a single training example x^{(i)} \in \Re^{n}. Then the forward propagation step is given by:

    \begin{align}
    z^{(2)} &= W^{(1)} x + b^{(1)} \\
    a^{(2)} &= f(z^{(2)}) \\
    z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\
    h_{W,b}(x) &= a^{(3)} = f(z^{(3)})
    \end{align}

    This is a fairly efficient implementation for a single example. If we have m examples, then we would wrap a for loop around this.

    % Unvectorized implementation
    for i=1:m, 
      z2 = W1 * x(:,i) + b1;
      a2 = f(z2);
      z3 = W2 * a2 + b2;
      h(:,i) = f(z3);
    end;

    For many algorithms, we will represent intermediate stages of computation via vectors. For example, z2, a2, and z3 here are all column vectors that are used to compute the activations of the hidden and output layers. In order to take better advantage of parallelism and efficient matrix operations, we would like to have our algorithm operate simultaneously on many training examples. Let us temporarily ignore b1 and b2 (say, set them to zero for now). We can then implement the following:

    % Vectorized implementation (ignoring b1, b2)
    z2 = W1 * x;
    a2 = f(z2);
    z3 = W2 * a2;
    h = f(z3);

    In this implementation, z2, a2, and z3 are all matrices, with one column per training example.

    A common design pattern in vectorizing across training examples is that whereas previously we had a column vector (such as z2) per training example, we can often instead compute a matrix in which all of these column vectors are stacked together. Concretely, in this example, a2 becomes an s2-by-m matrix (where s2 is the number of units in layer 2 of the network, and m is the number of training examples). And, the i-th column of a2 contains the activations of the hidden units (layer 2 of the network) when the i-th training example x(:,i) is input to the network. For this to work, the activation function f must itself accept a matrix and apply f element-wise to every entry:

    % Inefficient, unvectorized implementation of the activation function
    function output = unvectorized_f(z)
    output = zeros(size(z));
    for i=1:size(z,1), 
      for j=1:size(z,2),
        output(i,j) = 1/(1+exp(-z(i,j)));
      end; 
    end;
    end
     
    % Efficient, vectorized implementation of the activation function
    function output = vectorized_f(z)
    output = 1./(1+exp(-z));     % "./" is Matlab/Octave's element-wise division operator. 
    end
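    The derivative f' used later by backpropagation can be vectorized in the same way. The sketch below assumes the sigmoid activation, using the identity f'(z) = f(z)(1 - f(z)); the function name vectorized_fprime is made up here:

    % Vectorized derivative of the sigmoid activation (sketch)
    function output = vectorized_fprime(z)
    a = 1./(1+exp(-z));          % f(z), element-wise
    output = a .* (1 - a);       % f'(z) = f(z).*(1 - f(z)), element-wise
    end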

    Finally, our vectorized implementation of forward propagation above had ignored b1 and b2. To incorporate those back in, we will use Matlab/Octave's built-in repmat function. We have:

    % Vectorized implementation of forward propagation
    z2 = W1 * x + repmat(b1,1,m);
    a2 = f(z2);
    z3 = W2 * a2 + repmat(b2,1,m);
    h = f(z3);

    repmat !! replicates (tiles) a matrix to the required size !!
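    As a small usage sketch (assuming b1 is an s2-by-1 column vector), repmat(b1,1,m) lays m copies of b1 side by side to form an s2-by-m matrix, so the addition matches the shape of W1*x; bsxfun gives an equivalent broadcasted addition without materializing the tiled matrix:

    % Usage sketch: adding the bias to every column of W1*x (b1 assumed s2-by-1)
    B1 = repmat(b1, 1, m);              % s2-by-m: column i is a copy of b1
    z2 = W1 * x + B1;                   % same as z2 = W1*x + repmat(b1,1,m) above
    z2 = bsxfun(@plus, W1 * x, b1);     % equivalent: broadcasts b1 across the m columns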

     

    Backpropagation

    We are in a supervised learning setting, so that we have a training set \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\} of m training examples. (For the autoencoder, we simply set y^{(i)} = x^{(i)}, but our derivation here will consider this more general setting.)

    Recall that for a single training example (x, y), we can compute the derivatives as

    
    \begin{align}
    \delta^{(3)} &= -(y - a^{(3)}) \bullet f'(z^{(3)}), \\
    \delta^{(2)} &= ((W^{(2)})^T \delta^{(3)}) \bullet f'(z^{(2)}), \\
    \nabla_{W^{(2)}} J(W,b;x,y) &= \delta^{(3)} (a^{(2)})^T, \\
    \nabla_{W^{(1)}} J(W,b;x,y) &= \delta^{(2)} (a^{(1)})^T.
    \end{align}

    Here, \bullet denotes the element-wise product. For simplicity, our description here will ignore the derivatives with respect to b^{(l)}, though your implementation of backpropagation will have to compute those derivatives too.

    gradW1 = zeros(size(W1));
    gradW2 = zeros(size(W2));
    for i=1:m,
      delta3 = -(y(:,i) - h(:,i)) .* fprime(z3(:,i));
      delta2 = (W2'*delta3) .* fprime(z2(:,i));

      gradW2 = gradW2 + delta3*a2(:,i)';
      gradW1 = gradW1 + delta2*a1(:,i)';   % a1 denotes the input activations, i.e., a1 = x
    end;

    This implementation has a for loop. We would like to come up with an implementation that simultaneously performs backpropagation on all the examples, and eliminates this for loop.

    To do so, we will replace the vectors delta3 and delta2 with matrices, where each column corresponds to one training example. We will also implement a function fprime(z) that takes as input a matrix z and applies f'(\cdot) element-wise.
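    One possible fully vectorized version is sketched below, under the assumptions that z2, z3, a2, and h are the matrices produced by the vectorized forward pass above (one column per example), that a1 = x, and that fprime operates element-wise on matrices (as in the vectorized_fprime sketch earlier). The sum over examples then collapses into single matrix products:

    % Vectorized backpropagation sketch (one column per training example; a1 = x)
    delta3 = -(y - h) .* fprime(z3);        % s3-by-m
    delta2 = (W2'*delta3) .* fprime(z2);    % s2-by-m

    gradW2 = delta3 * a2';                  % equals the sum over i of delta3(:,i)*a2(:,i)'
    gradW1 = delta2 * a1';                  % equals the sum over i of delta2(:,i)*a1(:,i)'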

     

     

    Sparse autoencoder

    When performing backpropagation on a single training example, we took the sparsity penalty into account by computing the following:

    \begin{align}
    \delta^{(2)}_i =
      \left( \left( \sum_{j=1}^{s_{2}} W^{(2)}_{ji} \delta^{(3)}_j \right)
      + \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i).
    \end{align}

    In other words, do not loop over the training examples one at a time to update the parameters; instead, organize the examples into matrices and apply matrix operations, which is far more efficient.
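    The sparsity term vectorizes across examples as well. The sketch below assumes rho is the target sparsity, beta the sparsity weight, rhohat the s2-by-1 vector of mean hidden-unit activations, and the remaining variables the matrices from the vectorized backpropagation sketch above; these names are assumptions for illustration:

    % Vectorized sparse-autoencoder delta2 sketch (variable names assumed)
    rhohat = (1/m) * sum(a2, 2);                                       % s2-by-1 mean activation per hidden unit
    sparsity_delta = beta * (-rho ./ rhohat + (1-rho) ./ (1-rhohat));  % s2-by-1 penalty term
    delta2 = (W2'*delta3 + repmat(sparsity_delta, 1, m)) .* fprime(z2);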
