zoukankan      html  css  js  c++  java
  • CS229 6.5 Neurons Networks Implements of Sparse Autoencoder

    sparse autoencoder的一个实例练习,这个例子所要实现的内容大概如下:从给定的很多张自然图片中截取出大小为8*8的小patches图片共10000张,现在需要用sparse autoencoder的方法训练出一个隐含层网络所学习到的特征。该网络共有3层,输入层是64个节点,隐含层是25个节点,输出层当然也是64个节点了。

    main函数,  分五步走,每个函数的实现细节在下边都列出了。

      1 %%======================================================================
      2 %% STEP 0: Here we provide the relevant parameters values that will
      3 %  allow your sparse autoencoder to get good filters; you do not need to
      4 %  change the parameters below.
      5  
      6 visibleSize = 8*8;   % number of input units
      7 hiddenSize = 25;     % number of hidden units
      8 sparsityParam = 0.01;   % desired average activation of the hidden units.
      9                      % (This was denoted by the Greek alphabet rho,
     10                      % which looks like a lower-case "p",
     11              %  in the lecture notes).
     12 lambda = 0.0001;     % weight decay parameter      
     13 beta = 3;            % weight of sparsity penalty term      
     14  
     15 %%======================================================================
     16 %% STEP 1: Implement sampleIMAGES
     17 %
     18 %  After implementing sampleIMAGES, the display_network command should
     19 %  display a random sample of 200 patches from the dataset
     20 patches = sampleIMAGES;
     21 display_network(patches(:,randi(size(patches,2),200,1)),8);
     22  
     23  
     24 %  Obtain random parameters theta
     25 theta = initializeParameters(hiddenSize, visibleSize);
     26  
     27 %%======================================================================
     28 %% STEP 2: Implement sparseAutoencoderCost
     29 %
     30 %  You can implement all of the components (squared error cost, weight decay term,
     31 %  sparsity penalty) in the cost function at once, but it may be easier to do
     32 %  it step-by-step and run gradient checking (see STEP 3) after each step.  We
     33 %  suggest implementing the sparseAutoencoderCost function using the following steps:
     34 %
     35 %  (a) Implement forward propagation in your neural network, and implement the
     36 %      squared error term of the cost function.  Implement backpropagation to
     37 %      compute the derivatives.   Then (using lambda=beta=0), run Gradient Checking
     38 %      to verify that the calculations corresponding to the squared error cost
     39 %      term are correct.
     40 %
     41 %  (b) Add in the weight decay term (in both the cost function and the derivative
     42 %      calculations), then re-run Gradient Checking to verify correctness.
     43 %
     44 %  (c) Add in the sparsity penalty term, then re-run Gradient Checking to
     45 %      verify correctness.
     46 %
     47 %  Feel free to change the training settings when debugging your
     48 %  code.  (For example, reducing the training set size or
     49 %  number of hidden units may make your code run faster; and setting beta
     50 %  and/or lambda to zero may be helpful for debugging.)  However, in your
     51 %  final submission of the visualized weights, please use parameters we
     52 %  gave in Step 0 above.
     53  
     54 [cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
     55                                     lambda,sparsityParam, beta, patches);
     56  
     57 %%======================================================================
     58 %% STEP 3: Gradient Checking
     59 %
     60 % Hint: If you are debugging your code, performing gradient checking on smaller models
     61 % and smaller training sets (e.g., using only 10 training examples and 1-2 hidden
     62 % units) may speed things up.
     63  
     64 % First, lets make sure your numerical gradient computation is correct for a
     65 % simple function.  After you have implemented computeNumericalGradient.m,
     66 % run the following:
     67 checkNumericalGradient();
     68  
     69 % Now we can use it to check your cost function and derivative calculations
     70 % for the sparse autoencoder. 
     71 numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
     72                         hiddenSize, lambda,sparsityParam, beta, patches), theta);
     73  
     74 % Use this to visually compare the gradients side by side
     75 disp([numgrad grad]);
     76  
     77 % Compare numerically computed gradients with the ones obtained from backpropagation
     78 diff = norm(numgrad-grad)/norm(numgrad+grad);
     79 disp(diff); % Should be small. In our implementation, these values are
     80             % usually less than 1e-9.
     81             % When you got this working, Congratulations!!!
     82  
     83 %%======================================================================
     84 %% STEP 4: After verifying that your implementation of
     85 %  sparseAutoencoderCost is correct, You can start training your sparse
     86 %  autoencoder with minFunc (L-BFGS).
     87  
     88 %  Randomly initialize the parameters
     89 theta = initializeParameters(hiddenSize, visibleSize);
     90  
     91 %  Use minFunc to minimize the function
     92 addpath minFunc/
     93 options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
     94                           % function. Generally, for minFunc to work, you
     95                           % need a function pointer with two outputs: the
     96                           % function value and the gradient. In our problem,
     97                           % sparseAutoencoderCost.m satisfies this.
     98 options.maxIter = 400;    % Maximum number of iterations of L-BFGS to run
     99 options.display = 'on';
    100 [opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p,visibleSize, hiddenSize, ...
    101                             lambda, sparsityParam, beta, patches),theta, options);
    102 %%======================================================================
    103 %% STEP 5: Visualization
    104  
    105 W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
    106 display_network(W1', 12);
    107  
    108 print -djpeg weights.jpg   % save the visualization to a file
    109  
    110 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 对应step1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    111 %三个函数(sampleIMAGES)(normalizeData)(initializeParameters)%%%%
    112 function patches = sampleIMAGES()
    113 load IMAGES;    % 加载初始的10张512*512大图片
    114  
    115 patchsize = 8;  % 采样大小
    116 numpatches = 10000;
    117  
    118 %  初始化该矩阵为0,该矩阵为 64*10000维每一列为一张图片.
    119 patches = zeros(patchsize*patchsize, numpatches);
    120    
    121 %  IMAGES 为一个包含10 张images的三维数组,IMAGES(:,:,6) 是一个第六张图片的 512x512 的二维数组,
    122 %  命令 "imagesc(IMAGES(:,:,6)), colormap gray;" 可以把第六张图可视化.
    123 % 这几张图是经过whiteing预处理的?
    124 %  IMAGES(21:30,21:30,1) 就是从第一张图采样得到的(21,21) to (30,30) 的小patchs
    125  
    126 %在每张图片中随机选取1000个patch,共10000个patch
    127 for imageNum = 1:10
    128     [rowNum colNum] = size(IMAGES(:,:,imageNum));
    129     %实现每张图片选取1000个patch
    130     for patchNum = 1:1000
    131         %得到左上角的两个点
    132         xPos = randi([1,rowNum-patchsize+1]);
    133         yPos = randi([1, colNum-patchsize+1]);
    134         %填充到矩阵里
    135         patches(:,(imageNum-1)*1000+patchNum) = ...
    136             reshape(IMAGES(xPos:xPos+7,yPos:yPos+7,imageNum),64,1);
    137     end
    138 end
    139 %由于autoencoder的激励函数是sigmod函数,输出值限定在[0,1],故为了达到H W,b(x)= x,x作为输入,
    140 %也要限定在0-1之间,故需要进行正则化
    141 patches = normalizeData(patches);
    142 end
    143  
    144 % 正则化的函数,不太明白s-sigma法则?
    145 function patches = normalizeData(patches)
    146 % 减去均值
    147 patches = bsxfun(@minus, patches, mean(patches));
    148 % s = std(X),此处X是一个矢量,该函数返回标准偏差(注意其分母为n-1,而不是n) 。
    149 % 结果s是一个X各样本偏差无偏估计的平方根(X包含独立的、同分布样本)。
    150 % 如果X是一个矩阵,该函数返回一个行矢量,它包含了X每列元素的标准偏差。
    151 pstd = 3 * std(patches(:));
    152 patches = max(min(patches, pstd), -pstd) / pstd;
    153 % 重新压缩 从[-1,1] 到 [0.1,0.9]
    154 patches = (patches + 1) * 0.4 + 0.1;
    155 end
    156  
    157 %首先初始化参数
    158 function theta = initializeParameters(hiddenSize, visibleSize)
    159 % Initialize parameters randomly based on layer sizes.
    160  % we'll choose weights uniformly from the interval [-r, r]
    161 r  = sqrt(6) / sqrt(hiddenSize+visibleSize+1);
    162 %rand(a,b)产生均匀分布的随机矩阵维度为a*b,元素取值范围0.01.0163 W1 = rand(hiddenSize, visibleSize) * 2 * r - r;
    164 %rand(a,b)*2*r即取值范围为(0-2r), rand(a,b)*2*r -r即取值范围为(-r - r)
    165 W2 = rand(visibleSize, hiddenSize) * 2 * r - r;
    166 b1 = zeros(hiddenSize, 1); %连接到hidden unit的偏置单元
    167 b2 = zeros(visibleSize, 1); %链接到output layer的偏置单元
    168 %  将矩阵合并为一个向量
    169 theta = [W1(:) ; W2(:) ; b1(:) ; b2(:)];
    170 %初始化参数结束
    171 end
    172  
    173  
    174  
    175 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 对应step 2 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
    176 %%%%%返回稀疏损失函数的值与梯度值%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    177 function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
    178                                         lambda, sparsityParam, beta, data)
    179 % visibleSize: 输入层单元数
    180 % hiddenSize: 隐藏单元数
    181 % lambda: 正则项
    182 % sparsityParam: (p)指定的平均激活度p
    183 % beta: 稀疏权重项B
    184 % data: 64x10000 的矩阵为training data,data(:,i)  是第i个训练样例.  
    185 % 把参数拼接为一个向量,因为采用L-BFGS优化,L-BFGS要求的就是向量.
    186 % 将长向量转换成每一层的权值矩阵和偏置向量值
    187 % theta向量的的 1->hiddenSize*visibleSize,W1共hiddenSize*visibleSize 个元素,重新作为矩阵
    188 W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
    189  
    190 %类似以上一直往后放
    191 W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
    192 b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
    193 b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
    194  
    195 % 参数对应的梯度矩阵 ;
    196 cost = 0;
    197 W1grad = zeros(size(W1));
    198 W2grad = zeros(size(W2));
    199 b1grad = zeros(size(b1));
    200 b2grad = zeros(size(b2));
    201  
    202 Jcost = 0;  %直接误差
    203 Jweight = 0;%权值惩罚
    204 Jsparse = 0;%稀疏性惩罚
    205 [n m] = size(data); %m为样本的个数,n为样本的特征数
    206  
    207 %前向算法计算各神经网络节点的线性组合值和active值
    208 %W1为 hiddenSize*visibleSize的矩阵
    209 %data为 visibleSize* trainexampleNum的矩阵
    210 %remat(b1,1,m)把向量b1复制扩展为hiddenSize*m列
    211 % 根据公式 Z^(l) = z^(l-1)*W^(l-1)+b^(l-1)
    212 %z2保存的是10000个样本下隐藏层的输入,为hiddenSize*m维的矩阵,每一列代表一次输入
    213 z2= W1*data + remat(b1,1,m);%第二层的输入
    214 a2 = sigmoid(z2); %对z2取sigmod 即得到a2,即隐藏层的输出
    215 z3 = W2*a2+repmat(b2,1,m); %output layer 的输入
    216 a3 = sigmoid(z3); %output 层的输出
    217  
    218 % 计算预测产生的误差
    219 %对应J(W,b), 外边的sum是对所有样本求和,里边的sum是对输出层的所有分量求和
    220 Jcost = (0.5/m)*sum(sum((a3-data).^2));
    221 %计算权值惩罚项 正则化项,并没有带正则项参数
    222 Jweight = (1/2)*(sum(sum(W1.^2))+sum(sum(W2.^2)));
    223 %计算稀疏性规则项 sum(matrix,2)是进行按行求和运算,即所有样本在隐层的输出累加求均值
    224 % rho为一个hiddenSize*1 维的向量
    225  
    226 rho = (1/m).*sum(a2,2);%求出隐含层输出aj的平均值向量 rho为hiddenSize维的
    227 %求稀疏项的损失
    228 Jsparse = sum(sparsityParam.*log(sparsityParam./rho)+(1-sparsityParam).*log((1-sparsityParam)./(1-rho)));
    229 %损失函数的总表达式 损失项 + 正则化项 + 稀疏项
    230 cost = Jcost + lambda*Jweight + beta*Jsparse;
    231 %计算l = 3 即 output-layer层的误差dleta3,因为在autoencoder中输入等于输出h(W,b)=x
    232 delta3 = -(data-a3).*sigmoidInv(z3);
    233 %因为加入了稀疏规则项,所以计算偏导时需要引入该项,sterm为稀疏项,为hiddenSize维的向量
    234 sterm = beta*(-sparsityParam./rho+(1-sparsityParam)./(1-rho))
    235 % W2 为64*25的矩阵,d3为第三层的输出为64*10000的矩阵,每一列为每个样本x^(i)的输出,W2'为W2的转置
    236 % repmat(sterm,1,m)会把函数复制扩展为m列的矩阵,每一列都为sterm向量。
    237 % d2为hiddenSize*10000的矩阵
    238 delta2 = (W2'*delta3+repmat(sterm,1,m)).*sigmoidInv(z2);
    239  
    240 %计算W1grad
    241 % data'为10000*64的矩阵 d2*data' 位25*64的矩阵
    242 W1grad = W1grad+delta2*data';
    243 W1grad = (1/m)*W1grad+lambda*W1;
    244  
    245 %计算W2grad 
    246 W2grad = W2grad+delta3*a2';
    247 W2grad = (1/m).*W2grad+lambda*W2;
    248  
    249 %计算b1grad
    250 b1grad = b1grad+sum(delta2,2);
    251 b1grad = (1/m)*b1grad;%注意b的偏导是一个向量,所以这里应该把每一行的值累加起来
    252  
    253 %计算b2grad
    254 b2grad = b2grad+sum(delta3,2);
    255 b2grad = (1/m)*b2grad;
    256 %计算完成重新转为向量
    257 grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
    258 end
    259  
    260 %-------------------------------------------------------------------
    261 % Here's an implementation of the sigmoid function, which you may find useful
    262 % in your computation of the costs and the gradients.  This inputs a (row or
    263 % column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).
    264  
    265 function sigm = sigmoid(x)
    266     sigm = 1 ./ (1 + exp(-x));
    267 end
    268  
    269 %sigmoid函数的导函数
    270 function sigmInv = sigmoidInv(x)
    271     sigmInv = sigmoid(x).*(1-sigmoid(x));
    272 end
    273  
    274 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 对应step 3 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
    275 %三个函数:(checkNumericalGradient)(simpleQuadraticFunction)(computeNumericalGradient)
    276 function [] = checkNumericalGradient()
    277 x = [4; 10];
    278 %当前简单函数实际的值与实际的导函数
    279 [value, grad] = simpleQuadraticFunction(x);
    280 % 在点 x 处计算简单函数的梯度,("@simpleQuadraticFunction" denotes a pointer to a function.)
    281 numgrad = computeNumericalGradient(@simpleQuadraticFunction, x);
    282 % disp()等价于 print()
    283 disp([numgrad grad]);
    284 fprintf('The above two columns you get should be very similar.
    (Left-Your Numerical Gradient, Right-Analytical Gradient)
    
    ');
    285 % norm 等价于 sqrt(sum(X.^2)); 如果实现正确,设置 EPSILON = 0.0001,误差应该为2.1452e-12
    286 diff = norm(numgrad-grad)/norm(numgrad+grad);
    287 disp(diff);
    288 fprintf('Norm of the difference between numerical and analytical gradient (should be < 1e-9)
    
    ');
    289 end
    290  
    291  %这个简单函数用来检验写的computeNumericalGradient函数的正确性
    292 function [value,grad] = simpleQuadraticFunction(x)
    293 % this function accepts a 2D vector as input.
    294 % Its outputs are:
    295 %   value: h(x1, x2) = x1^2 + 3*x1*x2
    296 %   grad: A 2x1 vector that gives the partial derivatives of h with respect to x1 and x2
    297 % Note that when we pass @simpleQuadraticFunction(x) to computeNumericalGradients, we're assuming
    298 % that computeNumericalGradients will use only the first returned value of this function.
    299 value = x(1)^2 + 3*x(1)*x(2);
    300 grad = zeros(2, 1);
    301 grad(1)  = 2*x(1) + 3*x(2);
    302 grad(2)  = 3*x(1);
    303 end
    304  
    305 %梯度检验的函数
    306 function numgrad = computeNumericalGradient(J, theta)
    307 % theta: 参数,向量或者实数均可
    308 % J: 输出值为实数的函数. 调用y = J(theta)将会返回函数在theta处的值
    309  
    310 % numgrad初始化为0,与theta维度相同
    311 numgrad = zeros(size(theta));
    312 EPSILON = 1e-4;
    313 % theta是一个行向量,size(theta,1)是求行数
    314 n = size(theta,1);
    315 %产生一个维度为n的单位矩阵
    316 E = eye(n);
    317 for i = 1:n
    318     % (n,:)代表第n行,所有的列
    319     % (:,n)代表所有行,第n列
    320     % 由于E是单位矩阵,所以只有第i行第i列的元素变为EPSILON
    321     delta = E(:,i)*EPSILON;
    322     %向量第i维度的值
    323     numgrad(i) = (J(theta+delta)-J(theta-delta))/(EPSILON*2.0);
    324 end
    325 %% ---------------------------------------------------------------
    326  
    327 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 对应step 5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
    328 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%关于函数的展示%%%%%%%%%%%%%%%%%%%%%%%%%%%
    329 function [h, array] = display_network(A, opt_normalize, opt_graycolor, cols, opt_colmajor)
    330 % This function visualizes filters in matrix A. Each column of A is a
    331 % filter. We will reshape each column into a square image and visualizes
    332 % on each cell of the visualization panel.
    333 % All other parameters are optional, usually you do not need to worry
    334 % about it.
    335 % opt_normalize: whether we need to normalize the filter so that all of
    336 % them can have similar contrast. Default value is true.
    337 % opt_graycolor: whether we use gray as the heat map. Default is true.
    338 % cols: how many columns are there in the display. Default value is the
    339 % squareroot of the number of columns in A.
    340 % opt_colmajor: you can switch convention to row major for A. In that
    341 % case, each row of A is a filter. Default value is false.
    342 warning off all
    343  
    344 if ~exist('opt_normalize', 'var') || isempty(opt_normalize)
    345     opt_normalize= true;
    346 end
    347  
    348 if ~exist('opt_graycolor', 'var') || isempty(opt_graycolor)
    349     opt_graycolor= true;
    350 end
    351  
    352 if ~exist('opt_colmajor', 'var') || isempty(opt_colmajor)
    353     opt_colmajor = false;
    354 end
    355  
    356 % rescale
    357 A = A - mean(A(:));
    358  
    359 if opt_graycolor, colormap(gray); end
    360  
    361 % compute rows, cols
    362 [L M]=size(A);
    363 sz=sqrt(L);
    364 buf=1;
    365 if ~exist('cols', 'var')
    366     if floor(sqrt(M))^2 ~= M
    367         n=ceil(sqrt(M));
    368         while mod(M, n)~=0 && n<1.2*sqrt(M), n=n+1; end
    369         m=ceil(M/n);
    370     else
    371         n=sqrt(M);
    372         m=n;
    373     end
    374 else
    375     n = cols;
    376     m = ceil(M/n);
    377 end
    378  
    379 array=-ones(buf+m*(sz+buf),buf+n*(sz+buf));
    380  
    381 if ~opt_graycolor
    382     array = 0.1.* array;
    383 end
    384  
    385  
    386 if ~opt_colmajor
    387     k=1;
    388     for i=1:m
    389         for j=1:n
    390             if k>M,
    391                 continue;
    392             end
    393             clim=max(abs(A(:,k)));
    394             if opt_normalize
    395                 array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;
    396             else
    397                 array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/max(abs(A(:)));
    398             end
    399             k=k+1;
    400         end
    401     end
    402 else
    403     k=1;
    404     for j=1:n
    405         for i=1:m
    406             if k>M,
    407                 continue;
    408             end
    409             clim=max(abs(A(:,k)));
    410             if opt_normalize
    411                 array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz)/clim;
    412             else
    413                 array(buf+(i-1)*(sz+buf)+(1:sz),buf+(j-1)*(sz+buf)+(1:sz))=reshape(A(:,k),sz,sz);
    414             end
    415             k=k+1;
    416         end
    417     end
    418 end
    419  
    420 if opt_graycolor
    421     h=imagesc(array,'EraseMode','none',[-1 1]);
    422 else
    423     h=imagesc(array,'EraseMode','none',[-1 1]);
    424 end
    425 axis image off
    426  
    427 drawnow;
    428  
    429 warning on all
  • 相关阅读:
    Codeforces-541div2
    动态规划-线性dp-hdu-4055
    动态规划_线性dp
    动态规划_背包问题笔记
    codeforces-1111
    数论模板
    codeforces-1114F-线段树练习
    2-sat
    拓扑排序
    强连通分量
  • 原文地址:https://www.cnblogs.com/alan-blog-TsingHua/p/10023893.html
Copyright © 2011-2022 走看看