Sparse Autoencoder (Part 2)

Gradient checking and advanced optimization

In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.

Suppose we want to minimize $\textstyle J(\theta)$ as a function of $\textstyle \theta$. For this example, suppose $\textstyle J : \Re \mapsto \Re$, so that $\textstyle \theta \in \Re$. In this 1-dimensional case, one iteration of gradient descent is given by

$\begin{align} \theta := \theta - \alpha \frac{d}{d\theta}J(\theta). \end{align}$

Suppose also that we have implemented some function $\textstyle g(\theta)$ that purportedly computes $\textstyle \frac{d}{d\theta}J(\theta)$, so that we implement gradient descent using the update $\textstyle \theta := \theta - \alpha g(\theta)$.

Recall the mathematical definition of the derivative as

$\begin{align} \frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0} \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2 \epsilon}. \end{align}$

Thus, at any specific value of $\textstyle \theta$, we can numerically approximate the derivative as follows:

$\begin{align} \frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}} \end{align}$

Given a function $\textstyle g(\theta)$ that is supposedly computing $\textstyle \frac{d}{d\theta}J(\theta)$, we can now numerically verify its correctness by checking that

$\begin{align} g(\theta) \approx \frac{J(\theta+{\rm EPSILON}) - J(\theta-{\rm EPSILON})}{2 \times {\rm EPSILON}}. \end{align}$

The degree to which these two values should approximate each other will depend on the details of $\textstyle J$. But assuming $\textstyle {\rm EPSILON} = 10^{-4}$, you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).
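The check above can be sketched in a few lines of Python. The objective `J` and its hand-derived derivative `g` below are hypothetical, chosen only for illustration; the helper name `numerical_derivative` is likewise an assumption, not part of the original tutorial.

```python
import math

EPSILON = 1e-4  # step size suggested in the text


def J(theta):
    """Hypothetical 1-D objective used only for this illustration."""
    return theta ** 3 + math.sin(theta)


def g(theta):
    """Hand-derived derivative of J; this is the function being verified."""
    return 3 * theta ** 2 + math.cos(theta)


def numerical_derivative(f, theta, eps=EPSILON):
    """Two-sided difference approximation of f'(theta)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)


theta = 1.5
analytic = g(theta)
numeric = numerical_derivative(J, theta)
# Relative error is a scale-independent way to compare the two values.
relative_error = abs(analytic - numeric) / max(abs(analytic), abs(numeric))
print(analytic, numeric, relative_error)
```

If `g` contained a mistake (for example a wrong sign on the cosine term), the relative error would jump by many orders of magnitude, which is what makes the check useful in practice.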

Now consider the case where $\textstyle \theta \in \Re^n$ is a vector rather than a single real number, so that $\textstyle J : \Re^n \mapsto \Re$. Suppose we have a function $\textstyle g_i(\theta)$ that purportedly computes $\textstyle \frac{\partial}{\partial \theta_i} J(\theta)$; we'd like to check whether $\textstyle g_i$ is outputting correct derivative values. Let $\textstyle \theta^{(i+)} = \theta + {\rm EPSILON} \times \vec{e}_i$, where

$\begin{align} \vec{e}_i = \begin{bmatrix}0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0\end{bmatrix} \end{align}$

is the $\textstyle i$-th basis vector (a vector of the same dimension as $\textstyle \theta$, with a "1" in the $\textstyle i$-th position and "0"s everywhere else). So, $\textstyle \theta^{(i+)}$ is the same as $\textstyle \theta$, except its $\textstyle i$-th element has been incremented by EPSILON. Similarly, let $\textstyle \theta^{(i-)} = \theta - {\rm EPSILON} \times \vec{e}_i$ be the corresponding vector with the $\textstyle i$-th element decreased by EPSILON. We can now numerically verify $\textstyle g_i(\theta)$'s correctness by checking, for each $\textstyle i$, that:

$\begin{align} g_i(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}. \end{align}$

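When the parameter is a vector, the correctness of each dimension can be verified separately by perturbing that one component while holding all the others fixed.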
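A minimal sketch of the per-coordinate check follows. The objective `J` and its gradient `grad_J` are made up for the example, and `numerical_gradient` is a hypothetical helper name; the loop simply builds the two perturbed vectors for each index in turn.

```python
EPSILON = 1e-4


def J(theta):
    """Hypothetical objective: sum of squares plus one cross term."""
    return sum(t * t for t in theta) + theta[0] * theta[1]


def grad_J(theta):
    """Hand-derived gradient of J; this is the quantity being checked."""
    grad = [2 * t for t in theta]
    grad[0] += theta[1]
    grad[1] += theta[0]
    return grad


def numerical_gradient(f, theta, eps=EPSILON):
    """Perturb one coordinate at a time, holding all others fixed."""
    grad = []
    for i in range(len(theta)):
        theta_plus = list(theta)
        theta_minus = list(theta)
        theta_plus[i] += eps    # i-th element incremented by eps
        theta_minus[i] -= eps   # i-th element decreased by eps
        grad.append((f(theta_plus) - f(theta_minus)) / (2 * eps))
    return grad


theta = [0.5, -1.0, 2.0]
analytic = grad_J(theta)
numeric = numerical_gradient(J, theta)
max_abs_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_abs_diff)
```

Note that this costs two evaluations of `J` per parameter, so it is suited to verifying an implementation on a small problem rather than to training.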

When implementing backpropagation to train a neural network, in a correct implementation we will have that

$\begin{align} \nabla_{W^{(l)}} J(W,b) &= \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \\ \nabla_{b^{(l)}} J(W,b) &= \frac{1}{m} \Delta b^{(l)}. \end{align}$

This result shows that the final block of pseudo-code in the Backpropagation Algorithm is indeed implementing gradient descent. To make sure your implementation of gradient descent is correct, it is usually very helpful to use the method described above to numerically compute the derivatives of $\textstyle J(W,b)$, and thereby verify that your computations of $\textstyle \left(\frac{1}{m}\Delta W^{(l)} \right) + \lambda W^{(l)}$ and $\textstyle \frac{1}{m}\Delta b^{(l)}$ are indeed giving the derivatives you want.
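To illustrate the two formulas without a full backpropagation implementation, the sketch below uses a single linear unit with a squared-error cost and weight decay as a stand-in for the network. The dataset, the value of `LAMBDA`, and all function names are assumptions made for the example. The accumulated sums `delta_W` and `delta_b` play the role of $\Delta W$ and $\Delta b$.

```python
EPSILON = 1e-4
LAMBDA = 0.1  # weight decay strength; arbitrary value for this demo

# Tiny made-up dataset: three examples with two features and a scalar target.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
Y = [1.0, 0.0, -1.0]


def predict(W, b, x):
    return sum(w * xi for w, xi in zip(W, x)) + b


def cost(W, b):
    """J(W,b): averaged squared-error term plus (lambda/2) * ||W||^2."""
    m = len(X)
    data_term = sum(0.5 * (predict(W, b, x) - y) ** 2 for x, y in zip(X, Y)) / m
    return data_term + 0.5 * LAMBDA * sum(w * w for w in W)


def analytic_gradients(W, b):
    """Accumulate Delta W and Delta b over the examples, then apply the formulas."""
    m = len(X)
    delta_W = [0.0] * len(W)
    delta_b = 0.0
    for x, y in zip(X, Y):
        err = predict(W, b, x) - y
        for k, xk in enumerate(x):
            delta_W[k] += err * xk
        delta_b += err
    grad_W = [dw / m + LAMBDA * w for dw, w in zip(delta_W, W)]
    grad_b = delta_b / m  # no weight decay on the bias
    return grad_W, grad_b


def numerical_gradients(W, b, eps=EPSILON):
    """Two-sided differences on each weight and on the bias."""
    num_W = []
    for k in range(len(W)):
        W_plus, W_minus = list(W), list(W)
        W_plus[k] += eps
        W_minus[k] -= eps
        num_W.append((cost(W_plus, b) - cost(W_minus, b)) / (2 * eps))
    num_b = (cost(W, b + eps) - cost(W, b - eps)) / (2 * eps)
    return num_W, num_b


W, b = [0.2, -0.3], 0.1
grad_W, grad_b = analytic_gradients(W, b)
num_W, num_b = numerical_gradients(W, b)
max_diff = max([abs(a - n) for a, n in zip(grad_W, num_W)] + [abs(grad_b - num_b)])
print(max_diff)
```

Dropping the `LAMBDA * w` term from `grad_W` makes `max_diff` clearly nonzero, which is the kind of omission this check is designed to catch.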

Autoencoders and Sparsity

An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses $\textstyle y^{(i)} = x^{(i)}$.

[Figure: an autoencoder network]

We will write $\textstyle a^{(2)}_j(x)$ to denote the activation of hidden unit $\textstyle j$ when the network is given a specific input $\textstyle x$. Further, let

$\begin{align} \hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right] \end{align}$

be the average activation of hidden unit $\textstyle j$ (averaged over the training set). We would like to (approximately) enforce the constraint

$\begin{align} \hat\rho_j = \rho, \end{align}$

where $\textstyle \rho$ is a sparsity parameter, typically a small value close to zero (say $\textstyle \rho = 0.05$). In other words, we would like the average activation of each hidden neuron $\textstyle j$ to be close to 0.05 (say). To satisfy this constraint, the hidden unit's activations must mostly be near 0.
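The average activation can be computed with one pass over the training set. The sketch below assumes a sigmoid activation function (so activations lie between 0 and 1); the weights, biases, inputs, and function names are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_activations(x, W1, b1):
    """Activations a^(2)_j(x) of every hidden unit for one input x."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W1, b1)]


def average_activations(X, W1, b1):
    """rho_hat_j: mean activation of hidden unit j over the training set."""
    m = len(X)
    totals = [0.0] * len(b1)
    for x in X:
        for j, a in enumerate(hidden_activations(x, W1, b1)):
            totals[j] += a
    return [t / m for t in totals]


# Made-up example: 3 training inputs of dimension 2, 2 hidden units.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[0.5, -0.5], [-2.0, -2.0]]
b1 = [0.0, -1.0]
rho_hat = average_activations(X, W1, b1)
print(rho_hat)
```

In this example the second unit has strongly negative weights and bias, so its average activation is much closer to zero than the first unit's.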

To achieve this, we will add an extra penalty term to our optimization objective that penalizes $\textstyle \hat\rho_j$ deviating significantly from $\textstyle \rho$. Many choices of the penalty term will give reasonable results. We will choose the following:

$\begin{align} \sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}. \end{align}$

Here, $\textstyle s_2$ is the number of neurons in the hidden layer, and the index $\textstyle j$ is summing over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

$\begin{align} \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j). \end{align}$
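As a small sketch, the penalty can be written directly from the formula above. The function names and the sample values of the average activations are assumptions made for the example.

```python
import math


def kl_bernoulli(rho, rho_hat):
    """One term of the penalty: rho*log(rho/rho_hat) + (1-rho)*log((1-rho)/(1-rho_hat))."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sparsity_penalty(rho, rho_hats):
    """Sum of the per-unit terms over all hidden units."""
    return sum(kl_bernoulli(rho, r) for r in rho_hats)


rho = 0.05
on_target = kl_bernoulli(rho, 0.05)   # average activation equals the target
off_target = kl_bernoulli(rho, 0.2)   # average activation is too high
penalty = sparsity_penalty(rho, [0.05, 0.2, 0.5])
print(on_target, off_target, penalty)
```

Each term is zero when the average activation equals the target and grows as the two move apart, so minimizing the sum pushes every hidden unit's average activation toward the sparsity parameter.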

Our overall cost function is now

$\begin{align} J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j), \end{align}$

where $\textstyle J(W,b)$ is as defined previously, and $\textstyle \beta$ controls the weight of the sparsity penalty term. The term $\textstyle \hat\rho_j$ (implicitly) depends on $\textstyle W,b$ also, because it is the average activation of hidden unit $\textstyle j$, and the activation of a hidden unit depends on the parameters $\textstyle W,b$.

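To incorporate the sparsity penalty into the derivative calculation, the backpropagation error term for the hidden layer gains one extra term and becomes

$\begin{align} \delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) + \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i), \end{align}$

where the sum runs over the $\textstyle s_3$ units of the output layer.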
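The sparsity term inside the parentheses is $\beta$ times the derivative of ${\rm KL}(\rho || \hat\rho_i)$ with respect to $\hat\rho_i$. The sketch below confirms this numerically using the two-sided difference from the gradient checking section; the values of `RHO`, `BETA`, and `rho_hat`, and the function names, are arbitrary choices for the example.

```python
import math

EPSILON = 1e-4
RHO = 0.05   # sparsity parameter
BETA = 3.0   # weight of the sparsity penalty; arbitrary value for this demo


def kl_bernoulli(rho, rho_hat):
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sparsity_delta_term(rho, rho_hat, beta):
    """The extra term added to the hidden-layer error: beta * dKL/d(rho_hat)."""
    return beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))


rho_hat = 0.2
analytic = sparsity_delta_term(RHO, rho_hat, BETA)
numeric = BETA * (kl_bernoulli(RHO, rho_hat + EPSILON)
                  - kl_bernoulli(RHO, rho_hat - EPSILON)) / (2 * EPSILON)
print(analytic, numeric)
```

The term is positive when the average activation is above the target and negative when it is below, so it nudges the hidden unit toward the desired average activation.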

Visualizing a Trained Autoencoder

Consider the case of training an autoencoder on $\textstyle 10 \times 10$ images, so that $\textstyle n = 100$. Each hidden unit $\textstyle i$ computes a function of the input:

$\begin{align} a^{(2)}_i = f\left(\sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \right). \end{align}$

We will visualize the function computed by hidden unit $\textstyle i$, which depends on the parameters $\textstyle W^{(1)}_{ij}$ (ignoring the bias term for now), using a 2D image. In particular, we think of $\textstyle a^{(2)}_i$ as some non-linear feature of the input $\textstyle x$.

If we suppose that the input is norm constrained by $\textstyle ||x||^2 = \sum_{i=1}^{100} x_i^2 \leq 1$, then one can show (try doing this yourself) that the input which maximally activates hidden unit $\textstyle i$ is given by setting pixel $\textstyle x_j$ (for all 100 pixels, $\textstyle j=1,\ldots, 100$) to

$\begin{align} x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{k=1}^{100} (W^{(1)}_{ik})^2}}. \end{align}$

By displaying the image formed by these pixel intensity values, we can begin to understand what feature hidden unit $\textstyle i$ is looking for.
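The formula amounts to rescaling the unit's weight vector to unit length. The sketch below uses a made-up weight vector with 4 inputs instead of 100, ignores the bias as in the text, and assumes the activation function is monotonically increasing (as the sigmoid is), so that maximizing the weighted sum also maximizes the activation. The function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def max_activating_input(weights):
    """Unit-norm input that maximizes the weighted sum for one hidden unit."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


# Made-up incoming weights of one hidden unit (4 inputs rather than 100).
W_i = [0.3, -0.4, 1.2, 0.0]
x_star = max_activating_input(W_i)

# Another input with the same unit norm, for comparison.
x_other = [0.5, 0.5, 0.5, 0.5]

best = dot(W_i, x_star)    # equals the norm of W_i
other = dot(W_i, x_other)
print(x_star, best, other)
```

For a $10 \times 10$ image, the 100 values of `x_star` would be reshaped into a 10-by-10 grid and displayed as pixel intensities.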

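When an autoencoder is applied to an image, hidden units in the earlier layers typically capture low-level features such as edges, while hidden units in later layers capture features with deeper semantic meaning.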


Original article: https://www.cnblogs.com/sprint1989/p/3979296.html
