  • Autoencoders and Sparsity (Part 1)

    An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses $y^{(i)} = x^{(i)}$.

    Here is an autoencoder:

    [Figure: Autoencoder636.png, a diagram of an autoencoder network]

    The autoencoder tries to learn a function $h_{W,b}(x) \approx x$. In other words, it is trying to learn an approximation to the identity function, so as to output $\hat{x}$ that is similar to $x$. The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data.

    Example and uses

    As a concrete example, suppose the inputs $x$ are the pixel intensity values from a $10 \times 10$ image (100 pixels) so $n=100$, and there are $s_2=50$ hidden units in layer $L_2$. Note that we also have $y \in \Re^{100}$. Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input. I.e., given only the vector of hidden unit activations $a^{(2)} \in \Re^{50}$, it must try to reconstruct the 100-pixel input $x$. If the input were completely random (say, each $x_i$ comes from an IID Gaussian independent of the other features) then this compression task would be very difficult. But if there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations. In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA's.
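
    The 100-50-100 network described above can be sketched in NumPy as follows. This is only an illustration under assumptions not stated in the text: sigmoid activations in both layers, small random initial weights, and made-up input data. The names `init_params` and `forward` are hypothetical helpers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hidden, rng):
    """Small random weights and zero biases (an assumed initialization)."""
    return {
        "W1": rng.normal(0.0, 0.01, size=(n_hidden, n_in)),  # input -> hidden
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.01, size=(n_in, n_hidden)),  # hidden -> output
        "b2": np.zeros(n_in),
    }

def forward(params, X):
    """X has shape (m, n). Returns hidden activations a2 (m, s2)
    and the reconstruction x_hat (m, n)."""
    a2 = sigmoid(X @ params["W1"].T + params["b1"])
    x_hat = sigmoid(a2 @ params["W2"].T + params["b2"])
    return a2, x_hat

rng = np.random.default_rng(0)
X = rng.random((5, 100))            # 5 made-up 10 x 10 "images", flattened
params = init_params(100, 50, rng)  # n = 100 inputs, s2 = 50 hidden units
a2, x_hat = forward(params, X)
print(a2.shape, x_hat.shape)        # (5, 50) (5, 100)
```

    Each 100-pixel input passes through a 50-dimensional bottleneck `a2` before being reconstructed, which is what forces the compressed representation.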

    Constraints

    Our argument above relied on the number of hidden units $s_2$ being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

    Recall that $a^{(2)}_j$ denotes the activation of hidden unit $j$ in the autoencoder. However, this notation doesn't make explicit what was the input $x$ that led to that activation. Thus, we will write $a^{(2)}_j(x)$ to denote the activation of this hidden unit when the network is given a specific input $x$. Further, let

    \begin{align}
    \hat\rho_j = \frac{1}{m} \sum_{i=1}^m \left[ a^{(2)}_j(x^{(i)}) \right]
    \end{align}

    be the average activation of hidden unit $j$ (averaged over the training set). We would like to (approximately) enforce the constraint

    \begin{align}
    \hat\rho_j = \rho,
    \end{align}

    where $\rho$ is a sparsity parameter, typically a small value close to zero (say $\rho = 0.05$). In other words, we would like the average activation of each hidden neuron $j$ to be close to 0.05 (say). To satisfy this constraint, the hidden unit's activations must mostly be near 0.
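
    As a small illustration of the definition above, the average activation of every hidden unit is just a column mean over the training set. The activation values below are made up, and `average_activation` is a hypothetical helper name.

```python
import numpy as np

def average_activation(a2):
    """a2 has shape (m, s2): row i holds the hidden activations for x^(i).
    Returns rho_hat with shape (s2,), one average per hidden unit."""
    return a2.mean(axis=0)

# Made-up activations for m = 4 inputs and s2 = 3 hidden units.
a2 = np.array([
    [0.02, 0.90, 0.10],
    [0.03, 0.80, 0.00],
    [0.10, 0.95, 0.05],
    [0.05, 0.75, 0.05],
])
rho = 0.05
rho_hat = average_activation(a2)
print(rho_hat)                       # approximately [0.05, 0.85, 0.05]
print(np.abs(rho_hat - rho) < 0.01)  # which units are close to the target
```

    Units 0 and 2 meet the target of 0.05, while unit 1 is active for almost every input and would violate the constraint.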

    To achieve this, we will add an extra penalty term to our optimization objective that penalizes $\hat\rho_j$ deviating significantly from $\rho$. Many choices of the penalty term will give reasonable results. We will choose the following:

    \begin{align}
    \sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}.
    \end{align}

    Here, $s_2$ is the number of neurons in the hidden layer, and the index $j$ is summing over the hidden units in our network. If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

    \begin{align}
    \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),
    \end{align}

    where ${\rm KL}(\rho || \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}$ is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean $\rho$ and a Bernoulli random variable with mean $\hat\rho_j$. KL-divergence is a standard function for measuring how different two distributions are.
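
    The penalty can be evaluated directly from its definition. The sketch below assumes every $\hat\rho_j$ lies strictly between 0 and 1 (true for sigmoid activations, which is an assumption here); `kl_bernoulli` and `sparsity_penalty` are hypothetical names and the $\hat\rho$ values are made up.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """KL(rho || rho_hat_j) between Bernoulli distributions, per hidden unit.
    Assumes 0 < rho < 1 and 0 < rho_hat_j < 1."""
    return (rho * np.log(rho / rho_hat)
            + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(rho, rho_hat):
    """Sum of the per-unit KL terms over the s2 hidden units."""
    return float(np.sum(kl_bernoulli(rho, rho_hat)))

rho = 0.05
rho_hat = np.array([0.05, 0.20, 0.85])  # made-up average activations
per_unit = kl_bernoulli(rho, rho_hat)
print(per_unit)
print(sparsity_penalty(rho, rho_hat))
```

    The first entry is zero because $\hat\rho_j = \rho$, and the terms grow as $\hat\rho_j$ moves further from $\rho$, which is the behaviour the penalty is chosen for.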

    Deviation, penalty

    Loss function

    Without the sparsity constraint, the network's loss function is:
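
    The formula image is missing from this copy. Assuming it is the usual squared-error cost with weight decay (an assumption, since the formula cannot be recovered from this text), with $\lambda$ the weight decay parameter, $n_l$ the number of layers and $s_l$ the number of units in layer $l$, it would read:

    \begin{align}
    J(W,b) = \left[ \frac{1}{m} \sum_{i=1}^m \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2
    \end{align}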

    With the sparsity constraint, the loss function is:

    \begin{align}
    J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho || \hat\rho_j),
    \end{align}

    where $J(W,b)$ is as defined previously, and $\beta$ controls the weight of the sparsity penalty term. The term $\hat\rho_j$ (implicitly) depends on $W,b$ also, because it is the average activation of hidden unit $j$, and the activation of a hidden unit depends on the parameters $W,b$.
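
    A minimal sketch of evaluating $J_{\rm sparse}$ follows. It assumes that $J(W,b)$ is the mean half squared reconstruction error plus a weight decay term, and that both layers use sigmoid activations; neither is stated in this text. The function name `sparse_cost`, the default values of `beta` and `lam`, and the data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_cost(W1, b1, W2, b2, X, rho=0.05, beta=3.0, lam=1e-4):
    """Returns (J_sparse, J) for an autoencoder with targets y = x."""
    m = X.shape[0]
    a2 = sigmoid(X @ W1.T + b1)      # hidden activations, shape (m, s2)
    x_hat = sigmoid(a2 @ W2.T + b2)  # reconstructions, shape (m, n)
    # Assumed form of J(W,b): squared error averaged over samples + weight decay.
    J = (0.5 * np.sum((x_hat - X) ** 2) / m
         + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2)))
    # rho_hat depends on W1 and b1 through the hidden activations.
    rho_hat = a2.mean(axis=0)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return J + beta * np.sum(kl), J

rng = np.random.default_rng(0)
X = rng.random((20, 8))              # made-up data: m = 20, n = 8
W1, b1 = rng.normal(0, 0.1, (4, 8)), np.zeros(4)
W2, b2 = rng.normal(0, 0.1, (8, 4)), np.zeros(8)
J_sparse, J = sparse_cost(W1, b1, W2, b2, X)
print(J, J_sparse)
```

    Because each KL term is non-negative, `J_sparse` is never smaller than `J`, and setting `beta=0.0` recovers the cost without the sparsity constraint.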

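    Computing the partial derivatives of the loss function

    Once sparsity is added, the error term of a hidden-layer node changes from the formula:

    to the formula: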
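
    Both formula images are missing from this copy. Assuming standard backpropagation notation that does not appear in this text ($\delta^{(l)}_i$ for the error term of unit $i$ in layer $l$, $z^{(2)}_i$ for the weighted input to hidden unit $i$, $f$ for the activation function, and $s_3$ for the number of output units), differentiating the KL penalty with respect to $\hat\rho_i$ suggests that the error term changes from

    \begin{align}
    \delta^{(2)}_i = \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i)
    \end{align}

    to

    \begin{align}
    \delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) + \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i).
    \end{align}

    The extra term is $\beta$ times the derivative of ${\rm KL}(\rho || \hat\rho_i)$ with respect to $\hat\rho_i$. Since it needs $\hat\rho_i$, a forward pass over the whole training set has to come before backpropagation on any single example.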

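    Solving with gradient descent

    Once we have the loss function and its partial derivatives, we can use gradient descent to find the network's optimal parameters. The whole procedure is as follows: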
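
    The procedure image is missing from this copy, so the following is only a sketch of batch gradient descent on $J_{\rm sparse}$, not the original procedure. It assumes sigmoid activations, a squared-error cost with weight decay for $J(W,b)$, and hypothetical hyperparameters (`alpha` for the learning rate, `lam` for weight decay, `steps`); `cost_and_grads` and `train` are made-up names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grads(p, X, rho, beta, lam):
    """Sparse autoencoder cost and its gradients for a batch X of shape (m, n)."""
    m = X.shape[0]
    W1, b1, W2, b2 = p["W1"], p["b1"], p["W2"], p["b2"]
    # Forward pass over the whole batch (needed to get rho_hat).
    a2 = sigmoid(X @ W1.T + b1)
    a3 = sigmoid(a2 @ W2.T + b2)
    rho_hat = a2.mean(axis=0)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    cost = (0.5 * np.sum((a3 - X) ** 2) / m
            + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
            + beta * np.sum(kl))
    # Backward pass; for the sigmoid, f'(z) = a * (1 - a).
    d3 = (a3 - X) * a3 * (1 - a3)
    sparse = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    d2 = (d3 @ W2 + sparse) * a2 * (1 - a2)
    grads = {
        "W1": d2.T @ X / m + lam * W1,
        "b1": d2.mean(axis=0),
        "W2": d3.T @ a2 / m + lam * W2,
        "b2": d3.mean(axis=0),
    }
    return cost, grads

def train(X, n_hidden, steps=200, alpha=0.5, rho=0.05, beta=0.1, lam=1e-4, seed=0):
    """Plain batch gradient descent; returns the parameters and the cost history."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    p = {
        "W1": rng.normal(0, 0.1, (n_hidden, n)), "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n, n_hidden)), "b2": np.zeros(n),
    }
    history = []
    for _ in range(steps):
        cost, grads = cost_and_grads(p, X, rho, beta, lam)
        history.append(cost)
        for name in p:
            p[name] -= alpha * grads[name]  # gradient descent update
    return p, history

X = np.random.default_rng(1).random((50, 8))  # made-up training data
params, history = train(X, n_hidden=4)
print(history[0], history[-1])
```

    Each iteration does one full forward pass, one backward pass with the sparsity-adjusted hidden error term, and one parameter update. In practice a more capable optimizer may be preferable to fixed-step gradient descent.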

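    From the formulas above we can see that computing the partial derivative of the loss function is really an accumulation: each incoming training sample adds one more term. This is because the loss function itself is the sum of the losses of the individual training samples, so by the sum rule of differentiation its partial derivative is likewise the sum of the partial derivatives of the per-sample losses. It follows that the order in which the training samples are fed to the network does not matter: the same operation is applied to every sample, and the result produced by a later sample does not depend on the result of an earlier one (it is a simple accumulation, and accumulation is order-independent).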
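
    The order-independence claim can be checked numerically. The sketch below accumulates per-sample gradients with respect to the first-layer weights in two different orders and compares the results. It assumes the batch setting described above, sigmoid activations and made-up data; `accumulate_grad_W1` is a hypothetical helper. With per-sample (stochastic) updates the parameters would change between samples, and the argument would no longer apply.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def accumulate_grad_W1(W1, b1, W2, b2, X, order, rho=0.05, beta=0.1):
    """Accumulate per-sample gradients w.r.t. W1, visiting samples in `order`."""
    # rho_hat needs every sample, so it is computed before the accumulation loop.
    rho_hat = sigmoid(X @ W1.T + b1).mean(axis=0)
    sparse = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    total = np.zeros_like(W1)
    for i in order:
        x = X[i]
        a2 = sigmoid(W1 @ x + b1)
        a3 = sigmoid(W2 @ a2 + b2)
        d3 = (a3 - x) * a3 * (1 - a3)              # output-layer error term
        d2 = (W2.T @ d3 + sparse) * a2 * (1 - a2)  # hidden-layer error term
        total += np.outer(d2, x)                   # one accumulation per sample
    return total / len(order)

rng = np.random.default_rng(0)
X = rng.random((30, 6))                            # made-up data: m = 30, n = 6
W1, b1 = rng.normal(0, 0.1, (3, 6)), np.zeros(3)
W2, b2 = rng.normal(0, 0.1, (6, 3)), np.zeros(6)

g_forward = accumulate_grad_W1(W1, b1, W2, b2, X, np.arange(30))
g_shuffled = accumulate_grad_W1(W1, b1, W2, b2, X, rng.permutation(30))
print(np.allclose(g_forward, g_shuffled))          # True
```

    The two results agree up to floating-point rounding, since the only effect of reordering is the order of the additions.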

    Reposted from: http://www.cnblogs.com/tornadomeet/archive/2013/03/19/2970101.html

  • Original article: https://www.cnblogs.com/sprint1989/p/3969857.html