Source: http://deeplearning.net/tutorial/mlp.html#mlp
Multilayer Perceptron
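Note: This section assumes the reader has already worked through the previous exercise, Classifying MNIST digits using Logistic Regression (http://blog.csdn.net/shouhuxianjian/article/details/46375461). It also uses new Theano functions and concepts: T.tanh, shared variables, basic arithmetic ops, T.grad, L1 and L2 regularization, floatX. If you intend to run the code on a GPU, also read GPU.
Note: The code for this section is available for download here.
The next architecture we present using Theano is the single-hidden-layer Multi-Layer Perceptron (MLP). An MLP can be viewed as a logistic regression classifier whose input is first transformed by a learned nonlinear transformation $\Phi$. This transformation projects the input data into a space where the classes become linearly separable. This intermediate layer is referred to as a hidden layer. A single hidden layer is sufficient to make MLPs a universal approximator. However, we will see later that there are substantial benefits to using many such hidden layers, which is the very premise of deep learning (that is, networks with more than one hidden layer). See these course notes for an Introduction to MLPs, the back-propagation algorithm, and how to train MLPs.
This tutorial again uses the MNIST digit classification task.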
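1. The Model
An MLP (or Artificial Neural Network, ANN) with a single hidden layer can be represented graphically as follows (figure not reproduced here):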
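Formally, a one-hidden-layer MLP is a function $f: R^D \rightarrow R^L$, where $D$ is the size of the input vector $x$ and $L$ is the size of the output vector $f(x)$, such that, in matrix notation:

$$f(x) = G(b^{(2)} + W^{(2)}(s(b^{(1)} + W^{(1)} x)))$$

with bias vectors $b^{(1)}$, $b^{(2)}$, weight matrices $W^{(1)}$, $W^{(2)}$, and activation functions $G$ and $s$.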
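The vector $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$ constitutes the hidden layer. $W^{(1)} \in R^{D \times D_h}$ is the weight matrix connecting the input vector to the hidden layer. Each column $W^{(1)}_{\cdot i}$ represents the weights from the input units to the i-th hidden unit. Typical choices for $s$ include tanh, with $tanh(a) = (e^a - e^{-a})/(e^a + e^{-a})$, or the logistic sigmoid function, with $sigmoid(a) = 1/(1 + e^{-a})$. We use tanh in this tutorial because it typically trains faster (and sometimes also reaches better local minima). Both tanh and sigmoid are scalar-to-scalar functions, but they extend naturally to vectors and tensors by being applied element-wise (e.g. separately on each element of a vector, yielding a vector of the same size).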
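The output vector is then obtained as $o(x) = G(b^{(2)} + W^{(2)} h(x))$, where $h(x)$ is the hidden-layer vector. The reader should recognize this form from the previous exercise (Theano 3.3: the logistic regression exercise). As before, class-membership probabilities can be obtained by choosing $G$ to be the softmax function (in the case of multi-class classification).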
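The forward computation described above can be sketched in plain Python, without Theano. This is only an illustration under stated assumptions: the helper names (`softmax`, `affine`, `mlp_forward`) and the toy weights are hypothetical and not part of the tutorial code, and each unit's weights are stored as one row so that `b + W x` can be computed directly.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, x):
    """Compute b + W x, where W is stored as a list of rows (one row per unit)."""
    return [b_i + sum(w_ij * x_j for w_ij, x_j in zip(row, x))
            for row, b_i in zip(W, b)]


def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: o(x) = softmax(b2 + W2 tanh(b1 + W1 x))."""
    h = [math.tanh(a) for a in affine(W1, b1, x)]  # hidden layer h(x)
    return softmax(affine(W2, b2, h))              # class probabilities o(x)


# Toy sizes (illustrative only): 3 inputs, 2 hidden units, 2 classes.
W1 = [[0.1, -0.2, 0.3],
      [0.0, 0.4, -0.1]]
b1 = [0.0, 0.1]
W2 = [[0.5, -0.5],
      [-0.3, 0.2]]
b2 = [0.0, 0.0]

probs = mlp_forward([1.0, 2.0, 3.0], W1, b1, W2, b2)
```

Here `probs` holds one probability per class; the predicted class would be the index of the largest entry.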
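To train an MLP, we learn all parameters of the model, and here we use Stochastic Gradient Descent with minibatches. The set of parameters to learn is $\theta = \{W^{(2)}, b^{(2)}, W^{(1)}, b^{(1)}\}$. The gradients $\partial \ell / \partial \theta$ can be obtained through the backpropagation algorithm (a special case of the chain rule of derivation). Fortunately, since Theano performs automatic differentiation, we do not need to cover the derivation in this tutorial.
2. Going from Logistic Regression to MLP
This tutorial focuses on a single-hidden-layer MLP. We start by implementing a class that represents a hidden layer. To construct the MLP, we then only need to put a logistic regression layer on top: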
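class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """
        Typical hidden layer of a MLP: units are fully-connected and have
        sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
        :param activation: Non linearity to be applied in the hidden
                           layer
        """
        self.input = input

The initial values for the weights of hidden layer i should be sampled uniformly from a symmetric interval that depends on the activation function. For the tanh activation function, the results obtained in [Xavier10] show that the interval should be $[-\sqrt{\frac{6}{fan_{in}+fan_{out}}}, \sqrt{\frac{6}{fan_{in}+fan_{out}}}]$, where $fan_{in}$ is the number of units in the (i-1)-th layer and $fan_{out}$ is the number of units in the i-th layer. For the sigmoid function the interval is $[-4\sqrt{\frac{6}{fan_{in}+fan_{out}}}, 4\sqrt{\frac{6}{fan_{in}+fan_{out}}}]$. This initialization ensures that, early in training, each neuron operates in a regime of its activation function where information can easily be propagated both upward (activations flowing from inputs to outputs) and backward (gradients flowing from outputs to inputs):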
        # `W` is initialized with `W_values` which is uniformly sampled
        # from -sqrt(6./(n_in+n_hidden)) to sqrt(6./(n_in+n_hidden))
        # for tanh activation function
        # the output of uniform is converted using asarray to dtype
        # theano.config.floatX so that the code is runnable on GPU
        # Note : optimal initialization of weights is dependent on the
        #        activation function used (among other things).
        #        For example, results presented in [Xavier10] suggest that you
        #        should use 4 times larger initial weights for sigmoid
        #        compared to tanh
        #        We have no info for other function, so we use the same as
        #        tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

Note that we used a given nonlinear function as the activation function of the hidden layer. By default this is tanh, but in many cases we might want to use something else:
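        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )

In terms of the theory, this class implements the graph that computes the hidden-layer value $h(x) = \Phi(x) = s(b^{(1)} + W^{(1)} x)$. If you give this graph as input to the LogisticRegression class implemented in the previous tutorial, you get the output of the MLP. You can see this in the following short implementation of the MLP class: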
class MLP(object):
    """Multi-Layer Perceptron Class

    A multilayer perceptron is a feedforward artificial neural network model
    that has one layer or more of hidden units and nonlinear activations.
    Intermediate layers usually have as activation function tanh or the
    sigmoid function (defined here by a ``HiddenLayer`` class) while the
    top layer is a softmax layer (defined here by a ``LogisticRegression``
    class).
    """

    def __init__(self, rng, input, n_in, n_hidden, n_out):
        """Initialize the parameters for the multilayer perceptron

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
        architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
        which the datapoints lie

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
        which the labels lie

        """

        # Since we are dealing with a one hidden layer MLP, this will translate
        # into a HiddenLayer with a tanh activation function connected to the
        # LogisticRegression layer; the activation function can be replaced by
        # sigmoid or any other nonlinear function
        self.hiddenLayer = HiddenLayer(
            rng=rng,
            input=input,
            n_in=n_in,
            n_out=n_hidden,
            activation=T.tanh
        )

        # The logistic regression layer gets as input the hidden units
        # of the hidden layer
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output,
            n_in=n_hidden,
            n_out=n_out
        )

In this tutorial we will also use L1 and L2 regularization (see L1 and L2 regularization). For this, we need to compute the L1 norm and the squared L2 norm of the weights $W^{(1)}, W^{(2)}$:
        # L1 norm ; one regularization option is to enforce L1 norm to
        # be small
        self.L1 = (
            abs(self.hiddenLayer.W).sum()
            + abs(self.logRegressionLayer.W).sum()
        )

        # square of L2 norm ; one regularization option is to enforce
        # square of L2 norm to be small
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum()
            + (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
        # log likelihood of the output of the model, computed in the
        # logistic regression layer
        self.negative_log_likelihood = (
            self.logRegressionLayer.negative_log_likelihood
        )
        # same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layers it
        # is made out of
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params

As before, we train this model using stochastic gradient descent with minibatches (MSGD). The difference is that we modify the cost function to include the regularization terms. L1_reg and L2_reg are the hyperparameters controlling the weight of these regularization terms in the total cost function. The code that computes the new cost is:
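    # the cost we minimize during training is the negative log likelihood of
    # the model plus the regularization terms (L1 and L2); cost is expressed
    # here symbolically
    cost = (
        classifier.negative_log_likelihood(y)
        + L1_reg * classifier.L1
        + L2_reg * classifier.L2_sqr
    )

We then update the parameters of the model using the gradient. This code is almost identical to the one for logistic regression; only the number of parameters differs. To get around this (and write code that works for any number of parameters), we take the list of parameters that we created with the model, params, and iterate over it, computing one gradient per parameter: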
    # compute the gradient of cost with respect to theta (stored in params)
    # the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs

    # given two lists of the same length, A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list C of same size, where each
    # element is a pair formed from the two lists :
    #    C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
    # at the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
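The zip-based update rule can be tried out without Theano on a toy problem. The sketch below is an assumption-laden illustration: `sgd_step` and `toy_grads` are hypothetical names, the parameters are plain floats rather than shared variables, and the gradient is written by hand instead of being obtained from T.grad.

```python
def sgd_step(params, grads, learning_rate):
    """Return updated values, pairing each parameter with its gradient via zip."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def toy_grads(params):
    """Hand-written gradient of the toy cost (w - 3)^2 + (b + 1)^2."""
    w, b = params
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]


params = [0.0, 0.0]
for _ in range(200):
    params = sgd_step(params, toy_grads(params), learning_rate=0.1)
```

Because the update is expressed over a list, the same `sgd_step` would work unchanged for any number of parameters, which is the point of the pattern used in the tutorial code.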
3. Putting It All Together
Having covered the basic concepts, writing an MLP class becomes quite easy. The code below shows how this can be done, in a way analogous to our previous logistic regression implementation:

"""
This tutorial introduces the multilayer perceptron using Theano.

 A multilayer perceptron is a logistic regressor where
instead of feeding the input to the logistic regression you insert an
intermediate layer, called the hidden layer, that has a nonlinear
activation function (usually tanh or sigmoid) . One can use many such
hidden layers making the architecture deep. The tutorial will also tackle
the problem of MNIST digit classification.

.. math::

    f(x) = G( b^{(2)} + W^{(2)}( s( b^{(1)} + W^{(1)} x))),

References:

    - textbooks: "Pattern Recognition and Machine Learning" -
                 Christopher M. Bishop, section 5

"""
__docformat__ = 'restructedtext en'


import os
import sys
import time

import numpy

import theano
import theano.tensor as T


from logistic_sgd import LogisticRegression, load_data


# start-snippet-1
class HiddenLayer(object):
    def __init__(self, rng, input, n_in, n_out, W=None, b=None,
                 activation=T.tanh):
        """
        Typical hidden layer of a MLP: units are fully-connected and have
        sigmoidal activation function. Weight matrix W is of shape (n_in,n_out)
        and the bias vector b is of shape (n_out,).

        NOTE : The nonlinearity used here is tanh

        Hidden unit activation is given by: tanh(dot(input,W) + b)

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.dmatrix
        :param input: a symbolic tensor of shape (n_examples, n_in)

        :type n_in: int
        :param n_in: dimensionality of input

        :type n_out: int
        :param n_out: number of hidden units

        :type activation: theano.Op or function
        :param activation: Non linearity to be applied in the hidden
                           layer
        """
        self.input = input
        # end-snippet-1

        # `W` is initialized with `W_values` which is uniformly sampled
        # from -sqrt(6./(n_in+n_hidden)) to sqrt(6./(n_in+n_hidden))
        # for tanh activation function
        # the output of uniform is converted using asarray to dtype
        # theano.config.floatX so that the code is runnable on GPU
        # Note : optimal initialization of weights is dependent on the
        #        activation function used (among other things).
        #        For example, results presented in [Xavier10] suggest that you
        #        should use 4 times larger initial weights for sigmoid
        #        compared to tanh
        #        We have no info for other function, so we use the same as
        #        tanh.
        if W is None:
            W_values = numpy.asarray(
                rng.uniform(
                    low=-numpy.sqrt(6. / (n_in + n_out)),
                    high=numpy.sqrt(6. / (n_in + n_out)),
                    size=(n_in, n_out)
                ),
                dtype=theano.config.floatX
            )
            if activation == theano.tensor.nnet.sigmoid:
                W_values *= 4

            W = theano.shared(value=W_values, name='W', borrow=True)

        if b is None:
            b_values = numpy.zeros((n_out,), dtype=theano.config.floatX)
            b = theano.shared(value=b_values, name='b', borrow=True)

        self.W = W
        self.b = b

        lin_output = T.dot(input, self.W) + self.b
        self.output = (
            lin_output if activation is None
            else activation(lin_output)
        )
        # parameters of the model
        self.params = [self.W, self.b]


# start-snippet-2
class MLP(object):
    """Multi-Layer Perceptron Class

    A multilayer perceptron is a feedforward artificial neural network model
    that has one layer or more of hidden units and nonlinear activations.
    Intermediate layers usually have as activation function tanh or the
    sigmoid function (defined here by a ``HiddenLayer`` class) while the
    top layer is a softmax layer (defined here by a ``LogisticRegression``
    class).
    """

    def __init__(self, rng, input, n_in, n_hidden, n_out):
        """Initialize the parameters for the multilayer perceptron

        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights

        :type input: theano.tensor.TensorType
        :param input: symbolic variable that describes the input of the
        architecture (one minibatch)

        :type n_in: int
        :param n_in: number of input units, the dimension of the space in
        which the datapoints lie

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type n_out: int
        :param n_out: number of output units, the dimension of the space in
        which the labels lie

        """

        # Since we are dealing with a one hidden layer MLP, this will translate
        # into a HiddenLayer with a tanh activation function connected to the
        # LogisticRegression layer; the activation function can be replaced by
        # sigmoid or any other nonlinear function
        self.hiddenLayer = HiddenLayer(
            rng=rng,
            input=input,
            n_in=n_in,
            n_out=n_hidden,
            activation=T.tanh
        )

        # The logistic regression layer gets as input the hidden units
        # of the hidden layer
        self.logRegressionLayer = LogisticRegression(
            input=self.hiddenLayer.output,
            n_in=n_hidden,
            n_out=n_out
        )
        # end-snippet-2 start-snippet-3
        # L1 norm ; one regularization option is to enforce L1 norm to
        # be small
        self.L1 = (
            abs(self.hiddenLayer.W).sum()
            + abs(self.logRegressionLayer.W).sum()
        )

        # square of L2 norm ; one regularization option is to enforce
        # square of L2 norm to be small
        self.L2_sqr = (
            (self.hiddenLayer.W ** 2).sum()
            + (self.logRegressionLayer.W ** 2).sum()
        )

        # negative log likelihood of the MLP is given by the negative
        # log likelihood of the output of the model, computed in the
        # logistic regression layer
        self.negative_log_likelihood = (
            self.logRegressionLayer.negative_log_likelihood
        )
        # same holds for the function computing the number of errors
        self.errors = self.logRegressionLayer.errors

        # the parameters of the model are the parameters of the two layers it
        # is made out of
        self.params = self.hiddenLayer.params + self.logRegressionLayer.params
        # end-snippet-3


def test_mlp(learning_rate=0.01, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000,
             dataset='mnist.pkl.gz', batch_size=20, n_hidden=500):
    """
    Demonstrate stochastic gradient descent optimization for a multilayer
    perceptron

    This is demonstrated on MNIST.

    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic
    gradient)

    :type L1_reg: float
    :param L1_reg: L1-norm's weight when added to the cost (see
    regularization)

    :type L2_reg: float
    :param L2_reg: L2-norm's weight when added to the cost (see
    regularization)

    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer

    :type dataset: string
    :param dataset: the path of the MNIST dataset file from
                 http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz

    """
    datasets = load_data(dataset)

    train_set_x, train_set_y = datasets[0]
    valid_set_x, valid_set_y = datasets[1]
    test_set_x, test_set_y = datasets[2]

    # compute number of minibatches for training, validation and testing
    n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
    n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size
    n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size

    ######################
    # BUILD ACTUAL MODEL #
    ######################
    print '... building the model'

    # allocate symbolic variables for the data
    index = T.lscalar()  # index to a [mini]batch
    x = T.matrix('x')  # the data is presented as rasterized images
    y = T.ivector('y')  # the labels are presented as 1D vector of
                        # [int] labels

    rng = numpy.random.RandomState(1234)

    # construct the MLP class
    classifier = MLP(
        rng=rng,
        input=x,
        n_in=28 * 28,
        n_hidden=n_hidden,
        n_out=10
    )

    # start-snippet-4
    # the cost we minimize during training is the negative log likelihood of
    # the model plus the regularization terms (L1 and L2); cost is expressed
    # here symbolically
    cost = (
        classifier.negative_log_likelihood(y)
        + L1_reg * classifier.L1
        + L2_reg * classifier.L2_sqr
    )
    # end-snippet-4

    # compiling a Theano function that computes the mistakes that are made
    # by the model on a minibatch
    test_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: test_set_x[index * batch_size:(index + 1) * batch_size],
            y: test_set_y[index * batch_size:(index + 1) * batch_size]
        }
    )

    validate_model = theano.function(
        inputs=[index],
        outputs=classifier.errors(y),
        givens={
            x: valid_set_x[index * batch_size:(index + 1) * batch_size],
            y: valid_set_y[index * batch_size:(index + 1) * batch_size]
        }
    )

    # start-snippet-5
    # compute the gradient of cost with respect to theta (stored in params)
    # the resulting gradients will be stored in a list gparams
    gparams = [T.grad(cost, param) for param in classifier.params]

    # specify how to update the parameters of the model as a list of
    # (variable, update expression) pairs

    # given two lists of the same length, A = [a1, a2, a3, a4] and
    # B = [b1, b2, b3, b4], zip generates a list C of same size, where each
    # element is a pair formed from the two lists :
    #    C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)]
    updates = [
        (param, param - learning_rate * gparam)
        for param, gparam in zip(classifier.params, gparams)
    ]

    # compiling a Theano function `train_model` that returns the cost, but
    # at the same time updates the parameters of the model based on the rules
    # defined in `updates`
    train_model = theano.function(
        inputs=[index],
        outputs=cost,
        updates=updates,
        givens={
            x: train_set_x[index * batch_size: (index + 1) * batch_size],
            y: train_set_y[index * batch_size: (index + 1) * batch_size]
        }
    )
    # end-snippet-5

    ###############
    # TRAIN MODEL #
    ###############
    print '... training'

    # early-stopping parameters
    patience = 10000  # look at this many examples regardless
    patience_increase = 2  # wait this much longer when a new best is
                           # found
    improvement_threshold = 0.995  # a relative improvement of this much is
                                   # considered significant
    validation_frequency = min(n_train_batches, patience / 2)
                                  # go through this many
                                  # minibatches before checking the network
                                  # on the validation set; in this case we
                                  # check every epoch

    best_validation_loss = numpy.inf
    best_iter = 0
    test_score = 0.
    start_time = time.clock()

    epoch = 0
    done_looping = False

    while (epoch < n_epochs) and (not done_looping):
        epoch = epoch + 1
        for minibatch_index in xrange(n_train_batches):

            minibatch_avg_cost = train_model(minibatch_index)
            # iteration number
            iter = (epoch - 1) * n_train_batches + minibatch_index

            if (iter + 1) % validation_frequency == 0:
                # compute zero-one loss on validation set
                validation_losses = [validate_model(i) for i
                                     in xrange(n_valid_batches)]
                this_validation_loss = numpy.mean(validation_losses)

                print(
                    'epoch %i, minibatch %i/%i, validation error %f %%' %
                    (
                        epoch,
                        minibatch_index + 1,
                        n_train_batches,
                        this_validation_loss * 100.
                    )
                )

                # if we got the best validation score until now
                if this_validation_loss < best_validation_loss:
                    # improve patience if loss improvement is good enough
                    if (
                        this_validation_loss < best_validation_loss *
                        improvement_threshold
                    ):
                        patience = max(patience, iter * patience_increase)

                    best_validation_loss = this_validation_loss
                    best_iter = iter

                    # test it on the test set
                    test_losses = [test_model(i) for i
                                   in xrange(n_test_batches)]
                    test_score = numpy.mean(test_losses)

                    print(('     epoch %i, minibatch %i/%i, test error of '
                           'best model %f %%') %
                          (epoch, minibatch_index + 1, n_train_batches,
                           test_score * 100.))

            if patience <= iter:
                done_looping = True
                break

    end_time = time.clock()
    print(('Optimization complete. Best validation score of %f %% '
           'obtained at iteration %i, with test performance %f %%') %
          (best_validation_loss * 100., best_iter + 1, test_score * 100.))
    print >> sys.stderr, ('The code for file ' +
                          os.path.split(__file__)[1] +
                          ' ran for %.2fm' % ((end_time - start_time) / 60.))


if __name__ == '__main__':
    test_mlp()
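The script calls test_mlp() when executed directly, so it can be run from the command line with the Python interpreter (for example `python mlp.py`, assuming the listing is saved as mlp.py).
The output ends with lines of the following form, as produced by the final print statements of the script (the angle-bracket placeholders stand for the measured values):

Optimization complete. Best validation score of <validation error> % obtained at iteration <iteration>, with test performance <test error> %
The code for file mlp.py ran for <minutes>m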
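On an Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz, the code runs at approximately 10.3 epochs per minute, and it took 828 epochs to reach a test error of 1.65%. To put the MNIST results in context, the reader is encouraged to consult this page, which compares the results of different algorithms.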
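The patience-based early stopping used in the training loop can be isolated in a small sketch. This is a simplification under stated assumptions, not the tutorial's loop: the function name `run_early_stopping` is hypothetical, validation losses are supplied as a precomputed list with one value per epoch, and patience is only checked at the end of each epoch rather than after every minibatch.

```python
def run_early_stopping(validation_losses, n_train_batches, patience=10,
                       patience_increase=2, improvement_threshold=0.995):
    """Replay the patience rule over per-epoch validation losses.

    Returns (best_loss, best_iter, epochs_run).
    """
    best_loss = float('inf')
    best_iter = 0
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        epochs_run = epoch
        # index of the last minibatch iteration of this epoch
        it = epoch * n_train_batches - 1
        if loss < best_loss:
            # extend patience only when the improvement is significant
            if loss < best_loss * improvement_threshold:
                patience = max(patience, it * patience_increase)
            best_loss = loss
            best_iter = it
        if patience <= it:
            break  # out of patience: stop training
    return best_loss, best_iter, epochs_run


# The third loss improves on the second by less than 0.5%, so it becomes the
# new best but does not extend patience; training stops one epoch later.
result = run_early_stopping([0.10, 0.05, 0.0499, 0.05, 0.05, 0.05],
                            n_train_batches=5)
```

The design choice worth noting is that a marginal improvement still updates the best model, but only a relative improvement beyond `improvement_threshold` buys more training time.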
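4. Tips and Tricks for Training MLPs
There are several hyperparameters in the above code that are not (and, generally speaking, cannot be) optimized by gradient descent. Strictly speaking, finding an optimal set of values for these hyperparameters is not a feasible problem. First, we cannot simply optimize each of them independently. Second, we cannot readily apply the gradient techniques described previously (partly because some parameters are discrete values and others are real-valued). Third, the optimization problem is not convex, and finding a (local) minimum would involve a non-trivial amount of work. The good news is that over the last 25 years, researchers have devised various rules of thumb for choosing hyperparameters in a neural network. A very good overview of these tricks can be found in Efficient BackProp by Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller. Here, we summarize the same issues, with an emphasis on the parameters and techniques that we actually used in our code.
Nonlinearity
Two of the most common choices are the sigmoid and the tanh function. For reasons explained in Section 4.4, nonlinearities that are symmetric around the origin are preferred because they tend to produce zero-mean inputs to the next layer (which is a desirable property). Empirically, we have observed that tanh has better convergence properties. (Translator's note: as of 2015 there are other activation functions such as ReLU and PReLU; interested readers may want to look into them.)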
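A quick numerical check can make the symmetry argument concrete. The sketch below is illustrative only: the sample points are arbitrary, and `sigmoid` is a hand-written helper rather than the Theano function. It shows that tanh outputs average to about zero on inputs symmetric around 0, while sigmoid outputs average to about 0.5, and that the two functions are related by tanh(a) = 2 sigmoid(2a) - 1.

```python
import math


def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


# Inputs placed symmetrically around the origin (arbitrary illustrative values).
points = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]

mean_tanh = sum(math.tanh(a) for a in points) / len(points)
mean_sigmoid = sum(sigmoid(a) for a in points) / len(points)

# tanh is a rescaled, recentred sigmoid: tanh(a) = 2 * sigmoid(2a) - 1.
max_identity_gap = max(abs(math.tanh(a) - (2.0 * sigmoid(2.0 * a) - 1.0))
                       for a in points)
```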
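Weight initialization
At initialization we want the weights to be small enough around the origin so that the activation function operates in its linear regime (as a plot of the sigmoid shows, it is approximately linear near 0), where gradients are the largest. Other desirable properties, especially for deep networks, are to conserve the variance of the activations as well as the variance of back-propagated gradients from layer to layer. This allows information to flow well upward and downward in the network and reduces discrepancies between layers. Under some assumptions, a compromise between these two constraints leads to the following initialization: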
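tanh initialization: $uniform[-\sqrt{\frac{6}{fan_{in}+fan_{out}}}, \sqrt{\frac{6}{fan_{in}+fan_{out}}}]$
sigmoid initialization: $uniform[-4\sqrt{\frac{6}{fan_{in}+fan_{out}}}, 4\sqrt{\frac{6}{fan_{in}+fan_{out}}}]$
Here $fan_{in}$ is the number of inputs and $fan_{out}$ is the number of hidden units. For the mathematical considerations, see [Xavier10].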
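The two intervals can be computed and sampled with the standard library alone. This is a minimal sketch, assuming hypothetical helper names (`init_bound`, `init_weights`) and Python's `random.Random` in place of numpy.random.RandomState; it mirrors the rule in the tutorial code, where the sigmoid bound is four times the tanh bound.

```python
import math
import random


def init_bound(fan_in, fan_out, activation='tanh'):
    """Half-width of the symmetric sampling interval for the given activation."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    if activation == 'sigmoid':
        bound *= 4.0  # sigmoid uses an interval four times wider
    return bound


def init_weights(fan_in, fan_out, activation='tanh', seed=1234):
    """Sample a fan_in x fan_out weight matrix uniformly from [-bound, bound]."""
    rng = random.Random(seed)
    bound = init_bound(fan_in, fan_out, activation)
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)]
            for _ in range(fan_in)]


# MNIST-sized hidden layer from the tutorial: 784 inputs, 500 hidden units.
tanh_bound = init_bound(784, 500)               # roughly 0.068
sigmoid_bound = init_bound(784, 500, 'sigmoid')
W = init_weights(4, 3)                          # small matrix for inspection
```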
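Learning rate
There is a great deal of literature on choosing a good learning rate. The simplest solution is to use a constant rate. Rule of thumb: try several log-spaced values ($10^{-1}, 10^{-2}, \ldots$) and narrow the (logarithmic) grid search to the region where you obtain the lowest validation error.
Decreasing the learning rate over time is sometimes a good idea. One simple rule for doing so is $\frac{\mu_0}{1 + d \times t}$, where $\mu_0$ is the initial rate (chosen, perhaps, using the grid search technique explained above), $d$ is a so-called "decrease constant" which controls the rate at which the learning rate decreases (typically a small positive number, $10^{-3}$ or smaller), and $t$ is the epoch/stage.
Section 4.7 details procedures for choosing a learning rate for each parameter (weight) in the network and for choosing them adaptively based on the error of the classifier.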
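The decay rule is easy to tabulate. The following sketch assumes a hypothetical function name and illustrative values for the initial rate and decrease constant; it simply evaluates the initial rate divided by (1 + d * t) at a few epochs.

```python
def decayed_learning_rate(mu_0, d, t):
    """Learning rate at epoch/stage t: mu_0 / (1 + d * t)."""
    return mu_0 / (1.0 + d * t)


# Illustrative values: initial rate 0.01, decrease constant 1e-3.
rates = [decayed_learning_rate(0.01, 1e-3, t) for t in (0, 1000, 9000)]
```

With these values the rate is halved after 1000 epochs and reduced tenfold after 9000, which shows how slowly a decrease constant of this size acts.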
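Number of hidden units
This hyperparameter is very much dataset-dependent. Vaguely speaking, the more complicated the input distribution is, the more capacity the network requires to model it, and so the larger the number of hidden units that will be needed (note that the number of weights in a layer, perhaps a more direct measure of capacity, is $D \times D_h$, where $D$ is the number of inputs and $D_h$ is the number of hidden units).
Unless we employ some regularization scheme (early stopping or L1/L2 penalties), a typical plot of the number of hidden units against generalization performance is U-shaped (that is, the best trade-off lies at some intermediate number of hidden units, and performance worsens toward both ends).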
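As a worked example of this capacity measure, the parameter count of the network built in this tutorial (784 inputs, 500 hidden units, 10 outputs) can be computed directly. The function name below is hypothetical; the layer sizes are the defaults used in test_mlp.

```python
def mlp_param_count(n_in, n_hidden, n_out):
    """Count weights and biases of a one-hidden-layer MLP."""
    hidden = n_in * n_hidden + n_hidden   # hidden-layer weights plus biases
    output = n_hidden * n_out + n_out     # output-layer weights plus biases
    return hidden + output


hidden_weights = 28 * 28 * 500            # 392000 weights in the hidden layer
total_params = mlp_param_count(28 * 28, 500, 10)  # 397510 parameters overall
```

Almost all of the parameters sit in the hidden-layer weight matrix, which is why the number of hidden units dominates the capacity of this model.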
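Regularization parameter
Typical values to try for the L1/L2 regularization parameter $\lambda$ are $10^{-2}, 10^{-3}, \ldots$. In the framework described so far, optimizing this parameter will not lead to significantly better solutions, but it is worth exploring nonetheless.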
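To see how these coefficients enter the cost, the sketch below recomputes the regularized cost on toy numbers in plain Python. The helper names and the toy weight matrices are hypothetical; the formula follows the tutorial code, where the cost is the negative log likelihood plus L1_reg times the L1 norm plus L2_reg times the squared L2 norm of the weight matrices.

```python
def l1_norm(weight_matrices):
    """Sum of absolute values over all weights in all matrices."""
    return sum(abs(w) for W in weight_matrices for row in W for w in row)


def l2_sqr(weight_matrices):
    """Sum of squares over all weights in all matrices."""
    return sum(w * w for W in weight_matrices for row in W for w in row)


def regularized_cost(nll, weight_matrices, L1_reg=0.0, L2_reg=0.0001):
    """Negative log likelihood plus weighted L1 and squared-L2 penalties."""
    return (nll
            + L1_reg * l1_norm(weight_matrices)
            + L2_reg * l2_sqr(weight_matrices))


# Toy weights (illustrative only); biases are not penalized.
W_hidden = [[1.0, -2.0], [0.5, 0.0]]
W_out = [[-1.0, 3.0]]

cost = regularized_cost(2.0, [W_hidden, W_out], L1_reg=0.01, L2_reg=0.001)
```

Here the L1 norm is 7.5 and the squared L2 norm is 15.25, so the penalties add 0.075 and 0.01525 to the data term of 2.0.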
References:
[1] Official site: http://deeplearning.net/tutorial/mlp.html#mlp
[2] Deep learning with Theano, official tutorial in Chinese translation (3): Multilayer Perceptron (MLP): http://www.cnblogs.com/charleshuang/p/3648804.html