Reference: Convolutional Networks for Images, Speech, and Time-Series
Traditional feature extraction: gathers relevant information from the input and eliminates irrelevant variabilities.
Advantages:
1) standard, fully-connected multilayer networks
2) feeding the network with "raw" inputs (e.g. normalized images),
3) backpropagation to turn the first few layers into an appropriate feature extractor
4) a fully-connected first layer with, say, a few hundred hidden units would already contain several tens of thousands of weights
5) learning these weight configurations requires a very large number of training instances to cover the space of possible variations.
6) variables (or pixels) that are spatially or temporally nearby are highly correlated. Local correlations are the reason for the well-known advantages of extracting and combining local features before recognizing spatial or temporal objects. Convolutional networks force the extraction of local features by restricting the receptive fields of hidden units to be local.
7) Convolutional networks combine three architectural ideas to ensure some degree of shift and distortion invariance: local receptive fields, shared weights (or weight replication), and, sometimes, spatial or temporal subsampling.
8) In addition, elementary feature detectors that are useful on one part of the image are likely to be useful across the entire image. This knowledge can be applied by forcing a set of units, whose receptive fields are located at different places on the image, to have identical weight vectors. The outputs of such a set of neurons constitute a feature map.
9) At each position, different types of units in different feature maps compute different types of features.
10) A sequential implementation of this, for each feature map, would be to scan the input image with a single neuron that has a local receptive field, and to store the states of this neuron at corresponding locations in the feature map. This operation is equivalent to a convolution with a small-size kernel, followed by a squashing function.
11) The process can be performed in parallel by implementing the feature map as a plane of neurons that share a single weight vector.
12) A convolutional layer is usually composed of several feature maps (with different weight vectors), so that multiple features can be extracted at each location.
13) Once a feature has been detected, its exact location becomes less important, as long as its approximate position relative to other features is preserved.
14) Therefore, each convolutional layer is followed by an additional layer that performs local averaging and subsampling, reducing the resolution of the feature map and reducing the sensitivity of the output to shifts and distortions.
15) At each layer, the number of feature maps is increased as the spatial resolution is decreased.
16) The weight sharing technique has the interesting side effect of reducing the number of free parameters, thereby reducing the "capacity" of the machine and improving its generalization ability (see the paper's discussion of weight sharing and of learning and generalization).
17) The network in the figure contains about 100,000 connections, but only about 2,600 free parameters, because of the weight sharing.
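Points 10)-14) can be sketched directly in code. The following is a minimal NumPy sketch (kernel size, tanh as the squashing function, and 2x2 averaging are illustrative assumptions, not the paper's exact architecture): one feature map computed by scanning the image with a single shared kernel, followed by a subsampling layer.

```python
import numpy as np

def feature_map(image, kernel, bias):
    """One feature map: convolve a single shared kernel (weight
    replication) over local receptive fields, then apply a squashing
    function (tanh here, as an illustrative choice)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # local receptive field
            out[i, j] = np.tanh(np.sum(patch * kernel) + bias)
    return out

def subsample(fmap, size=2):
    """Local averaging + subsampling: reduces resolution and the
    sensitivity of the output to small shifts and distortions."""
    h = fmap.shape[0] // size * size
    w = fmap.shape[1] // size * size
    f = fmap[:h, :w]
    return f.reshape(h // size, size, w // size, size).mean(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.standard_normal((16, 16))
fm = feature_map(img, rng.standard_normal((5, 5)), 0.1)  # 12x12 map
sm = subsample(fm)                                       # 6x6 map
print(fm.shape, sm.shape)  # (12, 12) (6, 6)
```

A convolutional layer would simply hold several such feature maps, each with its own kernel and bias, so that multiple features are extracted at every location.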
Drawbacks:
1) overfitting problems may occur if training data is scarce
2) memory requirements rule out certain hardware implementations
3) the main deficiency of unstructured nets for image or speech applications is that they have no built-in invariance with respect to translations or local distortions of the inputs
4) a deficiency of fully-connected architectures is that the topology of the input is entirely ignored.
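The parameter counts behind these drawbacks (and behind points 4 and 17 above) can be checked with simple arithmetic. The sizes below are illustrative assumptions (a 16x16 input, 100 hidden units, four 5x5 feature maps), not the exact architecture in the paper:

```python
# Why fully-connected first layers scale badly on images, while
# weight sharing keeps the number of free parameters small.
h_img, w_img = 16, 16
hidden = 100

# Fully connected: one weight per pixel per hidden unit.
fc_weights = h_img * w_img * hidden
print(fc_weights)  # 25600 -> already "several tens of thousands"

# Convolutional layer: 4 feature maps, each sharing one 5x5 kernel + bias.
n_maps, k = 4, 5
conv_free_params = n_maps * (k * k + 1)
print(conv_free_params)  # 104 free parameters

# Connections remain numerous (each output unit reads k*k inputs),
# but free parameters stay tiny because of weight replication.
out = h_img - k + 1
conv_connections = n_maps * out * out * (k * k + 1)
print(conv_connections)  # 14976 connections
```

This mirrors the paper's ratio of many connections to few free parameters: capacity shrinks, generalization improves.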
Some notes:
1) The idea of connecting units to local receptive fields on the input goes back to the Perceptron in the early 60s, and was almost simultaneous with Hubel and Wiesel's discovery of locally-sensitive, orientation-selective neurons in the cat's visual system. With local receptive fields, neurons can extract elementary visual features such as oriented edges, end-points, and corners (or similar features in speech spectrograms).
2) These features are then combined by the higher layers.
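As a toy illustration of extracting oriented edges through local receptive fields, the hand-crafted kernels below are a hypothetical stand-in for learned weights; a real convolutional network would learn such detectors by backpropagation, and higher layers would combine the resulting maps:

```python
import numpy as np

# Two oriented-edge detectors over 3x3 local receptive fields
# (Sobel-style kernels, used here only as a hand-crafted stand-in
# for weights a network would learn).
vertical = np.array([[-1., 0., 1.],
                     [-2., 0., 2.],
                     [-1., 0., 1.]])   # responds to vertical edges
horizontal = vertical.T                # responds to horizontal edges

def convolve2d(image, kernel):
    """Valid convolution: slide the kernel over local patches."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: a single vertical edge down the middle.
img = np.zeros((8, 8))
img[:, 4:] = 1.0

v_map = convolve2d(img, vertical)
h_map = convolve2d(img, horizontal)
print(np.abs(v_map).max(), np.abs(h_map).max())  # 4.0 0.0
```

The vertical detector responds strongly at the edge while the horizontal one stays silent, which is exactly the kind of elementary, location-replicated feature the notes above describe.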