@mShuaiZhao
2018-01-02
PaperReading
CNN
2017.12
The Motivation for CNNs
First, typical images are large, so a fully-connected neural network would need far too many parameters.
Unstructured nets for image or speech applications also have no built-in invariance with respect to translations or local distortions of the input.
In CNNs, shift invariance is automatically obtained by forcing the replication of weight configurations across space.
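To make the first point concrete, a back-of-the-envelope comparison in Python (the 32x32 input, 100 hidden units, and 6 feature maps are illustrative numbers, not taken from the paper):

```python
# Fully-connected first layer on a 32x32 grayscale image
fc_params = (32 * 32) * 100      # 102,400 weights for just 100 hidden units

# Convolutional first layer: 6 feature maps, each with one shared 5x5 kernel
conv_params = 6 * (5 * 5 + 1)    # 156 parameters (weights + biases)

print(fc_params, conv_params)    # 102400 vs 156
```

Weight sharing is what collapses the count: every unit in a feature map reuses the same 5x5 kernel, regardless of the image size.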
Second, unstructured nets ignore the topology of the input.
The input variables can be presented in any order without affecting the outcome of the training.
Images have a strong 2D local structure: variables that are spatially or temporally nearby are highly correlated.
CNN
CNNs combine three architectural ideas to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights, and sub-sampling.
The idea of connecting units to local receptive fields on the inputs goes back to the Perceptron in the early 60s, and was almost simultaneous with Hubel and Wiesel's discovery of locally-sensitive, orientation-selective neurons in the cat's visual system.
Elementary feature detectors that are useful on one part of the image are likely to be useful across the entire image.
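A minimal NumPy sketch of these two ideas: each output unit looks only at a local 5x5 patch of the input, and the same kernel (the shared weights) is reused at every position. The sizes are illustrative.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image: every output unit sees
    only a local receptive field, and all units reuse the same weights
    (the 'replication of weight configurations across space')."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # same weights everywhere
    return out

image = np.random.rand(32, 32)   # e.g. a 32x32 grayscale input
kernel = np.random.randn(5, 5)   # one 5x5 elementary feature detector
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)         # (28, 28)
```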
Once a feature has been detected, its exact location becomes less important.
Not only is the precise position of each of those features irrelevant for identifying the pattern, it is potentially harmful because the positions are likely to vary for different instances of the object.
e.g., instances of the character "7" written at different sizes.
A simple way to achieve this is to reduce the spatial resolution of the feature maps.
This reduces the precision with which the positions of distinctive features are encoded in a feature map.
CNNs do this with sub-sampling layers.
A large degree of invariance to geometric transformations of the input can be achieved with this progressive reduction of spatial resolution, compensated by a progressive increase in the richness of the representation (the number of feature maps).
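In the paper, a sub-sampling unit averages its 2x2 input neighbourhood, scales the average by a trainable coefficient, adds a trainable bias, and applies a squashing function. A minimal NumPy sketch, with the coefficient and bias as fixed placeholders rather than trained parameters:

```python
import numpy as np

def subsample(feature_map, coeff=1.0, bias=0.0):
    """2x2 average pooling as in LeNet's sub-sampling layers: average
    each non-overlapping 2x2 neighbourhood, scale, add a bias, squash."""
    H, W = feature_map.shape
    fm = feature_map[:H - H % 2, :W - W % 2]   # crop to an even size
    pooled = (fm[0::2, 0::2] + fm[0::2, 1::2] +
              fm[1::2, 0::2] + fm[1::2, 1::2]) / 4.0
    return np.tanh(coeff * pooled + bias)      # squashing nonlinearity

fm = np.random.rand(28, 28)
print(subsample(fm).shape)   # (14, 14): resolution halved
```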
LeNet-5
7 layers (not counting the input)
5x5 convolutional filters
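For reference, a minimal PyTorch sketch of the LeNet-5 layer sizes. It simplifies the paper in a few ways: full connectivity between S2 and C3 (the paper uses a partial connection table), plain average pooling instead of trainable sub-sampling, and a linear output layer instead of the RBF output units.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Simplified LeNet-5: alternating convolution and sub-sampling
    layers followed by fully-connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 maps, 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S2: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: 16 maps, 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S4: 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5), # C5: 120 maps, 5x5 -> 1x1
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),        # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
x = torch.randn(1, 1, 32, 32)   # LeNet-5 expects 32x32 inputs
print(model(x).shape)           # torch.Size([1, 10])
```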