@Frankchen
2016-02-25T07:49:19.000000Z
Deep-learning
The goal of unsupervised learning is to create an internal representation of the input that is useful for later supervised or reinforcement learning.
Feed-forward neural networks are the commonest type of neural network in practical applications. Their first layer is the input and the last layer is the output; if there is more than one hidden layer, we call them "deep" neural networks.
Feed-forward neural networks compute a series of transformations that change the similarities between cases (in speech recognition, for example, we'd like the same thing said by different speakers to become more similar, and different things said by the same speaker to become less similar, as we go through the layers). The activities of the neurons in each layer are a non-linear function of the activities in the layer below.
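A minimal sketch of this forward computation; the layer sizes, the random weights, and the choice of the logistic sigmoid as the non-linearity are my own illustrative assumptions, not specifics from the course:

```python
import numpy as np

def forward(x, layers):
    """Run one input vector through a stack of (W, b) layers.

    Each layer's activities are a non-linear function (here the
    logistic sigmoid) of a weighted sum of the activities in the
    layer below.
    """
    a = x
    for W, b in layers:
        a = 1.0 / (1.0 + np.exp(-(W @ a + b)))  # logistic non-linearity
    return a

# A tiny 3-input -> 4-hidden -> 2-output network with random weights.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
y = forward(np.array([1.0, 0.5, -0.3]), layers)
print(y)  # two output activities, each between 0 and 1
```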
Recurrent neural networks, by contrast, have directed cycles in their connection graph. They have complicated dynamics, and this can make them very difficult to train.
Perceptrons were popularized by Frank Rosenblatt in the early 1960s. Then, in 1969, Minsky and Papert published a book called "Perceptrons" that analysed what they could do and showed their limitations. This led many people to conclude, wrongly, that those limitations applied to all neural network models. For example, one layer of perceptrons cannot compute the XOR operation, but two layers can, because XOR can be built out of AND, OR, and NOT operations.
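A sketch of the two-layer construction, assuming binary threshold units; the particular weights and thresholds below are my own illustrative choices:

```python
def step(x):
    """Binary threshold unit: output 1 if the total input is >= 0."""
    return 1 if x >= 0 else 0

def xor(a, b):
    # Hidden layer: OR and AND, each computable by a single perceptron.
    h_or = step(a + b - 0.5)    # a OR b
    h_and = step(a + b - 1.5)   # a AND b
    # Output layer: (a OR b) AND NOT (a AND b) == a XOR b.
    return step(h_or - 2 * h_and - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))  # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```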
The learning algorithm, the perceptron convergence procedure, trains binary threshold output neurons as classifiers. It works as follows:
Pick training cases using any policy that ensures that every training case will keep getting picked.
1. If the output unit is correct, leave its weights alone.
2. If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
3. If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.
This is guaranteed to find a set of weights that gets the right answer for all the training cases if any such set exists.
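The procedure above can be sketched as follows. The AND task, the fixed epoch count, and the convention that a total input of exactly 0 outputs 1 are my own assumptions for illustration:

```python
import numpy as np

def train_perceptron(X, t, epochs=100):
    """Perceptron convergence procedure for a binary threshold unit.

    X: inputs with a constant 1 appended, so the bias is learned as
       an ordinary weight; t: target outputs in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):      # every case keeps getting picked
            y = 1 if w @ x >= 0 else 0
            if y == target:
                continue                 # correct: leave weights alone
            elif target == 1:
                w = w + x                # incorrectly output 0: add the input
            else:
                w = w - x                # incorrectly output 1: subtract the input
    return w

# Learn the (linearly separable) AND function; last column is the bias input.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1])
w = train_perceptron(X, t)
print([(1 if w @ x >= 0 else 0) for x in X])  # [0, 0, 0, 1]
```

Because AND is linearly separable, a correct weight vector exists, so the procedure is guaranteed to find one.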
In the weight-space view, the weights represent a point while each input represents a plane. Another term for what the inputs represent is constraints (each input constrains the set of weights that give the correct classification result).
The picture above (not reproduced here) gives a simple geometric explanation of one limitation of perceptrons: a single layer of perceptrons cannot compute the XOR operation.
Here is a question worth answering: why can't a binary decision unit discriminate between patterns that have the same number of on pixels (assuming translation with wraparound)? My explanation is as follows: since pattern A and pattern B have the same number of on pixels, and both patterns are allowed to wrap around, the number of possible positions of A and of B is the same. As a result, the two patterns cast the same total vote for the weights, so the program cannot distinguish which pattern an input belongs to.
Here is an example from the discussion forum of the course:
"Simplified case: only 5 pixels, want to recognize between two different patterns where two pixels are on (first pixel in first example of each pattern is bolded so that translations can be seen easily):
Pattern A --> [1 1 0 0 0], [0 1 1 0 0], [0 0 1 1 0], [0 0 0 1 1], and [1 0 0 0 1]
Pattern B --> [1 0 1 0 0], [0 1 0 1 0], [0 0 1 0 1], [1 0 0 1 0], and [0 1 0 0 1]
Now, during training, you will input each of the possible positions of pattern A and each of the possible positions of pattern B as training examples. Every time one of the pixels appears positive for a pattern, it will be like adding one vote for that pattern every time that pixel is on (equal to 1). So for pattern A, looking at the first pixel in all 5 examples of the pattern, we find that it is on in 2 of the 5 cases (namely the first and last), meaning two votes for pattern A when the first pixel is on. Similarly, for pattern B, looking at the first pixel in all 5 examples of the pattern, we find that it is on in 2 of the 5 cases (namely first and fourth), meaning two votes for pattern B when the first pixel is on. In fact, regardless of the pattern and the pixel, you will find that there are 2 votes for each pattern should that pixel be on, and nothing to break the tie. On a crude level, having a tie is exactly what it means to be incapable of distinguishing between cases.
For the explanation about the weights, the votes are essentially like the weights Professor Hinton is referring to. In our example, when 2 pixels are on, you get a total of 4 "votes" for each pattern or a total of 2 times the "votes per pixel" which is analogous to the sum of the weights (though vastly simplified). Since neither pattern's votes outweighs the other's, the program cannot distinguish which pattern the input belongs to." (Jason Michael Runkle)
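The tie in this example can be checked directly: summing the on-pixels over all translations of each pattern gives each pixel's total "votes" for that pattern, and the totals come out identical.

```python
import numpy as np

# All translations (with wraparound) of the two patterns from the example.
A = np.array([[1, 1, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1], [1, 0, 0, 0, 1]])
B = np.array([[1, 0, 1, 0, 0], [0, 1, 0, 1, 0], [0, 0, 1, 0, 1],
              [1, 0, 0, 1, 0], [0, 1, 0, 0, 1]])

# Summing over all positions gives the votes each pixel casts per pattern.
votes_A = A.sum(axis=0)
votes_B = B.sum(axis=0)
print(votes_A)  # [2 2 2 2 2]
print(votes_B)  # [2 2 2 2 2]
```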
A neural network without hidden layers is limited in what it can represent, but learning the weights of hidden layers is difficult; so, for a long time, many people believed that perceptrons and neural networks were not promising.