@MitoY
2016-07-23T12:53:32.000000Z
This is a note on the 5th and 6th lectures of CS231n, written by Jing Lei. It focuses on training neural networks.
- One-time setup: mini-batch SGD, activation functions, preprocessing, weight initialization, regularization, gradient checking
- Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
- Evaluation: model ensembles
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
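The four steps of the loop can be sketched end-to-end on a toy problem (the data and linear model here are made up purely for illustration):

```python
import numpy as np

# Mini-batch SGD sketch on a toy linear least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                      # noiseless targets

w = np.zeros(3)
learning_rate, batch_size = 0.1, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    pred = xb @ w                                   # 2. forward prop, get loss
    loss = np.mean((pred - yb) ** 2)
    dw = 2 * xb.T @ (pred - yb) / batch_size        # 3. backprop the gradients
    w -= learning_rate * dw                         # 4. update the parameters
```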
Sigmoid was historically popular, but is no longer recommended. What's wrong with it?
1. Saturated neurons “kill” the gradients.
2. Sigmoid outputs are not zero-centered. When the inputs to a neuron are always positive, the gradients on w during backprop are either all positive or all negative, forcing inefficient zig-zag updates (this is also why you want zero-mean data!).
3. exp() is a bit computationally expensive.
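A quick numeric sketch of point 1, using the sigmoid's local gradient σ(x)·(1 − σ(x)):

```python
import numpy as np

# Why saturated sigmoids "kill" gradients: the local gradient
# sigma(x) * (1 - sigma(x)) is nearly zero far from the origin,
# so upstream gradients get multiplied by ~0 during backprop.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum possible value
print(sigmoid_grad(10.0))  # ~4.5e-05: a saturated neuron passes almost nothing back
```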
ReLU(Rectified Linear Unit):
(advantages)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
(disadvantages)
- Not zero-centered output
- Can "die": a dead ReLU never activates, so its weights never get updated. People therefore like to initialize ReLU neurons with slightly positive biases.
Leaky ReLU will not die. But there is currently no evidence that Leaky ReLU works better than ReLU (according to the lecturer).
Maxout “Neuron”
- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!
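A minimal sketch of a maxout unit with k = 2 linear pieces (all names and values here are illustrative): it takes the max of two affine maps instead of applying a fixed nonlinearity to one dot product.

```python
import numpy as np

# Maxout unit with two pieces: out = max(w1.x + b1, w2.x + b2).
# Both pieces are linear, so the unit never saturates and never dies.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=4), 0.0
W2, b2 = rng.normal(size=4), 0.0
out = max(W1 @ x + b1, W2 @ x + b2)

# ReLU is the special case W2 = 0, b2 = 0: max(w.x + b, 0).
```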
(TLDR) In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid
TLDR for data preprocessing: in practice, for images, do centering (mean subtraction) only.
First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation). This works okay for small networks, but in deeper networks the activations quickly shrink toward zero.
"Xavier initialization"
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
Reasoning: for s = sum_i w_i * x_i, Var(s) = D * Var(w) * Var(x), where D = W.shape[0] (i.e. fan_in), and np.random.randn draws weights with variance 1. Dividing by sqrt(D) makes Var(w) = 1/D, which fixes Var(s) = Var(x).
For tanh networks it works well, but with ReLU it breaks (ReLU zeroes half its inputs, halving the variance; He et al. suggest compensating by dividing by np.sqrt(fan_in / 2) instead).
Proper initialization is (still) an active area of research...
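A small experiment in the spirit of the lecture's activation-statistics demo, comparing the two initializations in a deep tanh network (layer sizes and depth here are arbitrary):

```python
import numpy as np

# Push data through 10 tanh layers and watch the activation std.
# With 0.01 * randn the activations collapse toward zero;
# with Xavier scaling (1/sqrt(fan_in)) the spread stays healthy.
rng = np.random.default_rng(0)

def deep_tanh_std(scale, fan_in=500, layers=10):
    x = rng.normal(size=(1000, fan_in))
    for _ in range(layers):
        W = rng.normal(size=(fan_in, fan_in)) * scale(fan_in)
        x = np.tanh(x @ W)
    return x.std()

small = deep_tanh_std(lambda d: 0.01)              # collapses toward 0
xavier = deep_tanh_std(lambda d: 1.0 / np.sqrt(d)) # keeps a reasonable scale
```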
“you want unit gaussian activations? just make them so.”
And then allow the network to squash the range if it wants to
A Batch Normalization layer is usually inserted after fully connected layers and before the nonlinearity.
Advantages:
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe?
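The training-time forward pass can be sketched as follows (a simplified version, omitting the running statistics that a real layer keeps for test time):

```python
import numpy as np

# Batchnorm forward pass (training time): normalize each feature to zero
# mean / unit variance over the mini-batch, then let the network squash
# and shift the range via learned gamma and beta.
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # unit-gaussian activations
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
```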
Step 1: Preprocess the data
Step 2: Choose the architecture
Double check that the loss is reasonable
Let's try to train now...
Tip: make sure that you can overfit a very small portion of the training data first; use it to tune the learning rate and other hyperparameters.
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2/Dropout strength)
Note that it's best to optimize hyperparameters in log space, and it's better to use random search instead of grid search: random search gives you more information about each individual dimension of the hyperparameter space.
max_count = 100
for count in xrange(max_count):
    reg = 10.0 ** np.random.uniform(-5, 5)
    lr = 10.0 ** np.random.uniform(-3, -6)
    ...

Vanilla SGD update:
x += -learning_rate * dx
dx refers to the gradient of the loss with respect to the parameter x.
Momentum update:
mu = 0.9  # damping / friction coefficient
v = mu * v - learning_rate * dx
x += v
Nesterov Momentum update
(NAG = Nesterov Accelerated Gradient)
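Nesterov momentum evaluates the gradient at the "looked-ahead" position; written in the variable-transformed form used in the CS231n notes (so dx can be computed at the current x), it can be sketched on a toy quadratic:

```python
# Nesterov momentum sketch, minimizing f(x) = x^2 (so dx = 2x).
# The toy function and hyperparameters here are illustrative.
mu, learning_rate = 0.9, 0.1
x, v = 5.0, 0.0
for _ in range(300):
    dx = 2 * x                        # gradient at the current parameters
    v_prev = v
    v = mu * v - learning_rate * dx   # velocity update, same as vanilla momentum
    x += -mu * v_prev + (1 + mu) * v  # position update with lookahead correction
```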

AdaGrad update
cache += dx**2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)
RMSProp update:
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)
An improvement over AdaGrad, introduced on a slide in Geoff Hinton's Coursera class, Lecture 6.
Adam update (simplified, without the bias-correction step):
m = beta1 * m + (1 - beta1) * dx  # momentum-like
v = beta2 * v + (1 - beta2) * np.square(dx)  # RMSProp-like
next_x = x - learning_rate * m / (np.sqrt(v) + 1e-7)
Looks a bit like RMSProp with momentum.
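A runnable sketch of this update (still without full Adam's bias correction) on a toy 1-D quadratic; the function and hyperparameter values are illustrative:

```python
import numpy as np

# Simplified Adam-style update minimizing f(x) = (x - 3)^2, so dx = 2(x - 3).
beta1, beta2, learning_rate = 0.9, 0.999, 0.01
x, m, v = 0.0, 0.0, 0.0
for _ in range(2000):
    dx = 2 * (x - 3)
    m = beta1 * m + (1 - beta1) * dx             # momentum-like first moment
    v = beta2 * v + (1 - beta2) * np.square(dx)  # RMSProp-like second moment
    x = x - learning_rate * m / (np.sqrt(v) + 1e-7)
```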
Fun Tips/Tricks:
- You can also get a small boost from averaging multiple model checkpoints of a single model.
- Keep track of (and use at test time) a running average of the parameter vector.
Dropout: Randomly set some neurons to zero in the forward pass.
It forces the network to have a redundant representation (improves regularization).
Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model, gets trained on only ~one datapoint.
At test time all neurons are active always
=> We must scale the activations so that for each neuron:
output at test time = expected output at training time
It is more common to use inverted dropout, which does this scaling at training time, rather than scaling the activations at test time.