@MitoY · 2016-07-23T12:53:32.000000Z · 4705 words · 1759 reads

Notes on CS231n (3)

This is a note on the 5th and 6th lectures of CS231n by Jing lei, focused on training neural networks.

Overview

  1. One time setup
    mini-batch SGD, activation functions, preprocessing, weight
    initialization, regularization, gradient checking

  2. Training dynamics
    babysitting the learning process,
    parameter updates, hyperparameter optimization

  3. Evaluation
    model ensembles

Mini-batch SGD

Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
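The four-step loop above can be sketched end-to-end on a toy 1-D linear regression (the data, learning rate, and batch size here are hypothetical, chosen only to exercise the loop):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3*x + noise (hypothetical, just to exercise the loop)
X = rng.standard_normal((256, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.standard_normal(256)

w = np.zeros(1)          # the single parameter we are fitting
learning_rate = 0.1
batch_size = 32

for step in range(200):
    # 1. sample a batch of data
    idx = rng.integers(0, X.shape[0], size=batch_size)
    xb, yb = X[idx, 0], y[idx]
    # 2. forward prop it through the graph, get loss (mean squared error)
    pred = w[0] * xb
    loss = np.mean((pred - yb) ** 2)
    # 3. backprop to calculate the gradient of the loss w.r.t. w
    dw = np.mean(2 * (pred - yb) * xb)
    # 4. update the parameters using the gradient
    w -= learning_rate * dw

print(float(w[0]))  # should end up close to the true slope 3.0
```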

Activation Functions

Sigmoid was historically popular but is no longer used. What's wrong with it?
1. Saturated neurons "kill" the gradients.
2. Sigmoid outputs are not zero-centered. When the input to a neuron is always positive, what can we say about the gradients on w? They are always all positive or all negative :( (this is also why you want zero-mean data!).
3. exp() is a bit compute-expensive.
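The saturation problem in point 1 is easy to see numerically: the local gradient of the sigmoid is at most 0.25 and collapses toward zero in the tails, so almost nothing flows back through a saturated neuron:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Local gradient of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

# Near zero the gradient is healthy; in the saturated tails it vanishes.
print(sigmoid_grad(0.0))   # 0.25, the maximum possible value
print(sigmoid_grad(10.0))  # ~4.5e-05: the neuron is saturated
```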

ReLU(Rectified Linear Unit):
(advantages)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
(disadvantages)
- Not zero-centered output
- Can make neurons "die": a dead ReLU will never activate again. For this reason people like to initialize ReLU neurons with slightly positive biases.

Leaky ReLU will not die, but there is currently no evidence that Leaky ReLU works better than ReLU (according to the lecturer).

Maxout “Neuron”
- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!
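A minimal sketch of a Maxout unit with k = 2 pieces, i.e. the elementwise max of two affine maps (the weights here are hypothetical and chosen so the unit reduces to ReLU, which shows how Maxout generalizes it):

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Maxout with k=2: elementwise max of two affine maps.
    # With W2=0, b2=0 this reduces to ReLU; with W2 = a*W1 it is leaky ReLU.
    return np.maximum(x @ W1 + b1, x @ W2 + b2)

x = np.array([[1.0, -2.0]])
W1 = np.eye(2); b1 = np.zeros(2)         # identity affine map
W2 = np.zeros((2, 2)); b2 = np.zeros(2)  # zero map -> ReLU behavior
print(maxout(x, W1, b1, W2, b2))         # [[1. 0.]]
```

The price of never saturating or dying is that each Maxout unit doubles the number of parameters per neuron.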

(TLDR) In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid

Data Preprocessing

TLDR: In practice for Images: center only
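"Center only" just means subtracting a mean computed over the training set; a sketch on a hypothetical batch of CIFAR-10-sized images:

```python
import numpy as np

# A hypothetical batch of 8 RGB images, 32x32 (e.g. CIFAR-10-sized).
X = np.random.rand(8, 32, 32, 3).astype(np.float32) * 255.0

# "Center only": subtract the mean image computed over the training set.
mean_image = X.mean(axis=0)
X_centered = X - mean_image

# A common alternative: subtract a per-channel mean (just 3 numbers).
per_channel_mean = X.mean(axis=(0, 1, 2))
X_centered_pc = X - per_channel_mean

print(np.allclose(X_centered.mean(axis=0), 0.0, atol=1e-3))  # True
```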

Weight Initialization

  W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

This works well for tanh networks, but it breaks when using ReLU: the activations shrink toward zero with depth.
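This breakdown is easy to reproduce. In the sketch below (hypothetical widths and depth), the Xavier scale `1/sqrt(fan_in)` makes ReLU activations collapse toward zero layer by layer, while scaling by `sqrt(2/fan_in)` (the fix proposed by He et al. for ReLU networks) roughly preserves their spread:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((1000, 500))  # hypothetical input batch

def layer_stds(init_scale, depth=10, fan=500):
    # Push the batch through `depth` ReLU layers and record activation stds.
    h = x
    stds = []
    for _ in range(depth):
        W = rng.standard_normal((fan, fan)) * init_scale(fan)
        h = np.maximum(0, h @ W)  # ReLU nonlinearity
        stds.append(h.std())
    return stds

xavier = layer_stds(lambda n: 1.0 / np.sqrt(n))  # shrinks each layer
he = layer_stds(lambda n: np.sqrt(2.0 / n))      # roughly preserved

print(xavier[-1], he[-1])  # tiny vs. order-one
```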

Proper initialization is (still) an active area of research...

Batch Normalization

“you want unit gaussian activations? just make them so.”

And then allow the network to squash the range if it wants to

The Batch Normalization layer is usually inserted after fully connected (or convolutional) layers, and before the nonlinearity.

Advantages:
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe?
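A sketch of the batchnorm forward pass at training time: normalize each feature to zero mean and unit variance over the batch, then let learnable gamma and beta rescale and shift (i.e. squash the range) if the network wants to:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Per-feature mean and variance computed over the mini-batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # unit gaussian activations
    return gamma * x_hat + beta             # learnable rescale and shift

x = np.random.randn(64, 10) * 5.0 + 3.0     # badly scaled activations
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(), out.std())                # ~0 and ~1
```

At test time the batch statistics are replaced by running averages accumulated during training.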

Babysitting the Learning Process

Step 1: Preprocess the data
Step 2: Choose the architecture
Double-check that the initial loss is reasonable.
Let's try to train now...
Tip: make sure that you can overfit a very small portion of the training data. Then tune the learning rate and other hyperparameters.

Hyperparameter Optimization

Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2/Dropout strength)

Note it's best to optimize in log space, and it's better to use random search instead of grid search: random search gives you more information about each dimension of the hyperparameter space.

  max_count = 100
  for count in xrange(max_count):
      reg = 10.0 ** np.random.uniform(-5, 5)
      lr = 10.0 ** np.random.uniform(-3, -6)
      ...


Parameter Update Schemes

Vanilla SGD:

  x += -learning_rate * dx

Here dx refers to the gradient of the loss with respect to the parameter x.

Momentum update:

  mu = 0.9  # momentum coefficient ("friction")
  v = mu * v - learning_rate * dx
  x += v
AdaGrad update:

  cache += dx**2
  x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)
RMSProp update:

  cache = decay_rate * cache + (1 - decay_rate) * dx**2
  x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)

Improved from AdaGrad, introduced in a slide in Geoff Hinton’s Coursera class, lecture 6.

Adam update:

  m = beta1 * m + (1 - beta1) * dx             # momentum
  v = beta2 * v + (1 - beta2) * np.square(dx)  # RMSProp-like
  next_x = x - learning_rate * m / (np.sqrt(v) + 1e-7)

Looks a bit like RMSProp with momentum.
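The Adam rule above can be exercised end-to-end on a toy 1-D quadratic (the target function, learning rate, and step count here are hypothetical; this sketch also omits the bias-correction terms of the full Adam algorithm, as the snippet above does):

```python
import numpy as np

# Minimize f(x) = (x - 5)^2 with the Adam-style rule above.
x = 0.0
m, v = 0.0, 0.0
beta1, beta2, learning_rate, eps = 0.9, 0.999, 0.1, 1e-7

for t in range(500):
    dx = 2.0 * (x - 5.0)                   # gradient of f at x
    m = beta1 * m + (1 - beta1) * dx       # momentum-like first moment
    v = beta2 * v + (1 - beta2) * dx ** 2  # RMSProp-like second moment
    x = x - learning_rate * m / (np.sqrt(v) + eps)

print(x)  # close to the minimum at x = 5
```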

Evaluation: Model Ensembles

  1. Train multiple independent models
  2. At test time average their results
    (Enjoy 2% extra performance)

Fun tips/tricks:
- You can also get a small boost from averaging multiple model checkpoints of a single model.
- Keep track of (and use at test time) a running average of the parameter vector.

Regularization (dropout)

Dropout: Randomly set some neurons to zero in the forward pass.

It forces the network to have a redundant representation (improves regularization).

Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model, gets trained on only ~one datapoint.

At test time all neurons are always active,
=> so we must scale the activations so that, for each neuron,
output at test time = expected output at training time.
In practice it is more common to use inverted dropout, which does this scaling at training time instead.
