@MitoY
2016-07-23T12:53:32.000000Z
This is a note on the 5th and 6th lectures of CS231n, written by Jing Lei. It focuses on training neural networks.
- One-time setup: mini-batch SGD, activation functions, preprocessing, weight initialization, regularization, gradient checking
- Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
- Evaluation: model ensembles
Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
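The four steps of the loop can be sketched end-to-end on a toy problem (the data and linear model here are made up purely for illustration):

```python
import numpy as np

# Mini-batch SGD sketch on a toy linear least-squares problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                                      # noiseless targets

w = np.zeros(3)
learning_rate, batch_size = 0.1, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    pred = xb @ w                                   # 2. forward prop, get loss
    loss = np.mean((pred - yb) ** 2)
    dw = 2 * xb.T @ (pred - yb) / batch_size        # 3. backprop the gradients
    w -= learning_rate * dw                         # 4. update the parameters
```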
Sigmoid was historically popular, but is no longer recommended. What's wrong with it?
1. Saturated neurons “kill” the gradients.
2. Sigmoid outputs are not zero-centered. When the inputs to a neuron are always positive, the gradients on w during backprop are either all positive or all negative, forcing inefficient zig-zag updates (this is also why you want zero-mean data!).
3. exp() is a bit computationally expensive.
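A quick numeric sketch of point 1, using the sigmoid's local gradient σ(x)·(1 − σ(x)):

```python
import numpy as np

# Why saturated sigmoids "kill" gradients: the local gradient
# sigma(x) * (1 - sigma(x)) is nearly zero far from the origin,
# so upstream gradients get multiplied by ~0 during backprop.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum possible value
print(sigmoid_grad(10.0))  # ~4.5e-05: a saturated neuron passes almost nothing back
```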
ReLU(Rectified Linear Unit):
(advantages)
- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
(disadvantages)
- Not zero-centered output
- Can "die": a dead ReLU never activates, so its weights never get updated. People therefore like to initialize ReLU neurons with slightly positive biases.
Leaky ReLU will not die. But there is currently no evidence that Leaky ReLU works better than ReLU (according to the lecturer).
Maxout “Neuron”
- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear Regime! Does not saturate! Does not die!
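A minimal sketch of a maxout unit with k = 2 linear pieces (all names and values here are illustrative): it takes the max of two affine maps instead of applying a fixed nonlinearity to one dot product.

```python
import numpy as np

# Maxout unit with two pieces: out = max(w1.x + b1, w2.x + b2).
# Both pieces are linear, so the unit never saturates and never dies.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=4), 0.0
W2, b2 = rng.normal(size=4), 0.0
out = max(W1 @ x + b1, W2 @ x + b2)

# ReLU is the special case W2 = 0, b2 = 0: max(w.x + b, 0).
```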
(TLDR) In practice:
- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid
TLDR for data preprocessing: in practice, for images, do centering (mean subtraction) only.
First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation). This works okay for small networks, but in deeper networks the activations quickly shrink toward zero.
"Xavier initialization"
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
Reasoning: for s = sum_i w_i * x_i, Var(s) = D * Var(w) * Var(x), where D = W.shape[0] (i.e. fan_in), and np.random.randn draws weights with variance 1. Dividing by sqrt(D) makes Var(w) = 1/D, which fixes Var(s) = Var(x).
For tanh networks it works well, but with ReLU it breaks (ReLU zeroes half its inputs, halving the variance; He et al. suggest compensating by dividing by np.sqrt(fan_in / 2) instead).
Proper initialization is (still) an active area of research...
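A small experiment in the spirit of the lecture's activation-statistics demo, comparing the two initializations in a deep tanh network (layer sizes and depth here are arbitrary):

```python
import numpy as np

# Push data through 10 tanh layers and watch the activation std.
# With 0.01 * randn the activations collapse toward zero;
# with Xavier scaling (1/sqrt(fan_in)) the spread stays healthy.
rng = np.random.default_rng(0)

def deep_tanh_std(scale, fan_in=500, layers=10):
    x = rng.normal(size=(1000, fan_in))
    for _ in range(layers):
        W = rng.normal(size=(fan_in, fan_in)) * scale(fan_in)
        x = np.tanh(x @ W)
    return x.std()

small = deep_tanh_std(lambda d: 0.01)              # collapses toward 0
xavier = deep_tanh_std(lambda d: 1.0 / np.sqrt(d)) # keeps a reasonable scale
```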
“you want unit gaussian activations? just make them so.”
And then allow the network to squash the range if it wants to
A Batch Normalization layer is usually inserted after fully connected layers and before the nonlinearity.
Advantages:
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe?
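The training-time forward pass can be sketched as follows (a simplified version, omitting the running statistics that a real layer keeps for test time):

```python
import numpy as np

# Batchnorm forward pass (training time): normalize each feature to zero
# mean / unit variance over the mini-batch, then let the network squash
# and shift the range via learned gamma and beta.
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # unit-gaussian activations
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))
out = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
```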
Step 1: Preprocess the data
Step 2: Choose the architecture
Double check that the loss is reasonable
Let's try to train now...
Tip: make sure that you can overfit a very small portion of the training data first; use it to tune the learning rate and other hyperparameters.
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2/Dropout strength)
Note that it's best to optimize hyperparameters in log space, and it's better to use random search instead of grid search: random search gives you more information about each individual dimension of the hyperparameter space.
max_count = 100
for count in xrange(max_count):
    reg = 10.0 ** np.random.uniform(-5, 5)
    lr = 10.0 ** np.random.uniform(-3, -6)
    ...

Vanilla SGD update:
x += -learning_rate * dx
dx refers to the gradient of the loss with respect to the parameter x.
Momentum update:
mu = 0.9  # damping / friction coefficient
v = mu * v - learning_rate * dx
x += v
Nesterov Momentum update
(NAG = Nesterov Accelerated Gradient)
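Nesterov momentum evaluates the gradient at the "looked-ahead" position; written in the variable-transformed form used in the CS231n notes (so dx can be computed at the current x), it can be sketched on a toy quadratic:

```python
# Nesterov momentum sketch, minimizing f(x) = x^2 (so dx = 2x).
# The toy function and hyperparameters here are illustrative.
mu, learning_rate = 0.9, 0.1
x, v = 5.0, 0.0
for _ in range(300):
    dx = 2 * x                        # gradient at the current parameters
    v_prev = v
    v = mu * v - learning_rate * dx   # velocity update, same as vanilla momentum
    x += -mu * v_prev + (1 + mu) * v  # position update with lookahead correction
```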

AdaGrad update
cache += dx**2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)
RMSProp update:
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += -learning_rate * dx / (np.sqrt(cache) + 1e-7)
An improvement over AdaGrad, introduced on a slide in Geoff Hinton's Coursera class, Lecture 6.
Adam update (simplified, without the bias-correction step):
m = beta1 * m + (1 - beta1) * dx  # momentum-like
v = beta2 * v + (1 - beta2) * np.square(dx)  # RMSProp-like
next_x = x - learning_rate * m / (np.sqrt(v) + 1e-7)
Looks a bit like RMSProp with momentum.
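A runnable sketch of this update (still without full Adam's bias correction) on a toy 1-D quadratic; the function and hyperparameter values are illustrative:

```python
import numpy as np

# Simplified Adam-style update minimizing f(x) = (x - 3)^2, so dx = 2(x - 3).
beta1, beta2, learning_rate = 0.9, 0.999, 0.01
x, m, v = 0.0, 0.0, 0.0
for _ in range(2000):
    dx = 2 * (x - 3)
    m = beta1 * m + (1 - beta1) * dx             # momentum-like first moment
    v = beta2 * v + (1 - beta2) * np.square(dx)  # RMSProp-like second moment
    x = x - learning_rate * m / (np.sqrt(v) + 1e-7)
```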
Fun Tips/Tricks:
- You can also get a small boost from averaging multiple model checkpoints of a single model.
- Keep track of (and use at test time) a running average of the parameter vector.
Dropout: Randomly set some neurons to zero in the forward pass.
It forces the network to have a redundant representation (improves regularization).
Another interpretation: Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model, gets trained on only ~one datapoint.
At test time all neurons are active always
=> We must scale the activations so that for each neuron:
output at test time = expected output at training time
It is more common to use inverted dropout, which does this scaling at training time, rather than scaling the activations at test time.