@MitoY 2016-07-17T14:35:03.000000Z

Notes on CS231n (2)

These are notes on the first four lectures of CS231n, by Jing Lei.

Notations

Throughout, $X$ is the $N \times D$ data matrix with one example $x_i$ per row, $y$ is the vector of labels, $W$ is a $D \times C$ weight matrix, and $S = XW$ is the $N \times C$ score matrix.

Optimization

Gradient descent

The general method to minimize a loss function is gradient descent:

$$w \leftarrow w - \alpha \nabla_w L(w)$$

The step size $\alpha$ is also called the learning rate.
There are two ways to compute $\nabla_w L$: numeric and analytic. In practice: always use the analytic gradient, but check the implementation with the numerical gradient. This is called a gradient check.
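A gradient check can be sketched as follows. This is a minimal illustration of my own, using a toy quadratic loss whose analytic gradient is known to be $2w$:

```python
import numpy as np

def num_grad(f, w, h=1e-5):
    """Centered-difference numerical gradient of f at w."""
    grad = np.zeros_like(w)
    it = np.nditer(w, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = w[ix]
        w[ix] = old + h
        fp = f(w)                 # f(w + h) at this coordinate
        w[ix] = old - h
        fm = f(w)                 # f(w - h) at this coordinate
        w[ix] = old               # restore
        grad[ix] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# Check the analytic gradient of f(w) = sum(w**2), which is 2*w.
w = np.random.randn(3, 4)
analytic = 2 * w
numeric = num_grad(lambda w: np.sum(w ** 2), w)
rel_error = np.max(np.abs(analytic - numeric)
                   / (np.abs(analytic) + np.abs(numeric) + 1e-8))
```

A small relative error (say, below `1e-4`) indicates the analytic gradient is implemented correctly.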

Backpropagation

Chain rule:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial x}$$

Each node in the computational graph multiplies the gradient flowing in from above by its local gradient.
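A tiny worked example (a toy graph of my own choosing): with $q = x + y$ and $f = qz$, the gradient of $f$ with respect to each input is the upstream gradient times the local gradient.

```python
# Chain rule on a tiny graph: q = x + y, f = q * z.
x, y, z = -2.0, 5.0, -4.0
q = x + y              # forward pass: q = 3
f = q * z              # forward pass: f = -12
df_dq = z              # local gradient of f w.r.t. q
df_dz = q              # local gradient of f w.r.t. z
df_dx = df_dq * 1.0    # chain rule: df/dx = df/dq * dq/dx, with dq/dx = 1
df_dy = df_dq * 1.0    # likewise for y
```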

Mini-batch SGD

Mini-batch Stochastic Gradient Descent - to speed up training, use only a small portion of the training set to compute each gradient step.
Common mini-batch sizes are 32/64/128 examples.

loop:
Sample a batch of data (randomly)
Forward prop it through the graph, get loss
Backprop to calculate the gradient
Update the parameters using the gradient
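The loop above can be sketched in NumPy. The least-squares loss, learning rate, and synthetic data here are illustrative choices of my own, not from the notes:

```python
import numpy as np

# Mini-batch SGD skeleton on a synthetic linear-regression problem.
rng = np.random.RandomState(0)
X = rng.randn(256, 5)
true_w = rng.randn(5)
y = X.dot(true_w)                  # noiseless linear targets
w = np.zeros(5)
lr, batch_size = 0.1, 32

for step in range(500):
    idx = rng.choice(len(X), batch_size, replace=False)  # sample a batch
    Xb, yb = X[idx], y[idx]
    pred = Xb.dot(w)                                     # forward prop
    grad = Xb.T.dot(pred - yb) / batch_size              # backprop
    w -= lr * grad                                       # parameter update
```

After enough steps, `w` converges to `true_w` on this noiseless problem.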

Linear SVM

A linear classifier that uses the Multiclass SVM loss function.

Score function:

$$s = xW$$

where $x$ is one example as a row vector and $W$ is the $D \times C$ weight matrix.

Loss function:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta), \qquad L = \frac{1}{N}\sum_i L_i$$

Below we use $S$ to denote the score matrix $XW$. And we ignore the regularization loss.

Gradient (two parts):

$$\nabla_{w_j} L_i = \mathbb{1}(s_j - s_{y_i} + \Delta > 0)\, x_i^T \quad (j \neq y_i)$$

$$\nabla_{w_{y_i}} L_i = -\Big(\sum_{j \neq y_i} \mathbb{1}(s_j - s_{y_i} + \Delta > 0)\Big)\, x_i^T$$

You should combine these two parts together to get the correct gradient.
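Before vectorizing (which the Vectorization section below works through), the two parts can be sketched as a per-example loop. This is a sketch of my own; regularization is omitted as above:

```python
import numpy as np

def svm_loss_naive(W, X, y, delta=1.0):
    """Multiclass SVM loss and gradient, looped over examples.
    W: (D, C) weights; X: (N, D) data; y: (N,) labels."""
    N, C = X.shape[0], W.shape[1]
    loss, dW = 0.0, np.zeros_like(W)
    for i in range(N):
        scores = X[i].dot(W)
        for j in range(C):
            if j == y[i]:
                continue
            margin = scores[j] - scores[y[i]] + delta
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]       # part 1: wrong-class column
                dW[:, y[i]] -= X[i]    # part 2: correct-class column
    return loss / N, dW / N
```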

Softmax

A linear classifier that uses the Softmax + Cross-entropy loss function.

Score function:
We denote the SVM scores as $s$, and the Softmax scores as $p$:

$$p_k = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

Loss function:

$$L_i = -\log p_{y_i}, \qquad L = \frac{1}{N}\sum_i L_i$$

Gradient (two parts):

$$\nabla_{w_k} L_i = p_k\, x_i^T \quad (k \neq y_i), \qquad \nabla_{w_{y_i}} L_i = (p_{y_i} - 1)\, x_i^T$$

You should combine the two parts together to get the correct gradient.
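In terms of the score matrix $S$, the combined gradient is simply $p_k - \mathbb{1}(k = y_i)$. A sketch, with toy values of `S` and `y` of my own choosing:

```python
import numpy as np

# Softmax loss and its gradient w.r.t. the score matrix S (one row per example).
S = np.array([[1.0, 2.0, 3.0]])
y = np.array([2])
N = S.shape[0]
P = np.exp(S - S.max(axis=1, keepdims=True))   # shifted for numeric stability
P /= P.sum(axis=1, keepdims=True)              # Softmax scores p
loss = -np.mean(np.log(P[np.arange(N), y]))
dS = P.copy()
dS[np.arange(N), y] -= 1   # p_k for k != y_i, and p_{y_i} - 1 at the correct class
dS /= N
```

Note that each row of `dS` sums to zero, a handy sanity check.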

Numeric stability:
To avoid overflow while calculating the Softmax scores, we often subtract a constant $C$ from every score, typically $C = \max_j s_j$; the Softmax scores remain unchanged because $\frac{e^{s_k - C}}{\sum_j e^{s_j - C}} = \frac{e^{s_k}}{\sum_j e^{s_j}}$.
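A quick demonstration (the large scores below are an illustrative example):

```python
import numpy as np

# Large scores overflow a naive softmax; subtracting the max fixes it.
s = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over='ignore'):
    naive = np.exp(s)                  # overflows to [inf, inf, inf]
stable = np.exp(s - s.max())           # subtract C = max_j s_j first
p = stable / stable.sum()              # finite, identical in exact arithmetic
```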

Two layer network

Input (X) → layer 1 (X1) → ReLU (X2) → layer 2 (S) → Softmax loss

Score function:

$$X_1 = XW_1, \qquad X_2 = \max(0, X_1), \qquad S = X_2 W_2$$

Loss function: (Softmax)

Gradient:
We already know how to compute the Softmax gradient. That is, we already know $\frac{\partial L}{\partial S}$. Then by the chain rule, we have

$$\frac{\partial L}{\partial W_2} = X_2^T \frac{\partial L}{\partial S}, \qquad \frac{\partial L}{\partial X_1} = \left(\frac{\partial L}{\partial S} W_2^T\right) \odot \mathbb{1}(X_1 > 0), \qquad \frac{\partial L}{\partial W_1} = X^T \frac{\partial L}{\partial X_1}$$

If we denote $\frac{\partial L}{\partial X}$ by dX for any variable $X$, then the code could be written as:

```python
# numpy code
dW2 = X2.T.dot(dS)               # gradient of the layer-2 weights
dX1 = dS.dot(W2.T) * (X1 > 0)    # backprop through layer 2, then the ReLU mask
dW1 = X.T.dot(dX1)               # gradient of the layer-1 weights
```
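The backward code needs a matching forward pass. A self-contained sketch, where the biases are omitted and the layer sizes are illustrative choices of my own:

```python
import numpy as np

rng = np.random.RandomState(0)
N, D, H, C = 4, 5, 10, 3
X = rng.randn(N, D)
W1 = rng.randn(D, H) * 0.01
W2 = rng.randn(H, C) * 0.01
y = rng.randint(C, size=N)

# Forward pass
X1 = X.dot(W1)            # layer 1
X2 = np.maximum(0, X1)    # ReLU
S = X2.dot(W2)            # layer 2 scores

# Softmax loss and gradient on the scores
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
loss = -np.mean(np.log(P[np.arange(N), y]))
dS = (P - np.eye(C)[y]) / N

# Backward pass from the notes
dW2 = X2.T.dot(dS)
dX1 = dS.dot(W2.T) * (X1 > 0)
dW1 = X.T.dot(dX1)
```

With such small initial weights the scores are near zero, so the initial loss is close to $\log C$, another useful sanity check.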

Vectorization

case 1

How do you write a really complex formula in a vectorized form? For example,

$$\frac{\partial L}{\partial w_j} = \frac{1}{N}\sum_i \mathbb{1}(s_{i,j} - s_{i,y_i} + \Delta > 0)\, x_i^T \quad (j \neq y_i)$$

This is one of the two parts of the gradient of the SVM loss function. It's easy to write a loop to do the calculation, but vectorization really boosts the speed of your program. It once took me quite a while to figure out the vectorization, but in fact it's not that difficult. Notice that the condition in the formula depends only on two indices $i$ and $j$, and its value is either True or False. Then we can collect it into a matrix:

$$B_{ij} = \mathbb{1}(s_{i,j} - s_{i,y_i} + \Delta > 0), \qquad B_{i,y_i} = 0$$

In numpy this is written as

```python
# numpy code
correct_class_score = scores[np.arange(num_train), y].reshape(-1, 1)
margins = scores - correct_class_score + delta
margins[np.arange(num_train), y] = 0   # zero out the correct-class entries
B = (margins > 0)
```

Then you just multiply $B$ with $X^T$. So

```python
# numpy code
dW += np.dot(X.T, B) / num_train
```

case 2

The second part of the gradient of the SVM loss function:

$$\nabla_{w_{y_i}} L \mathrel{+}= -\frac{1}{N} M_i\, x_i^T, \qquad M_i = \sum_{j \neq y_i} \mathbb{1}(s_{i,j} - s_{i,y_i} + \Delta > 0)$$

Notice that $M_i$ depends only on one index $i$, so we use a vector $M$ to denote it.

```python
# numpy code
M = np.sum(margins > 0, axis=1)
```

Then example $i$ contributes $-M_i x_i^T$ to column $y_i$ of the gradient. Summing over $i$ we have

$$\frac{\partial L}{\partial w_k} \mathrel{+}= -\frac{1}{N} \sum_{i:\, y_i = k} M_i\, x_i^T$$

How to vectorize this formula?
We define another matrix $B$, with $B_{ki} = \mathbb{1}(y_i = k)$, to handle it.

```python
# numpy code
B = (np.arange(num_classes).reshape(num_classes, 1) == y)
```

Having this matrix $B$, we can write the formula as a simple matrix multiplication:

$$\frac{\partial L}{\partial W} \mathrel{+}= -\frac{1}{N}\, (X^T \odot M)\, B^T$$

where $X^T \odot M$ scales column $i$ of $X^T$ by $M_i$. This is simply

```python
# numpy code
dW += -np.dot(X.T * M, B.T) / num_train
```
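As an aside (a consolidation of my own, not from the notes): both parts can be folded into a single coefficient matrix, since part 1 contributes $+1$ entries for the wrong classes and part 2 contributes $-M_i$ at the correct class.

```python
import numpy as np

def svm_grad_vectorized(W, X, y, delta=1.0):
    """Both parts of the SVM gradient via one coefficient matrix."""
    N = X.shape[0]
    scores = X.dot(W)
    correct = scores[np.arange(N), y].reshape(-1, 1)
    margins = scores - correct + delta
    margins[np.arange(N), y] = 0
    coeff = (margins > 0).astype(float)          # part 1: indicators for j != y_i
    coeff[np.arange(N), y] = -coeff.sum(axis=1)  # part 2: -M_i at the correct class
    return X.T.dot(coeff) / N
```

This produces the same result as adding the two parts separately.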

Accuracies in practice

I trained these models on CIFAR-10 and tried to find the best hyperparameters. The accuracies on the validation set are

| model | accuracy |
| --- | --- |
| kNN | 0.33 |
| SVM | 0.40 |
| Softmax | 0.40 |
| 2-layer network | 0.53 |
| features | 0.59 |

The ``features'' model is a 2-layer network classifier trained on Histogram of Oriented Gradients (HOG) features together with a color histogram over the hue channel of HSV color space, computed from the CIFAR-10 images.
