@MitoY
2016-07-17T14:35:03.000000Z
These are notes on the first four lectures of CS231n, by Jing Lei.
The general method to minimize a loss function is gradient descent:

$$W \leftarrow W - \eta\, \nabla_W L$$

The step size $\eta$ is also called the learning rate.
There are two ways to compute the gradient: numerically and analytically. In practice, always use the analytic gradient, but check the implementation against the numerical gradient. This is called a gradient check.
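As a sketch of such a gradient check (the function names here are illustrative, not from the course code), one can compare a centered numerical difference against the analytic gradient:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        fp = f(x)               # f(x + h) at this coordinate
        x[idx] = old - h
        fm = f(x)               # f(x - h) at this coordinate
        x[idx] = old            # restore
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# Toy check: f(x) = sum(x**2), whose analytic gradient is 2*x.
x = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(lambda v: np.sum(v ** 2), x)
ana = 2 * x
rel_error = np.max(np.abs(num - ana) /
                   np.maximum(1e-8, np.abs(num) + np.abs(ana)))
```

A small relative error (e.g. below 1e-6) indicates the analytic gradient is implemented correctly.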
Chain rule:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial x}$$
Mini-batch Stochastic Gradient Descent - to speed up training, use only a small random portion of the training set to compute each gradient step.
Common mini-batch sizes are 32/64/128 examples.
loop:
Sample a batch of data (randomly)
Forward prop it through the graph, get loss
Backprop to calculate the gradient
Update the parameters using the gradient
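The loop above might be sketched as follows. The loss here is a placeholder (mean of squared scores) so the example stays self-contained; all names are illustrative, not the course's actual API:

```python
import numpy as np

def loss_and_gradient(W, X_batch, y_batch):
    """Placeholder loss: mean squared scores, with its analytic gradient."""
    scores = X_batch.dot(W)
    loss = np.mean(scores ** 2)
    dW = 2.0 * X_batch.T.dot(scores) / scores.size
    return loss, dW

rng = np.random.default_rng(0)
X_train = rng.standard_normal((512, 10))
y_train = rng.integers(0, 3, size=512)
W = rng.standard_normal((10, 3))
learning_rate = 1e-2
batch_size = 64

losses = []
for step in range(200):
    # Sample a batch of data (randomly)
    idx = rng.choice(X_train.shape[0], size=batch_size, replace=False)
    # Forward prop it through the graph, get loss; backprop the gradient
    loss, dW = loss_and_gradient(W, X_train[idx], y_train[idx])
    losses.append(loss)
    # Update the parameters using the gradient (step size = learning rate)
    W -= learning_rate * dW
```

The loss should trend downward over the iterations, even though each step sees only one mini-batch.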
A linear classifier that uses the Multiclass SVM loss function.
Score function:

$$S = XW, \qquad s_{ij} = x_i \cdot w_j$$

Below we use $S$ to denote the score matrix $XW$. And we ignore the regularization loss.

Loss function (Multiclass SVM):

$$L_i = \sum_{j \ne y_i} \max\left(0,\; s_{ij} - s_{i y_i} + \Delta\right)$$

Gradient (two parts):

$$\nabla_{w_j} L_i = \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right) x_i \qquad (j \ne y_i)$$

$$\nabla_{w_{y_i}} L_i = -\left(\sum_{j \ne y_i} \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right)\right) x_i$$

You should combine these two parts together to get the correct gradient, and average over the training examples (plus the regularization gradient, which we ignore here).
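A straightforward, non-vectorized sketch of the SVM loss and these two gradient parts (function and variable names here are chosen for illustration):

```python
import numpy as np

def svm_loss_naive(W, X, y, delta=1.0):
    """SVM loss and gradient with explicit loops.

    W is (D, C), X is (N, D), y is (N,) with integer class labels.
    """
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(num_train):
        scores = X[i].dot(W)
        correct = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct + delta
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]        # part for j != y_i
                dW[:, y[i]] -= X[i]     # part for the correct class
    return loss / num_train, dW / num_train
```

This loop version is slow but easy to verify; the vectorized version discussed later should produce the same numbers.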
A linear classifier that uses the Softmax + Cross-entropy loss function.
Score function:

$$S = XW$$

We denote the SVM scores as $s$, and the Softmax scores as $p$, where

$$p_{ik} = \frac{e^{s_{ik}}}{\sum_j e^{s_{ij}}}$$

Loss function:

$$L_i = -\log p_{i y_i}$$

Gradient (two parts):

$$\nabla_{w_{y_i}} L_i = \left(p_{i y_i} - 1\right) x_i, \qquad \nabla_{w_k} L_i = p_{ik}\, x_i \quad (k \ne y_i)$$

You should combine the two parts together to get the correct gradient, i.e. $\nabla_{w_k} L_i = \left(p_{ik} - \mathbb{1}(k = y_i)\right) x_i$, and average over the training examples.
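In terms of the scores, the combined gradient is simply the probability matrix with 1 subtracted at the correct classes. A minimal sketch (shapes and random inputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 5, 3
S = rng.standard_normal((N, C))     # class scores, one row per example
y = rng.integers(0, C, size=N)      # correct labels

# Softmax probabilities (scores shifted by the row max for stability)
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

# Combined gradient with respect to the scores: dL/dS = P - one_hot(y),
# averaged over the batch. The weight gradient then follows as X.T.dot(dS).
dS = P.copy()
dS[np.arange(N), y] -= 1.0
dS /= N
```

A quick sanity check: each row of `P` sums to 1, so each row of `dS` sums to 0.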
Numeric stability:
To avoid overflow while calculating the Softmax scores, we often subtract a constant $C$ from all scores, and the Softmax scores remain unchanged, since

$$\frac{e^{s_k - C}}{\sum_j e^{s_j - C}} = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

A common choice is $C = \max_j s_j$, which makes the largest exponent 0.
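A minimal sketch of the stable computation, using made-up score values large enough that the naive version would overflow:

```python
import numpy as np

scores = np.array([123.0, 456.0, 789.0])   # np.exp(789.0) overflows to inf

# Shift by the maximum score so the largest exponent is exp(0) = 1.
shifted = scores - np.max(scores)
p = np.exp(shifted) / np.sum(np.exp(shifted))
```

The resulting probabilities are finite, sum to 1, and are identical to what the unshifted formula would give in exact arithmetic.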
A 2-layer neural network that uses the Softmax loss.
Score function:

$$S = \max\left(0,\; X W_1\right) W_2$$

Loss function: (Softmax)

$$L_i = -\log p_{i y_i}$$

Gradient:
We already know how to compute the Softmax gradient. That is, we already know $\frac{\partial L}{\partial S}$. Then, writing $X_1 = X W_1$ and $X_2 = \max(0, X_1)$ so that $S = X_2 W_2$, the chain rule gives

$$\frac{\partial L}{\partial W_2} = X_2^T \frac{\partial L}{\partial S}, \qquad \frac{\partial L}{\partial X_1} = \left(\frac{\partial L}{\partial S}\, W_2^T\right) \odot \mathbb{1}(X_1 > 0), \qquad \frac{\partial L}{\partial W_1} = X^T \frac{\partial L}{\partial X_1}$$

If we denote $\frac{\partial L}{\partial X}$ by dX for any variable $X$, then the code could be written as:
```python
# numpy code
dW2 = X2.T.dot(dS)
dX1 = dS.dot(W2.T) * (X1 > 0)
dW1 = X.T.dot(dX1)
```
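These backprop lines assume a forward pass like the following sketch (the shapes and random inputs are illustrative, not from the course assignment):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 6))     # input batch: 4 examples, 6 features
W1 = rng.standard_normal((6, 8))    # first-layer weights
W2 = rng.standard_normal((8, 3))    # second-layer weights
y = rng.integers(0, 3, size=4)      # correct labels

X1 = X.dot(W1)                      # hidden pre-activation
X2 = np.maximum(0, X1)              # ReLU activation
S = X2.dot(W2)                      # class scores

# dS is the Softmax gradient w.r.t. the scores: (P - one_hot(y)) / N
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
dS = P.copy()
dS[np.arange(4), y] -= 1.0
dS /= 4
```

With `X1`, `X2`, and `dS` in hand, the three backprop lines above produce `dW1` and `dW2` directly.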
How do you write a really complex formula in vector form? For example,

$$\nabla_{w_j} L = \frac{1}{N} \sum_{i} \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right) x_i \qquad (j \ne y_i)$$

This is one of the two parts of the gradient of the SVM loss function. It's easy to write a loop to do the calculation, but vectorization really boosts the speed of your program. It took me quite a while to figure out the vectorization, but in fact it's not that difficult. Notice that the condition in the formula depends only on the two indexes $i$ and $j$, and its value is either True or False. So we can collect it into a matrix:

$$B_{ij} = \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right)$$
In numpy this is written as
```python
# numpy code
margins = scores - correct_class_score + delta
for i in range(num_train):
    margins[i, y[i]] = 0
B = (margins > 0)
```
Then you just multiply $B$ with $X$:

$$\nabla_W L \mathrel{+}= \frac{1}{N}\, X^T B$$

So

```python
# numpy code
dW += np.dot(X.T, B) / num_train
```
The second part of the gradient of the SVM loss function:

$$\nabla_{w_{y_i}} L_i = -\left(\sum_{j \ne y_i} \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right)\right) x_i = -M_i\, x_i$$

The count $M_i$ of positive margins for example $i$ is

```python
# numpy code
M = np.sum(margins > 0, axis=1)
```

Then, summing over $i$, we have, for each class $k$,

$$\nabla_{w_k} L \mathrel{+}= -\frac{1}{N} \sum_{i:\, y_i = k} M_i\, x_i$$
How to vectorize this formula?
We define another matrix $B$, with $B_{ki} = \mathbb{1}(k = y_i)$, to handle it.

```python
# numpy code
B = (np.arange(num_classes).reshape(num_classes, 1) == y)
```

Having this matrix $B$, we can write the formula as a simple matrix multiplication, $\nabla_W L \mathrel{+}= -\frac{1}{N}\, X^T \operatorname{diag}(M)\, B^T$:

```python
# numpy code
dW += -np.dot(X.T * M, B.T) / num_train
```
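Putting the two parts together, a fully vectorized sketch of the whole computation (the function name is mine; the variable names and shapes follow the code above, with X of shape (num_train, D) and W of shape (D, num_classes)):

```python
import numpy as np

def svm_loss_vectorized(W, X, y, delta=1.0):
    """Vectorized SVM loss and gradient, combining both parts."""
    num_train = X.shape[0]
    num_classes = W.shape[1]

    scores = X.dot(W)                                        # (N, C)
    correct_class_score = scores[np.arange(num_train), y][:, None]
    margins = scores - correct_class_score + delta
    margins[np.arange(num_train), y] = 0
    loss = np.sum(np.maximum(0, margins)) / num_train

    # Part 1: indicator matrix of positive margins
    B = (margins > 0)
    dW = np.dot(X.T, B) / num_train
    # Part 2: subtract M_i * x_i from the correct-class column
    M = np.sum(B, axis=1)
    Byi = (np.arange(num_classes).reshape(num_classes, 1) == y)
    dW += -np.dot(X.T * M, Byi.T) / num_train
    return loss, dW
```

This should agree with a looped implementation; checking the two against each other on a small input is a good sanity test.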
I trained those models on CIFAR-10 and tried to find the best hyperparameters. The accuracies on the validation set are:
| model | accuracy |
|---|---|
| knn | 0.33 |
| SVM | 0.40 |
| Softmax | 0.40 |
| 2-layer network | 0.53 |
| features | 0.59 |
The ``features'' model is a 2-layer network classifier trained on Histogram of Oriented Gradients (HOG) features together with a color histogram over the hue channel of the HSV color space, computed from the CIFAR-10 images.