@MitoY
2016-07-17T14:35:03.000000Z
These are notes on the first four lectures of CS231n, by Jing Lei.
The general method to minimize a loss function is gradient descent:

$$W \leftarrow W - \eta\, \nabla_W L$$

The step size $\eta$ is also called the learning rate.
There are two ways to compute the gradient: numerically and analytically. In practice, always use the analytic gradient, but check the implementation against the numerical gradient. This is called a gradient check.
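As a sketch of such a gradient check (the function names here are illustrative, not from the course code), one can compare a centered numerical difference against the analytic gradient:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        fp = f(x)               # f(x + h) at this coordinate
        x[idx] = old - h
        fm = f(x)               # f(x - h) at this coordinate
        x[idx] = old            # restore
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# Toy check: f(x) = sum(x**2), whose analytic gradient is 2*x.
x = np.array([1.0, -2.0, 3.0])
num = numerical_gradient(lambda v: np.sum(v ** 2), x)
ana = 2 * x
rel_error = np.max(np.abs(num - ana) /
                   np.maximum(1e-8, np.abs(num) + np.abs(ana)))
```

A small relative error (e.g. below 1e-6) indicates the analytic gradient is implemented correctly.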
Chain rule:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial q} \cdot \frac{\partial q}{\partial x}$$
Mini-batch Stochastic Gradient Descent - to speed up training, use only a small random portion of the training set to compute each gradient step.
Common mini-batch sizes are 32/64/128 examples.
loop:
Sample a batch of data (randomly)
Forward prop it through the graph, get loss
Backprop to calculate the gradient
Update the parameters using the gradient
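The loop above might be sketched as follows. The loss here is a placeholder (mean of squared scores) so the example stays self-contained; all names are illustrative, not the course's actual API:

```python
import numpy as np

def loss_and_gradient(W, X_batch, y_batch):
    """Placeholder loss: mean squared scores, with its analytic gradient."""
    scores = X_batch.dot(W)
    loss = np.mean(scores ** 2)
    dW = 2.0 * X_batch.T.dot(scores) / scores.size
    return loss, dW

rng = np.random.default_rng(0)
X_train = rng.standard_normal((512, 10))
y_train = rng.integers(0, 3, size=512)
W = rng.standard_normal((10, 3))
learning_rate = 1e-2
batch_size = 64

losses = []
for step in range(200):
    # Sample a batch of data (randomly)
    idx = rng.choice(X_train.shape[0], size=batch_size, replace=False)
    # Forward prop it through the graph, get loss; backprop the gradient
    loss, dW = loss_and_gradient(W, X_train[idx], y_train[idx])
    losses.append(loss)
    # Update the parameters using the gradient (step size = learning rate)
    W -= learning_rate * dW
```

The loss should trend downward over the iterations, even though each step sees only one mini-batch.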
A linear classifier that uses the Multiclass SVM loss function.
Score function:

$$S = XW, \qquad s_{ij} = x_i \cdot w_j$$

Below we use $S$ to denote the score matrix $XW$. And we ignore the regularization loss.

Loss function (Multiclass SVM):

$$L_i = \sum_{j \ne y_i} \max\left(0,\; s_{ij} - s_{i y_i} + \Delta\right)$$

Gradient (two parts):

$$\nabla_{w_j} L_i = \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right) x_i \qquad (j \ne y_i)$$

$$\nabla_{w_{y_i}} L_i = -\left(\sum_{j \ne y_i} \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right)\right) x_i$$

You should combine these two parts together to get the correct gradient, and average over the training examples (plus the regularization gradient, which we ignore here).
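A straightforward, non-vectorized sketch of the SVM loss and these two gradient parts (function and variable names here are chosen for illustration):

```python
import numpy as np

def svm_loss_naive(W, X, y, delta=1.0):
    """SVM loss and gradient with explicit loops.

    W is (D, C), X is (N, D), y is (N,) with integer class labels.
    """
    num_train, num_classes = X.shape[0], W.shape[1]
    loss = 0.0
    dW = np.zeros_like(W)
    for i in range(num_train):
        scores = X[i].dot(W)
        correct = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct + delta
            if margin > 0:
                loss += margin
                dW[:, j] += X[i]        # part for j != y_i
                dW[:, y[i]] -= X[i]     # part for the correct class
    return loss / num_train, dW / num_train
```

This loop version is slow but easy to verify; the vectorized version discussed later should produce the same numbers.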
A linear classifier that uses the Softmax + Cross-entropy loss function.
Score function:

$$S = XW$$

We denote the SVM scores as $s$, and the Softmax scores as $p$, where

$$p_{ik} = \frac{e^{s_{ik}}}{\sum_j e^{s_{ij}}}$$

Loss function:

$$L_i = -\log p_{i y_i}$$

Gradient (two parts):

$$\nabla_{w_{y_i}} L_i = \left(p_{i y_i} - 1\right) x_i, \qquad \nabla_{w_k} L_i = p_{ik}\, x_i \quad (k \ne y_i)$$

You should combine the two parts together to get the correct gradient, i.e. $\nabla_{w_k} L_i = \left(p_{ik} - \mathbb{1}(k = y_i)\right) x_i$, and average over the training examples.
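In terms of the scores, the combined gradient is simply the probability matrix with 1 subtracted at the correct classes. A minimal sketch (shapes and random inputs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 5, 3
S = rng.standard_normal((N, C))     # class scores, one row per example
y = rng.integers(0, C, size=N)      # correct labels

# Softmax probabilities (scores shifted by the row max for stability)
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

# Combined gradient with respect to the scores: dL/dS = P - one_hot(y),
# averaged over the batch. The weight gradient then follows as X.T.dot(dS).
dS = P.copy()
dS[np.arange(N), y] -= 1.0
dS /= N
```

A quick sanity check: each row of `P` sums to 1, so each row of `dS` sums to 0.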
Numeric stability:
To avoid overflow while calculating the Softmax scores, we often subtract a constant $C$ from all scores, and the Softmax scores remain unchanged, since

$$\frac{e^{s_k - C}}{\sum_j e^{s_j - C}} = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

A common choice is $C = \max_j s_j$, which makes the largest exponent 0.
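A minimal sketch of the stable computation, using made-up score values large enough that the naive version would overflow:

```python
import numpy as np

scores = np.array([123.0, 456.0, 789.0])   # np.exp(789.0) overflows to inf

# Shift by the maximum score so the largest exponent is exp(0) = 1.
shifted = scores - np.max(scores)
p = np.exp(shifted) / np.sum(np.exp(shifted))
```

The resulting probabilities are finite, sum to 1, and are identical to what the unshifted formula would give in exact arithmetic.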
A 2-layer neural network that uses the Softmax loss.
Score function:

$$S = \max\left(0,\; X W_1\right) W_2$$

Loss function: (Softmax)

$$L_i = -\log p_{i y_i}$$

Gradient:
We already know how to compute the Softmax gradient. That is, we already know $\frac{\partial L}{\partial S}$. Then, writing $X_1 = X W_1$ and $X_2 = \max(0, X_1)$ so that $S = X_2 W_2$, the chain rule gives

$$\frac{\partial L}{\partial W_2} = X_2^T \frac{\partial L}{\partial S}, \qquad \frac{\partial L}{\partial X_1} = \left(\frac{\partial L}{\partial S}\, W_2^T\right) \odot \mathbb{1}(X_1 > 0), \qquad \frac{\partial L}{\partial W_1} = X^T \frac{\partial L}{\partial X_1}$$

If we denote $\frac{\partial L}{\partial X}$ by dX for any variable $X$, then the code could be written as:
```python
# numpy code
dW2 = X2.T.dot(dS)
dX1 = dS.dot(W2.T) * (X1 > 0)
dW1 = X.T.dot(dX1)
```
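These backprop lines assume a forward pass like the following sketch (the shapes and random inputs are illustrative, not from the course assignment):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 6))     # input batch: 4 examples, 6 features
W1 = rng.standard_normal((6, 8))    # first-layer weights
W2 = rng.standard_normal((8, 3))    # second-layer weights
y = rng.integers(0, 3, size=4)      # correct labels

X1 = X.dot(W1)                      # hidden pre-activation
X2 = np.maximum(0, X1)              # ReLU activation
S = X2.dot(W2)                      # class scores

# dS is the Softmax gradient w.r.t. the scores: (P - one_hot(y)) / N
P = np.exp(S - S.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
dS = P.copy()
dS[np.arange(4), y] -= 1.0
dS /= 4
```

With `X1`, `X2`, and `dS` in hand, the three backprop lines above produce `dW1` and `dW2` directly.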
How do you write a really complex formula in vector form? For example,

$$\nabla_{w_j} L = \frac{1}{N} \sum_{i} \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right) x_i \qquad (j \ne y_i)$$

This is one of the two parts of the gradient of the SVM loss function. It's easy to write a loop to do the calculation, but vectorization really boosts the speed of your program. It took me quite a while to figure out the vectorization, but in fact it's not that difficult. Notice that the condition in the formula depends only on the two indexes $i$ and $j$, and its value is either True or False. So we can collect it into a matrix:

$$B_{ij} = \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right)$$
In numpy this is written as
```python
# numpy code
margins = scores - correct_class_score + delta
for i in range(num_train):
    margins[i, y[i]] = 0
B = (margins > 0)
```
Then you just multiply $B$ with $X$:

$$\nabla_W L \mathrel{+}= \frac{1}{N}\, X^T B$$

So

```python
# numpy code
dW += np.dot(X.T, B) / num_train
```
The second part of the gradient of the SVM loss function:

$$\nabla_{w_{y_i}} L_i = -\left(\sum_{j \ne y_i} \mathbb{1}\left(s_{ij} - s_{i y_i} + \Delta > 0\right)\right) x_i = -M_i\, x_i$$

The count $M_i$ of positive margins for example $i$ is

```python
# numpy code
M = np.sum(margins > 0, axis=1)
```

Then, summing over $i$, we have, for each class $k$,

$$\nabla_{w_k} L \mathrel{+}= -\frac{1}{N} \sum_{i:\, y_i = k} M_i\, x_i$$
How to vectorize this formula?
We define another matrix $B$, with $B_{ki} = \mathbb{1}(k = y_i)$, to handle it.

```python
# numpy code
B = (np.arange(num_classes).reshape(num_classes, 1) == y)
```

Having this matrix $B$, we can write the formula as a simple matrix multiplication, $\nabla_W L \mathrel{+}= -\frac{1}{N}\, X^T \operatorname{diag}(M)\, B^T$:

```python
# numpy code
dW += -np.dot(X.T * M, B.T) / num_train
```
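Putting the two parts together, a fully vectorized sketch of the whole computation (the function name is mine; the variable names and shapes follow the code above, with X of shape (num_train, D) and W of shape (D, num_classes)):

```python
import numpy as np

def svm_loss_vectorized(W, X, y, delta=1.0):
    """Vectorized SVM loss and gradient, combining both parts."""
    num_train = X.shape[0]
    num_classes = W.shape[1]

    scores = X.dot(W)                                        # (N, C)
    correct_class_score = scores[np.arange(num_train), y][:, None]
    margins = scores - correct_class_score + delta
    margins[np.arange(num_train), y] = 0
    loss = np.sum(np.maximum(0, margins)) / num_train

    # Part 1: indicator matrix of positive margins
    B = (margins > 0)
    dW = np.dot(X.T, B) / num_train
    # Part 2: subtract M_i * x_i from the correct-class column
    M = np.sum(B, axis=1)
    Byi = (np.arange(num_classes).reshape(num_classes, 1) == y)
    dW += -np.dot(X.T * M, Byi.T) / num_train
    return loss, dW
```

This should agree with a looped implementation; checking the two against each other on a small input is a good sanity test.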
I trained those models on CIFAR-10 and tried to find the best hyperparameters. The accuracies on the validation set are:
| model | accuracy |
|---|---|
| knn | 0.33 |
| SVM | 0.40 |
| Softmax | 0.40 |
| 2-layer network | 0.53 |
| features | 0.59 |
The ``features'' model is a 2-layer network classifier trained on Histogram of Oriented Gradients (HOG) features together with a color histogram over the hue channel of the HSV color space, computed from the CIFAR-10 images.