@nanmeng · 2016-04-29

CS231n Notes -- Linear Classification


Regularization Penalty $R(W)$

The most common choice is the elementwise L2 penalty: $R(W) = \sum_k \sum_l W_{k,l}^2$.

Notice that the regularization function is not a function of the data; it is based only on the weights.
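
As a quick illustration, here is a minimal numpy sketch of the L2 penalty; the weight shape and the regularization strength below are made-up values, not from the notes:

```python
import numpy as np

# Hypothetical weight matrix: 10 classes x 3073 dimensions (CIFAR-10-like, bias folded in).
W = np.random.randn(10, 3073) * 0.001

# Elementwise L2 regularization penalty: depends only on W, never on the data.
reg_penalty = np.sum(W * W)

# The full loss adds lambda * reg_penalty to the data loss,
# where lambda (the regularization strength) is a hyperparameter.
lam = 0.1  # assumed value, for illustration only
full_penalty_term = lam * reg_penalty
```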

Important points about SVM

For binary SVMs, the expectation of the probability of committing an error on a test example is bounded by the ratio of the expectation of the number of training points that are support vectors to the number of examples in the training set.

This bound also holds in the multi-class case for the voting scheme methods (one-against-rest and one-against-one) and for our multi-class support vector method.

Softmax classifier

The Softmax classifier is the generalization of the binary Logistic Regression classifier to multiple classes.
SVM: treats the outputs as (uncalibrated and possibly difficult to interpret) scores for each class
loss: hinge loss

Softmax: gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation.
loss: cross-entropy loss

$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$

or equivalently

$$L_i = -f_{y_i} + \log\sum_j e^{f_j}$$

softmax function: $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$, which squashes a vector of arbitrary real-valued scores into values between zero and one that sum to one.
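
To make the two losses concrete, here is a minimal numpy sketch of the per-example hinge (SVM) loss and cross-entropy (Softmax) loss; the score vector and the correct-class index are made-up values:

```python
import numpy as np

f = np.array([3.2, 5.1, -1.7])   # made-up class scores for one example
y = 0                            # assumed index of the correct class

# Multiclass SVM (hinge) loss with margin 1:
# sum over the wrong classes of max(0, f_j - f_y + 1)
margins = np.maximum(0, f - f[y] + 1.0)
margins[y] = 0                   # the correct class does not contribute
svm_loss = np.sum(margins)

# Softmax (cross-entropy) loss, using the equivalent form above:
# L_i = -f_y + log(sum_j exp(f_j))
softmax_loss = -f[y] + np.log(np.sum(np.exp(f)))
```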

1. Information theory view.

The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ($q = e^{f_{y_i}} / \sum_j e^{f_j}$, as seen above) and the "true" distribution, which here puts all of its mass on the correct class (i.e. $p = [0, \dots, 1, \dots, 0]$, with the single 1 at the $y_i$-th position).
Notice that the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as

$$H(p, q) = H(p) + D_{KL}(p \| q)$$

Since the entropy of the delta distribution $p$ is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance between distributions).

2. Probabilistic interpretation

$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}}$$

can be interpreted as the (normalized) probability assigned to the correct label $y_i$ given the image $x_i$ and parameterized by $W$.
To see this, remember that the Softmax classifier interprets the scores inside the output vector $f$ as unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one.

In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE).

Practical note: numeric stability

The intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials, and dividing large numbers can be numerically unstable. If we multiply the top and bottom of the fraction by a constant $C$ and push it into the sum, we get the following (mathematically equivalent) expression:

$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$$

We are free to choose the value of $C$. This will not change any of the results, but we can use this value to improve the numerical stability of the computation.

  • A common choice for $C$ is to set $\log C = -\max_j f_j$.
    This simply states that we should shift the values inside the vector $f$ so that the highest value is zero.
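
A minimal numpy sketch of this trick (the score values are made up to force the overflow):

```python
import numpy as np

f = np.array([123.0, 456.0, 789.0])   # three classes with very large scores
# p = np.exp(f) / np.sum(np.exp(f))   # Bad: np.exp(789) overflows to inf, giving nan

# Shift the scores so the highest value is zero, i.e. log C = -max_j f_j:
f = f - np.max(f)                     # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f))     # safe: mathematically the same result
```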

Comparison of SVM vs. Softmax

The difference in the interpretation of the scores in $f$:
SVM: interprets these as class scores, and its loss function encourages the correct class to have a score higher by a margin than the other class scores.
Softmax: The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low).

The difference in the meanings:
SVM: computes uncalibrated scores for all classes that are not easy to interpret.
Softmax: allows us to compute "probabilities" for all labels.
The reason we put the word "probabilities" in quotes, however, is that how peaky or diffuse these probabilities are depends directly on the regularization strength $\lambda$:
e.g. suppose the unnormalized log probabilities for three classes come out to be $[1, -2, 0]$. The softmax function would then compute:

$$[1, -2, 0] \rightarrow [e^{1}, e^{-2}, e^{0}] = [2.71, 0.14, 1] \rightarrow [0.7, 0.04, 0.26]$$

If the regularization strength $\lambda$ were higher, the weights $W$ would be penalized more, leading to smaller weights. Suppose every value became half of the original, $[0.5, -1, 0]$. The softmax would now compute:

$$[0.5, -1, 0] \rightarrow [e^{0.5}, e^{-1}, e^{0}] = [1.65, 0.37, 1] \rightarrow [0.55, 0.12, 0.33]$$

The probabilities are now more diffuse. Moreover, in the limit where the weights go toward tiny numbers due to a very strong regularization strength $\lambda$, the output probabilities would be near uniform.
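
The numbers above can be reproduced with a small numpy sketch (the softmax helper below is my own, not code from the notes):

```python
import numpy as np

def softmax(f):
    f = f - np.max(f)                       # shift for numeric stability
    return np.exp(f) / np.sum(np.exp(f))

print(softmax(np.array([1.0, -2.0, 0.0])))  # ~[0.70, 0.04, 0.26] -- peakier
print(softmax(np.array([0.5, -1.0, 0.0])))  # ~[0.55, 0.12, 0.33] -- more diffuse
```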

Compared to the Softmax classifier, the SVM is a more local objective. Consider an example that achieves the scores [10, -100, -100] or [10, 9, 9], where the first class is correct.
The SVM does not care about the details of the individual scores: in both cases the margin of 1 is satisfied, and hence the loss is zero.
The Softmax classifier, however, would accumulate a much higher loss for the scores [10, 9, 9] than for [10, -100, -100]. In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes a lower probability, and the loss would always get better.
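
A small numpy sketch to verify the claim; the loss helpers are illustrative, with the margin fixed to 1 and class 0 taken as the correct class:

```python
import numpy as np

def hinge_loss(f, y, delta=1.0):
    margins = np.maximum(0, f - f[y] + delta)
    margins[y] = 0
    return np.sum(margins)

def cross_entropy_loss(f, y):
    f = f - np.max(f)                        # numeric stability
    return -f[y] + np.log(np.sum(np.exp(f)))

for scores in ([10.0, -100.0, -100.0], [10.0, 9.0, 9.0]):
    f = np.array(scores)
    print(scores, hinge_loss(f, 0), cross_entropy_loss(f, 0))

# The hinge loss is 0.0 in both cases (the margin of 1 is satisfied),
# while the cross-entropy loss is ~0 for [10, -100, -100] but ~0.55 for [10, 9, 9].
```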

Summary

The SVM and Softmax classifiers are both parametric approaches:
Unlike kNN classifier, the advantage of this parametric approach is that once we learn the parameters we can discard the training data.
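
A sketch of what this means in practice (shapes and values below are assumptions, not from the notes): once $W$ is learned, predicting a new example is a single matrix-vector multiply, and the training set is no longer needed.

```python
import numpy as np

W = np.random.randn(10, 3073) * 0.001   # stand-in for the learned weights (10 classes)

x_test = np.random.randn(3073)          # a new example, with the bias dimension appended
scores = W.dot(x_test)                  # class scores from the parameters alone
prediction = np.argmax(scores)          # predicted label -- no training data involved
```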

Equivalent jargon

hinge loss == max-margin loss

Relevant Readings
