@nanmeng
2016-04-29T15:53:16.000000Z
notes
Notice that the regularization function is not a function of the data; it depends only on the weights.
For binary SVMs, the expected probability of committing an error on a test example is bounded by the ratio of the expected number of training points that are support vectors to the number of examples in the training set.
This bound also holds in the multi-class case for the voting scheme methods (one-against-rest and one-against-one) and for our multi-class support vector method.
The Softmax classifier is the generalization of the ''Binary Logistic Regression classifier'' to multiple classes.
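Since Softmax generalizes binary logistic regression, the two-class case should collapse back to the sigmoid. A minimal NumPy sketch of this check (the function names and score value are mine, for illustration):

```python
import numpy as np

def sigmoid(z):
    """Binary logistic regression squashing function."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Normalize a vector of scores into class probabilities."""
    exp_s = np.exp(scores - np.max(scores))  # shift for numeric stability
    return exp_s / np.sum(exp_s)

# With two classes and scores [z, 0], the softmax probability of class 0
# is e^z / (e^z + 1) = 1 / (1 + e^-z), i.e. exactly the sigmoid of z:
z = 1.7
two_class = softmax(np.array([z, 0.0]))[0]
```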
SVM: treats the outputs as (uncalibrated and possibly difficult to interpret) scores for each class
loss: hinge loss
Softmax: gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation.
loss: cross-entropy loss $L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$
or equivalently $L_i = -f_{y_i} + \log\sum_j e^{f_j}$
softmax function: $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$
The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as: $H(p, q) = -\sum_x p(x) \log q(x)$
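Putting the two definitions together: the softmax loss for one example is the cross-entropy between the softmax output and a one-hot "true" distribution that puts all mass on the correct class. A minimal NumPy sketch (the score values are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Turn unnormalized log probabilities into class probabilities."""
    exp_s = np.exp(scores - np.max(scores))  # shift for numeric stability
    return exp_s / np.sum(exp_s)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) between true distribution p and estimate q."""
    return -np.sum(p * np.log(q))

scores = np.array([2.0, 1.0, 0.1])  # hypothetical class scores f
p_true = np.array([1.0, 0.0, 0.0])  # one-hot "true" distribution, correct class 0

q = softmax(scores)
loss = cross_entropy(p_true, q)     # equals -log q[0], the softmax loss L_i
```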
To see this, remember that the Softmax classifier interprets the scores inside the output vector as unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one.
In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class (just like performing Maximum Likelihood Estimation (MLE)).
The intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable. If we multiply the top and bottom of the fraction by a constant $C$ and push it into the sum, we get the following (mathematically equivalent) expression: $\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$
- A common choice for $C$: set $\log C = -\max_j f_j$.
This simply states that we should shift the values inside the vector $f$ so that the highest value is zero.
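A minimal NumPy sketch of this shift trick (the scores are made up, chosen large enough that the naive version would overflow):

```python
import numpy as np

f = np.array([123.0, 456.0, 789.0])  # example scores with large magnitudes
# naive: np.exp(f) overflows to inf, so exp(f) / sum(exp(f)) yields nan

f = f - np.max(f)                    # shift so the highest value is zero
p = np.exp(f) / np.sum(np.exp(f))    # safe: every exponent is <= 0
```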
The difference in the interpretation of the scores in $f$:
SVM: interprets these as class scores and its loss function encourages the correct class (class 2, in blue) to have a score higher by a margin than the other class scores.
Softmax: instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently, the negative of it to be low).
The difference in the meaning of the outputs:
SVM: computes uncalibrated scores for all classes that are not easy to interpret.
Softmax: allows us to compute ''probabilities'' for all labels.
The reason we put the word "probabilities" in quotes, however, is that how peaky or diffuse these probabilities are depends directly on the regularization strength $\lambda$: as $\lambda$ grows, the weights shrink, the scores shrink with them, and the probabilities become more diffuse (closer to uniform).
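To make the peaky-vs-diffuse point concrete, here is a small NumPy sketch: shrinking the scores (as stronger regularization would, by shrinking the weights) makes the softmax output more uniform. The score values are illustrative:

```python
import numpy as np

def softmax(scores):
    exp_s = np.exp(scores - np.max(scores))  # shift for numeric stability
    return exp_s / np.sum(exp_s)

scores = np.array([1.0, -2.0, 0.0])  # hypothetical raw scores
peaky = softmax(scores)              # roughly [0.7, 0.04, 0.26]

# Stronger regularization shrinks the weights and hence the scores,
# say by half; the resulting probabilities are more diffuse:
diffuse = softmax(scores / 2)        # roughly [0.55, 0.12, 0.33]
```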
Compared to the Softmax classifier, the SVM has a more local objective:
The SVM does not care about the details of the individual scores: if they were instead [10, -100, -100] or [10, 9, 9] the SVM would be indifferent since the margin of 1 is satisfied and hence the loss is zero.
The Softmax classifier, in contrast, would accumulate a much higher loss for the scores [10, 9, 9] than for [10, -100, -100]. In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes a lower one, and the loss would always decrease.
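This contrast can be checked numerically. A sketch with single-example hinge and softmax losses, assuming a margin of 1 and correct class 0 as in the example above:

```python
import numpy as np

def hinge_loss(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for one example with correct class y."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                  # the correct class contributes no loss
    return np.sum(margins)

def softmax_loss(scores, y):
    """Cross-entropy (softmax) loss for one example with correct class y."""
    shifted = scores - np.max(scores) # shift for numeric stability
    return -shifted[y] + np.log(np.sum(np.exp(shifted)))

a = np.array([10.0, -100.0, -100.0])
b = np.array([10.0, 9.0, 9.0])

# The SVM is indifferent: the margin of 1 is satisfied in both cases,
# so hinge_loss(a, 0) == hinge_loss(b, 0) == 0.
# The Softmax loss is near zero for a but noticeably higher for b.
```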
SVM and Softmax are both parametric approaches:
Unlike the kNN classifier, the advantage of this parametric approach is that once we learn the parameters, we can discard the training data.
hinge loss == max-margin loss