@HaomingJiang
2017-10-05T05:35:45.000000Z
Booknotes
ESL
Variable types: quantitative or qualitative. Corresponding tasks: regression or classification. A third type is ordered categorical (e.g. low, mid, high).
Least Squares
In the linear model $\hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j$, $\hat\beta_0$ is the intercept or bias.
KNN
Problems in high-dimensional space:
1. Such neighborhoods are no longer “local.” To capture the same fraction of the observations, the neighborhood must span a much larger part of the range of each input as the dimension grows: in the unit hypercube, a sub-cube capturing a fraction r of the data has expected edge length $e_p(r) = r^{1/p}$.
2. Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample.
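Both effects can be checked numerically. The sketch below assumes data uniformly distributed in the unit hypercube (for the edge length) and in the unit ball centered at the origin (for the nearest-point distance); the function names are illustrative only.

```python
def edge_length(r: float, p: int) -> float:
    """Expected edge length of a sub-cube that captures a fraction r of
    data spread uniformly over the p-dimensional unit hypercube."""
    return r ** (1.0 / p)


def median_nearest_distance(p: int, n: int) -> float:
    """Median distance from the origin to the closest of n points drawn
    uniformly from the p-dimensional unit ball."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)


for p in (1, 2, 10, 100):
    print(p, round(edge_length(0.01, p), 3), round(median_nearest_distance(p, 500), 3))
```

For p = 10, capturing just 1% of the data already requires covering about 63% of the range of each input, and with 500 points the nearest one is typically more than halfway to the boundary.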
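e.g. cubic smoothing spline: $\mathrm{PRSS}(f;\lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2\,dx$, with roughness penalty $J(f) = \int [f''(x)]^2\,dx$.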
Can be cast in a Bayesian framework: the penalty J corresponds to a log-prior, and PRSS(f; λ) to the log-posterior distribution.
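To see the effect of a roughness penalty, here is a minimal sketch of a discrete analogue, not the cubic smoothing spline itself: on an equally spaced grid (an assumption), the integral of the squared second derivative is replaced by the sum of squared second differences, so the penalized criterion has a closed-form minimizer. The function name `penalized_smoother` is hypothetical.

```python
import numpy as np


def penalized_smoother(y, lam):
    """Minimize sum_i (y_i - f_i)^2 + lam * sum_i (f_{i+2} - 2 f_{i+1} + f_i)^2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # (n-2) x n second-difference operator: row i is e_{i+2} - 2 e_{i+1} + e_i
    D = np.diff(np.eye(n), n=2, axis=0)
    # Setting the gradient to zero gives (I + lam * D^T D) f = y
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
```

With `lam = 0` the fit interpolates the data; as `lam` grows, the second differences are forced toward zero and the fit approaches a straight line.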
These adaptively chosen basis function methods are also known as dictionary methods, where one has available a possibly infinite set or dictionary D of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.
The variables Xj can come from different sources:
- quantitative inputs;
- transformations of quantitative inputs, such as log, square-root or square;
- basis expansions, such as X2 = X1^2, leading to a polynomial representation;
- numeric or “dummy” coding of the levels of qualitative inputs;
- interactions between variables.
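The list above can be made concrete with a small design matrix. This is only a sketch with made-up toy data; the variable names (`x1`, `color`) are hypothetical.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])                  # quantitative input
color = np.array(["red", "blue", "red", "green"])    # qualitative input

levels = np.unique(color)                            # sorted levels: blue, green, red
dummies = (color[:, None] == levels[None, :]).astype(float)  # one column per level

X = np.column_stack([
    np.ones_like(x1),      # intercept column
    x1,                    # quantitative input as-is
    np.log(x1),            # transformation
    x1 ** 2,               # basis expansion (polynomial term)
    dummies[:, 1:],        # dummy coding, first level dropped as baseline
    x1 * dummies[:, 2],    # interaction between x1 and the "red" indicator
])
print(X.shape)
```

The model stays linear in the parameters no matter how the columns of `X` were derived.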
The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in X.
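A quick sketch of the redundant-coding case, using toy group labels that are assumptions for illustration: with an intercept plus one dummy column per level, the dummy columns sum to the intercept column, so X loses rank. Dropping one level restores full column rank.

```python
import numpy as np

group = np.array([0, 1, 2, 0, 1, 2, 0, 1])   # qualitative input with 3 levels
dummies = np.eye(3)[group]                    # 8 x 3 one-hot coding

# Redundant coding: intercept + all three dummy columns (4 columns, rank 3)
X_redundant = np.column_stack([np.ones(8), dummies])
rank_redundant = np.linalg.matrix_rank(X_redundant)

# Resolved coding: drop the first level and treat it as the baseline
X_full = np.column_stack([np.ones(8), dummies[:, 1:]])
rank_full = np.linalg.matrix_rank(X_full)

print(rank_redundant, X_redundant.shape[1], rank_full, X_full.shape[1])
```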
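Model significance: the F-statistic, $F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(N - p_1 - 1)}$, compares a smaller nested model (RSS_0, p_0 + 1 parameters) with a bigger one (RSS_1, p_1 + 1 parameters).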
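A minimal sketch of the F-statistic for two nested linear models. Here the design matrices are assumed to already contain the intercept column, so the column counts play the role of p_0 + 1 and p_1 + 1; the helper names are hypothetical.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def f_statistic(X_small, X_big, y):
    """F-statistic for dropping the extra columns of X_big.

    X_small must be nested in X_big; both include the intercept column.
    """
    n = len(y)
    k0, k1 = X_small.shape[1], X_big.shape[1]   # parameter counts
    rss0, rss1 = rss(X_small, y), rss(X_big, y)
    return ((rss0 - rss1) / (k1 - k0)) / (rss1 / (n - k1))
```

A large value suggests the extra predictors in the bigger model explain more than noise would; under the null it follows an F distribution with (k1 - k0, n - k1) degrees of freedom, given Gaussian errors.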
The Gauss–Markov Theorem: the least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates.
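If the inputs $x_1, \dots, x_p$ (the columns of X) are orthogonal, $\langle x_j, x_k \rangle = 0$ for $j \neq k$. Then $\hat\beta_j = \langle x_j, y \rangle / \langle x_j, x_j \rangle$, by looking at the normal equations $X^T X \hat\beta = X^T y$, where $X^T X$ is diagonal. This leads to Regression by Successive Orthogonalization: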
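z0 = x0 = 1;
For j = 1,2,...,p:
Regress xj on z0,...,z(j-1) to get coefficients gamma_lj = <z_l, xj> / <z_l, z_l> for l = 0,...,j-1, and the residual zj = xj - sum_l gamma_lj z_l
Regress y on zp to get beta_p = <zp, y> / <zp, zp>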
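If $x_p$ is highly correlated with the other $x_k$, the residual $z_p$ will be close to zero and $\hat\beta_p$ will be very unstable, since $\mathrm{Var}(\hat\beta_p) = \sigma^2 / \lVert z_p \rVert^2$.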
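The successive orthogonalization steps can be sketched as follows. This is a plain Gram-Schmidt implementation written for clarity rather than numerical robustness; it assumes the first column of `X` is the constant 1, and the function name is hypothetical.

```python
import numpy as np


def successive_orthogonalization(X, y):
    """X: N x (p+1) with a leading column of ones. Returns (Z, beta_p)."""
    X = np.asarray(X, dtype=float)
    Z = np.empty_like(X)
    Z[:, 0] = X[:, 0]                       # z0 = x0 = 1
    for j in range(1, X.shape[1]):
        z = X[:, j].copy()
        for l in range(j):
            # gamma_lj = <z_l, x_j> / <z_l, z_l>
            gamma = (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l])
            z -= gamma * Z[:, l]            # remove the part explained by z_l
        Z[:, j] = z                         # residual of x_j on z_0..z_{j-1}
    zp = Z[:, -1]
    beta_p = (zp @ y) / (zp @ zp)           # regress y on the last residual
    return Z, beta_p
```

The returned `beta_p` should agree with the last coefficient of the full multiple regression, and a tiny `zp @ zp` in the denominator is exactly what makes the estimate unstable when the last input is nearly collinear with the others.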
Prediction accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy.
Interpretation: With a large number of predictors, we often would like to determine a smaller subset that exhibits the strongest effects. In order to get the “big picture,” we are willing to sacrifice some of the small details.
Best-subset selection (BSS): made feasible for moderate p by the leaps and bounds procedure
Forward-stepwise selection (greedy)
starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9).
Compared to best-subset selection (BSS), which pays a price in variance for selecting the best subset of each size, forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.
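A minimal sketch of forward-stepwise selection. For simplicity it refits each candidate model from scratch with least squares instead of using the QR updating trick mentioned above, so it shows the search logic only; the function names are hypothetical.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def forward_stepwise(X, y):
    """X: N x p predictors (no intercept column).

    Returns a list of (added_index, rss_after_adding), in order of entry.
    """
    n, p = X.shape
    ones = np.ones((n, 1))
    active, remaining, path = [], list(range(p)), []
    while remaining:
        # Try each remaining predictor and keep the one that lowers RSS most
        scores = {j: rss(np.hstack([ones, X[:, active + [j]]]), y) for j in remaining}
        best = min(scores, key=scores.get)
        active.append(best)
        remaining.remove(best)
        path.append((best, scores[best]))
    return path
```

The path contains one nested model per size; which size to keep would typically be chosen by cross-validation or a criterion such as AIC.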
Backward-stepwise selection
starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest absolute Z-score.
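Backward-stepwise selection can be sketched along the same lines. The code below assumes N is larger than the number of parameters so that the full model can be fit, and computes Z-scores as coefficient estimates divided by their standard errors; the function names are hypothetical.

```python
import numpy as np


def z_scores(X, y):
    """Z-scores of the least squares coefficients; X includes the intercept."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)            # unbiased noise variance estimate
    return beta / np.sqrt(sigma2 * np.diag(xtx_inv))


def backward_stepwise(X, y):
    """X: N x p predictors (no intercept column). Returns indices in drop order."""
    n, p = X.shape
    active, dropped = list(range(p)), []
    while active:
        Xa = np.column_stack([np.ones(n), X[:, active]])
        z = z_scores(Xa, y)[1:]                 # skip the intercept
        worst = active[int(np.argmin(np.abs(z)))]
        active.remove(worst)
        dropped.append(worst)
    return dropped
```

Predictors dropped last are the ones the procedure considers most important.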
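Forward-stagewise regression starts like forward-stepwise regression, with an intercept equal to the mean of y and centered predictors whose coefficients are all initially 0. At each step the algorithm identifies the variable most correlated with the current residual, computes the simple linear regression coefficient of the residual on that variable, and adds it to the current coefficient for that variable.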
Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added to the model. As a consequence, forward stagewise can take many more than p steps to reach the least squares fit, and historically has been dismissed as being inefficient. It turns out that this “slow fitting” can pay dividends in high-dimensional problems.
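A minimal sketch of forward-stagewise regression as described above. The step count `n_steps` is an assumed stopping rule for illustration (in practice one would stop when no predictor is correlated with the residual, or earlier for regularization), and the function name is hypothetical.

```python
import numpy as np


def forward_stagewise(X, y, n_steps=500):
    """X: N x p predictors (no intercept column). Returns (intercept, beta)."""
    Xc = X - X.mean(axis=0)                 # centered predictors
    beta = np.zeros(X.shape[1])
    resid = y - y.mean()                    # start from the intercept-only fit
    norms = (Xc ** 2).sum(axis=0)
    for _ in range(n_steps):
        inner = Xc.T @ resid
        # Variable most correlated (in absolute value) with the current residual
        j = int(np.argmax(np.abs(inner) / np.sqrt(norms)))
        delta = inner[j] / norms[j]         # simple regression of residual on x_j
        beta[j] += delta                    # only this coefficient is updated
        resid = resid - delta * Xc[:, j]
    intercept = y.mean() - X.mean(axis=0) @ beta   # intercept on the original scale
    return intercept, beta
```

Because the other coefficients are never readjusted, the same variable can be picked many times, which is why reaching the least squares fit can take far more than p steps.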
Shrinkage methods are more continuous, and don’t suffer as much from high variability.
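Ridge regression: just add an L2 penalty $\lambda \sum_{j=1}^{p} \beta_j^2$ to the residual sum of squares.
This is called weight decay in neural networks.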
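A minimal sketch of the penalized fit in closed form, assuming the intercept is left unpenalized (handled here by centering the inputs and the response). The function name `ridge` is hypothetical.

```python
import numpy as np


def ridge(X, y, lam):
    """Minimize ||y - b0 - X b||^2 + lam * ||b||^2. Returns (b0, b)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering removes the intercept
    p = X.shape[1]
    # Closed form: (Xc^T Xc + lam * I) b = Xc^T yc
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta
```

With `lam = 0` this reduces to least squares, and the coefficient norm shrinks as `lam` grows. The weight decay name comes from gradient descent on the same criterion: each update multiplies the weights by a factor slightly below 1 before applying the usual gradient step.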