@HaomingJiang 2017-10-04T21:35:45.000000Z 字数 6220 阅读 2585

ESL Booknotes

Booknotes ESL

Chapter2 Overview of Supervised Learning

2.1 Variable Types and Terminology

Variable Types: quantitative or qualitative； Corresponding tasks: regression or classification; the 3rd type is ordered categorical (low, mid, high).

2.2 Least Squares and Nearest Neighbors

Least Squares
$\hat{Y}=\hat{\beta}_0 + \sum(X_j\hat{\beta}_j)$
$\hat{\beta}_0$ is the intercept or bias.
$RSS(\beta) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)$ --> $\hat{\beta} = (\mathbf{X^TX})^{-1}\mathbf{X}^T\mathbf{y}$
KNN
LS: low variance and potentially high bias; KNN: high variance and low bias.

Thecniques to improve

KNN with kernal methods, closer->heavier
In high-dimensional spaces the distance kernels are modified to emphasizesome variable more than others.
Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.
Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models. ( $x^2, log(x)$ etc)
Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.

2.3 Statistical Decision Theory

1. minimum square error loss (expected prediction error: $EPE(f)=E(Y-f(x))^2$ )
  ---(by conditioning on X)----> $f(x)=E(Y|X=x)$
KNN use $\hat{f}(x)=Ave(y_i|x_i\in N_k(s))$ , with expectation is approximated by averaging over sample data, and conditioning at a point is relaxed to conditioning on some region “close” to the target point.
In fact, under mild regularity conditions on the joint probability distribution $Pr(X,Y)$ , one can show that as $N,K \rightarrow \inf, s.t. k/N \rightarrow 0, \hat{f}(x) \rightarrow E(Y|X=x)$ , convergence rate drops when dimensions raise.
For LM, $EPE$ leads to $\beta = [E(X^TX)]^{-1}E(XY)$
if use L1 norm $f(x)=median(Y|X=x)$
when the output is a categorical variable G, use a matrix L to denote the error. After condition, $EPE = E_X(\sum_{k=1}^KL[\mathcal{G}_k, \hat{G}(X)]Pr(\mathcal{G}_k|X))$ . --> $\hat{G}(x) = argmin_{g\in \mathcal{G}}\sum_{k=1}^KL[\mathcal{G}_k, g]Pr(\mathcal{G}_k|X=x))$ . With 0-1 rule, it is the Bayes classifier, $\hat{G}(x) = max_{g∈G} Pr(g|X = x).$ . The error rate of the Bayes classifier is called
the Bayes rate.

2.4 Local Methods in High Dimensions

Problems in High dimension space:
1. Such neighborhoods are no longer “local.” In order to caputer the same fraction number of neighbors, the average distance increases exponantially with the degree of dimension.
2. Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample

2.5 Classes of Restricted Estimators

Roughness Penalty and Bayesian Methods

$PRSS(f; λ) = RSS(f) + λJ(f)$
e.g. cubic smoothing spline: $J(f)=\int f''(x)^2dx$
Can be cast in a Bayesian framework: The penalty J corresponds
to a log-prior, and PRSS(f; λ) the log-posterior distribution.

Kernel Methods and Local Regression

Basis Functions and Dictionary Methods

These adaptively chosen basis function methods are also known as dictionary methods, where one has available a possibly infinite set or dictionary D of candidate basis functions from which to choose, and models are built up by employing some kind of search mechanism.

Chapter3 Linear Methods for Regression

3.1 Linear Regression Models and Least Squares

variables Xj can come from different sources:
- quantitative inputs;
- transformations of quantitative inputs, such as log, square-root or square
- basis expansions, such as X2 = X1^2, leading to a polynomial representation
- numeric or “dummy” coding of the levels of qualitative inputs
- interactions between variables

$H = X(X^TX)^{-1}X^Ty$

The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in X.

Model Significancy: F-Statistics

The Gauss–Markov Theorem: the least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates.

If the inputs $X = (x_1,x_2,...,x_p)$ , $<x_i,x_j> = 0, \ for\ i \neq j$ . Then $\hat{\beta_i} = \frac{<x_i,y>}{<x_i,x_i>}$ , by looking at $(X^TX)^{-1}X^Ty$ . Which leads to Regression by Successive Orthogonalization:

z0 = x0 = 1;
For j = 1,2,...,p:
    Regress xj on z0,...z(j-1)
Regress y on zp to get beta_p

If $x_p$ is highly related to other $x_k$ , $z_p$ will be too small and $\hat{\beta}_p$ will very unstable.

$Var(\hat{\beta}_p) = \frac{\sigma^2}{||z_p||^2}$

3.2 Subset Selection

Movtivation

improving accuracy: the least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy
Interpretation: With a large number of predictors, we often would like to determine a smaller subset that exhibit the strongest effects. In order to get the “big picture,” we are willing to sacrifice some of the small details.

Methods

Best-Subset Selection:

leaps and bounds procedure

Forward- and Backward-Stepwise Selection:

Forwardstepwise (greedy)
selection starts with the intercept, and then sequentially adds into the model the predictor that most improves the fit. With many candidate predictors, this might seem like a lot of computation; however, clever updating algorithms can exploit the QR decomposition for the current fit to rapidly establish the next candidate (Exercise 3.9).
Compared to BSS, a price is paid in variance for selecting the best subset of each size; forward stepwise is a more constrained search, and will have lower variance, but perhaps more bias.
Backward-stepwise selection
starts with the full model, and sequentially deletes the predictor that has the least impact on the fit. The candidate for dropping is the variable with the smallest Z-score

Forward-Stagewise Regression

Like forward-stepwise regression. At each step the algorithm identifies the variable most correlated with the current residual.
Unlike forward-stepwise regression, none of the other variables are adjusted when a term is added to the model. As a consequence, forward stagewise can take many more than p steps to reach the least squares fit, and historically has been dismissed as being inefficient. It turns out that this “slow fitting” can pay dividends in high-dimensional problems.

3.3 Shrinkage Methods

Shrinkage methods are more continuous, and don’t suffer as much from high variability.

Ridge Regression

Just add penalty $\lambda ||\beta||_2^2$
Called weight decay in neural network.