@Gwater 2017-08-06T08:22:58.000000Z · Word count: 3535 · Reads: 226

Kernel Method

machine_learning


Many linear parametric models can be reformulated in a 'dual representation', in which the predictions are likewise based on linear combinations of a kernel function evaluated at the training points. As we shall see, for models based on a nonlinear feature space mapping $\phi(x)$, the kernel function is given by the relation

$$k(x, x') = \phi(x)^T \phi(x')$$


so that the kernel is a symmetric function of its arguments, $k(x, x') = k(x', x)$.
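As a quick numeric illustration of the relation $k(x, x') = \phi(x)^T\phi(x')$ and its symmetry, here is a minimal sketch; the particular feature map `phi` below is a made-up example, not one from the text:

```python
import numpy as np

def phi(x):
    # hypothetical nonlinear feature map: (x1, x2, x1*x2)
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, xp):
    # kernel induced by phi: k(x, x') = phi(x)^T phi(x')
    return phi(x) @ phi(xp)

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k(x, xp))              # inner product in feature space
print(k(x, xp) == k(xp, x))  # symmetry: k(x, x') = k(x', x)
```

Any function constructed this way, as an inner product in some feature space, is automatically a valid symmetric kernel.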


1. Dual Representations

Many linear models for regression and classification can be reformulated in terms of a dual representation, in which the kernel function arises naturally. This concept will play an important role when we consider support vector machines in the next chapter. Here we consider a linear regression model whose parameters are determined by minimizing the regularized sum-of-squares error function

$$J(w) = \frac{1}{2}\sum_{n=1}^{N}\left(w^T\phi(x_n) - t_n\right)^2 + \frac{\lambda}{2}w^T w$$

with regularization coefficient $\lambda \ge 0$.


Setting the gradient of the error function $J$ with respect to $w$ equal to zero gives:

$$w = -\frac{1}{\lambda}\sum_{n=1}^{N}\left(w^T\phi(x_n) - t_n\right)\phi(x_n)$$

Now we define $a_n = -\frac{1}{\lambda}\left(w^T\phi(x_n) - t_n\right)$, so we can get this equation:

$$w = \Phi^T a$$

where $\Phi$ is the design matrix whose $n$-th row is $\phi(x_n)^T$, and $a = (a_1, \ldots, a_N)^T$.

Instead of working with the parameter vector $w$, we reformulate $J$ in terms of the parameter vector $a$. Substituting $w = \Phi^T a$, we obtain

$$J(a) = \frac{1}{2}a^T\Phi\Phi^T\Phi\Phi^T a - a^T\Phi\Phi^T t + \frac{1}{2}t^T t + \frac{\lambda}{2}a^T\Phi\Phi^T a$$

where $t = (t_1, \ldots, t_N)^T$.

We now define the Gram matrix $K = \Phi\Phi^T$, an $N \times N$ symmetric matrix; then we can easily get its elements

$$K_{nm} = \phi(x_n)^T\phi(x_m) = k(x_n, x_m).$$

In terms of the Gram matrix, the sum-of-squares error function can be written as:

$$J(a) = \frac{1}{2}a^T K K a - a^T K t + \frac{1}{2}t^T t + \frac{\lambda}{2}a^T K a$$

Setting the gradient of $J$ with respect to $a$ equal to zero, we get:

$$a = (K + \lambda I_N)^{-1} t$$

If we substitute this back into the linear regression model, we obtain the following prediction for a new input $x$:

$$y(x) = w^T\phi(x) = a^T\Phi\phi(x) = k(x)^T(K + \lambda I_N)^{-1}t$$

where the vector $k(x)$ has elements $k_n(x) = k(x_n, x)$.
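The dual solution and prediction above can be sketched in a few lines of NumPy; the toy data, the choice of a linear kernel, and the value of the regularization coefficient are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: N samples with D input dimensions (illustrative values)
N, D, lam = 20, 2, 0.1
X = rng.normal(size=(N, D))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

def kernel(A, B):
    # linear kernel k(x, x') = x^T x', evaluated between all row pairs
    return A @ B.T

K = kernel(X, X)                             # Gram matrix, N x N
a = np.linalg.solve(K + lam * np.eye(N), t)  # a = (K + lam*I_N)^(-1) t

def predict(x_new):
    # y(x) = k(x)^T (K + lam*I_N)^(-1) t, with k_n(x) = k(x_n, x)
    return kernel(np.atleast_2d(x_new), X) @ a

print(predict(X[0]))  # prediction at one of the training inputs
```

Note that the training inputs appear only through kernel evaluations, never through explicit feature vectors.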

Thus, we see that the dual formulation allows the solution of the least-squares problem to be expressed entirely in terms of the kernel function. However, in the dual formulation we determine $a$ by inverting an $N \times N$ matrix, whereas in the original parameter space formulation we invert an $M \times M$ matrix, where $M$ is the dimensionality of the feature space. Generally, $N$ is much larger than $M$, so at first sight the dual formulation does not seem to be a good method. Its advantage is that, being expressed entirely in terms of $k(x, x')$, it lets us work directly with kernels and avoid explicit feature vectors.
Now, with $\phi(x) = x$ we have $k(x, x') = x^T x'$; this kernel function is called the linear kernel. A linear model can thus be expressed in a dual formulation, into which we can easily introduce the kernel method. So what happens to our model when we change the kernel function?
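Before changing the kernel, it is worth checking numerically that for the linear kernel the primal ($M \times M$ solve) and dual ($N \times N$ solve) formulations give the same prediction; a minimal sketch on made-up random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, lam = 50, 3, 0.5
Phi = rng.normal(size=(N, M))   # design matrix (phi(x) = x here)
t = rng.normal(size=N)

# primal: w = (Phi^T Phi + lam*I_M)^(-1) Phi^T t  -- an M x M solve
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# dual: a = (K + lam*I_N)^(-1) t with K = Phi Phi^T -- an N x N solve
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)

x_new = rng.normal(size=M)
y_primal = w @ x_new
y_dual = (Phi @ x_new) @ a      # k(x)^T a with the linear kernel
print(np.isclose(y_primal, y_dual))  # True: the two formulations agree
```

The agreement follows from the matrix identity $(\Phi^T\Phi + \lambda I_M)^{-1}\Phi^T = \Phi^T(\Phi\Phi^T + \lambda I_N)^{-1}$.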

2. Constructing Kernels

In order to exploit the power of kernels, we try to replace the linear kernel with other kernel functions. As a simple example, consider a kernel function given by:

$$k(x, z) = (x^T z)^2$$


If we take the particular case of a two-dimensional input vector $x = (x_1, x_2)$, we can expand out the terms and identify the corresponding nonlinear feature mapping:

$$k(x, z) = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)(z_1^2, \sqrt{2}\,z_1 z_2, z_2^2)^T = \phi(x)^T\phi(z)$$

So $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)^T$: it includes all possible second-order terms, each with a particular weight. We obtain a higher-dimensional feature space, so that a linear model can be trained to solve a nonlinear problem.
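The expansion above is easy to verify numerically: evaluating the kernel directly and taking the inner product of the explicit feature vectors give the same value (the test inputs are arbitrary illustrative values):

```python
import numpy as np

def k(x, z):
    # second-order polynomial kernel k(x, z) = (x^T z)^2
    return (x @ z) ** 2

def phi(x):
    # the corresponding explicit feature map for 2-D input
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k(x, z))          # (1*3 + 2*(-1))^2 = 1
print(phi(x) @ phi(z))  # same value via the explicit feature map
```

Note that evaluating $k(x, z)$ directly costs one 2-D inner product, while the explicit map works in a 3-D space; for higher-order kernels this gap grows rapidly, which is the point of working with kernels.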
I consider the kernel function a pre-processing of the feature space: choosing a suitable kernel function is just like choosing the feature map. It can also be thought of as finding a hyperplane in a high-dimensional space to classify samples that cannot be separated linearly in their original space.
