@spiritnotes 2016-02-25T14:56:05.000000Z

Coursera: Machine Learning

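Machine Learning, Open Course

1 Introduction

Machine Learning

Grew out of AI research
A new capability for computers

Examples:

Data mining
Applications that cannot be programmed by hand: autonomous flight, handwriting recognition, NLP, computer vision
Personalized recommendations, etc.
Understanding human learning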

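Supervised Learning

Supervised learning: the input is (data, right_answer) pairs
Regression: predict a continuous value
Classification: predict a discrete value

Unsupervised Learning

Discover the structure inside the data
Examples: news clustering, social network analysis, market segmentation, astronomical data analysis, the cocktail party problem (two people talking at once; the voices are separated using SVD)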

Univariate Linear Regression

Hypothesis: $h_\theta(x)=\theta_0+\theta_1x$
Cost function: $J(\theta_0,\theta_1)={1\over 2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$
Gradient descent:

$\theta_j := \theta_j-\alpha{\partial\over \partial \theta_j}J(\theta_0,\theta_1)$

$\theta_0 := \theta_0-{\alpha\over m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})$

$\theta_1 := \theta_1-{\alpha\over m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x^{(i)}$

$\theta_0,\theta_1$ are updated simultaneously: both new values are computed from the current ones before the next iteration starts
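The update rule above can be sketched in a few lines of Python. This is only a minimal illustration, not the course's own code: the names `gradient_descent` and `cost`, the toy data, and the values chosen for `alpha` and `iterations` are assumptions made for this example.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this noise-free data the fitted parameters should approach the generating values, and `cost(theta0, theta1, xs, ys)` should approach zero.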

Linear Algebra

Matrix: $\mathbb{R}^{4×2}$ is the set of 4×2 matrices; $A_{ij}$ is the entry in row i, column j
Vector: an n×1 matrix, $V^{n×1}$
Matrix + matrix: $C=A+B\rightarrow C_{ij}=A_{ij}+B_{ij}$
Matrix × scalar (scalar multiplication): $B = k×A\rightarrow B_{ij}=k×A_{ij};{A\over k}={1\over k}×A$
Matrix × vector: $B^{N×1}=A^{N×M}×V^{M×1}\rightarrow B_{i}=\sum_{j=1}^MA_{ij}×V_{j}$
Matrix × matrix: $C^{N×L}=A^{N×M}×B^{M×L}\rightarrow C_{il}=\sum_{j=1}^MA_{ij}×B_{jl}$
Matrix properties: $A×B\ne B×A$ in general (not commutative); $A×B×C=A×(B×C)$ (associative)
Identity matrix: $I×Z = Z×I = Z$
Inverse: if A is an m×m matrix and has an inverse, $AA^{-1}=A^{-1}A=I$; not every matrix has an inverse
Transpose: $(A^T)_{ji} = A_{ij}$
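The matrix rules listed above are easy to check on small examples. The sketch below uses plain nested lists; the helper names `matmul`, `transpose`, and `identity` and the matrices `A`, `B`, `C` are made up for this illustration.

```python
def matmul(A, B):
    """C[i][l] = sum over j of A[i][j] * B[j][l]."""
    return [
        [sum(A[i][j] * B[j][l] for j in range(len(B))) for l in range(len(B[0]))]
        for i in range(len(A))
    ]


def transpose(A):
    """Entry (j, i) of the result is entry (i, j) of A."""
    return [list(column) for column in zip(*A)]


def identity(n):
    """n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

not_commutative = matmul(A, B) != matmul(B, A)
associative = matmul(matmul(A, B), C) == matmul(A, matmul(B, C))
```

Here `A×B` swaps the columns of `A` while `B×A` swaps its rows, so the two products differ.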

2 Multivariate Linear Regression

Multiple features (variables)

Let $x_0=1$; then

$h_\theta(x)=\theta^Tx=\sum_{i=0}^n\theta_ix_i$

Gradient descent

$\theta_j := \theta_j-\alpha{1\over m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$

Vectorized, with $X$ the m×(n+1) matrix whose rows are the $x^{(i)}$: $\theta := \theta-\alpha{1\over m}X^T(X\theta-y)$
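As a rough sketch of the multivariate update, the code below applies one gradient step to all components of $\theta$ at once. It uses plain lists rather than a matrix library; the names `predict` and `gradient_step`, the toy data, and the settings `alpha=0.3` and 5000 iterations are assumptions for this example.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already starts with x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def gradient_step(theta, X, y, alpha):
    """One simultaneous update: theta := theta - alpha/m * X^T (X theta - y)."""
    m = len(X)
    errors = [predict(theta, row) - yi for row, yi in zip(X, y)]
    # Component j of X^T (X theta - y), divided by m.
    grad = [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]


# Toy data generated from y = 1 + 2*x1 - 3*x2 (hypothetical).
X = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
y = [-2.0, 3.0, 0.0, 2.0]

theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.3)
```

Because the whole gradient is computed before any component of `theta` changes, the update is simultaneous by construction.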

Feature scaling

Bring every feature into roughly the range -1 to 1
$x := {x \over max}$

Mean normalization
$x := {x-mean\over max - min}$
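Mean normalization is short enough to show directly. In this sketch the function name `mean_normalize` and the sample house sizes are hypothetical, and returning zeros for a constant feature is a choice made here, not something the notes specify.

```python
def mean_normalize(values):
    """Return (x - mean) / (max - min) for every x in values."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        # A constant feature carries no information; map it to zeros (assumption).
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


sizes = [1000.0, 1500.0, 2000.0, 3000.0]  # hypothetical feature values
scaled = mean_normalize(sizes)
```

After scaling, the values have zero mean and span a range of width 1, so every one of them lies between -1 and 1.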

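Learning Rate

$J(\theta)$ should decrease on every iteration; if it does not, use a smaller rate
Declare convergence when $J(\theta)$ decreases by less than some small threshold in one iteration
Choosing $\alpha$: try values spaced by roughly a factor of 3 (e.g. 0.001, 0.003, 0.01, 0.03, ...)

Non-linear transformations

New features can be built from existing ones, e.g. house area = length × width

Polynomial regression: use $x,x^2,x^3,...$ as features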

Normal Equation

${1\over m}X^T(h_\theta(x)-y)=0\rightarrow$

$X^T(X\theta-y)=0\rightarrow$

$\theta=(X^TX)^{-1}X^Ty$
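The closed-form solution can be sketched without a linear algebra library. Note one deliberate substitution: instead of forming $(X^TX)^{-1}$ explicitly, the code solves the equivalent linear system $X^TX\theta=X^Ty$ by Gaussian elimination. The names `solve` and `normal_equation` and the toy data are assumptions for this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (not invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - known) / M[i][i]
    return x


def normal_equation(X, y):
    """Return theta satisfying X^T X theta = X^T y."""
    n = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return solve(XtX, Xty)


# Toy data generated from y = 1 + 2*x1 - 3*x2 (hypothetical); x_0 = 1 in every row.
X = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]]
y = [-2.0, 3.0, 0.0, 2.0]
theta = normal_equation(X, y)
```

No learning rate and no iteration are involved; the cost is concentrated in solving the (n+1)×(n+1) system.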

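Comparison (m samples, n features):
Gradient descent	Normal equation
Needs a choice of $\alpha$	No $\alpha$ to choose
Needs many iterations	No iterations
	Needs to compute $(X^TX)^{-1}$, which costs $O(n^3)$
Works well even when n is large	Slow when n is very large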

3 Logistic Regression

Hypothesis representation

Sigmoid function / logistic function: $g(z) = {1\over 1+e^{-z}}$
$h_\theta(x)=g(\theta^Tx)={1\over 1+e^{-\theta^Tx}}=P(y=1|x;\theta)$, the probability that y=1 given input x

Decision Boundary

Predict $y=1$ when $h_\theta(x)\ge 0.5$, i.e. $\theta^Tx\ge 0$
Predict $y=0$ when $h_\theta(x)\lt 0.5$, i.e. $\theta^Tx\lt 0$

Decision boundary: $\theta^Tx = 0$
Non-linear decision boundaries: $h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1^2+\theta_4x_2^2...)$
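A small sketch may help connect the hypothesis to the decision boundary. The parameter vector `theta` below is hypothetical and was picked so that the boundary is the line $x_1+x_2=3$; the function names are likewise assumptions.

```python
import math


def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x), with x[0] = 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def predict(theta, x):
    """Predict 1 when h_theta(x) >= 0.5, which is the same as theta^T x >= 0."""
    return 1 if hypothesis(theta, x) >= 0.5 else 0


theta = [-3.0, 1.0, 1.0]  # hypothetical: boundary is x1 + x2 = 3
above = predict(theta, [1.0, 2.0, 2.0])   # x1 + x2 = 4, on the y = 1 side
below = predict(theta, [1.0, 1.0, 1.0])   # x1 + x2 = 2, on the y = 0 side
```

Points exactly on the boundary give $h_\theta(x)=0.5$ and are assigned to class 1 under the $\ge$ rule.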

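Cost function

With the squared error as the cost function, $J(\theta)$ is non-convex for logistic regression, so reaching the global optimum is not guaranteed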

$Cost(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)), & \text{if $y=1$; the cost grows as $h_\theta(x)$ approaches 0} \\ -\log(1 - h_\theta(x)), & \text{if $y=0$; the cost grows as $h_\theta(x)$ approaches 1} \end{cases}$

Derivative of the logarithm: $(\log_a x)'={1\over x\ln a}$
${1\over {1\over h_\theta(x)}\ln 2}\rightarrow h_\theta(x)\rightarrow$

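Simplified cost function and gradient descent

$Cost(h_\theta(x),y) =-y\log(h_\theta(x))-(1-y)\log(1 - h_\theta(x))$
With $J(\theta)={1\over m}\sum_{i=1}^mCost(h_\theta(x^{(i)}),y^{(i)})$, differentiating gives the update:

$\theta_j := \theta_j-{\alpha\over m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$
The update has the same form as in linear regression; only the hypothesis differs:

$h=\theta^T x$ for linear regression, and

$h={1\over 1+e^{-\theta^Tx}}$ for logistic regression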
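The following sketch trains a logistic regression model with the update above, including the ${1\over m}$ factor. Everything specific here is an assumption for illustration: the function names, the one-feature toy data, `alpha=0.3`, and the iteration count.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hypothesis(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y):
    """J(theta) = 1/m * sum of -y*log(h) - (1-y)*log(1-h)."""
    total = 0.0
    for row, yi in zip(X, y):
        h = hypothesis(theta, row)
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / len(X)


def gradient_step(theta, X, y, alpha):
    """Same form as linear regression, but h is the sigmoid of theta^T x."""
    m = len(X)
    errors = [hypothesis(theta, row) - yi for row, yi in zip(X, y)]
    grad = [sum(e * row[j] for e, row in zip(errors, X)) / m for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]


# Hypothetical data: one feature, small values are class 0, large values class 1.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]

theta = [0.0, 0.0]
initial_cost = cost(theta, X, y)
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.3)
final_cost = cost(theta, X, y)
predictions = [1 if hypothesis(theta, row) >= 0.5 else 0 for row in X]
```

With all parameters at zero every prediction is 0.5, so the initial cost equals $\ln 2$; training should drive it well below that on this separable toy set.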

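Advanced optimization

Conjugate gradient
BFGS
L-BFGS

No need to pick $\alpha$ by hand
Usually faster than gradient descent
More complex

Multi-class classification

one-vs-all (one-vs-rest): for each class $c_i$, train a classifier $h^{c_i}_\theta(x)=P(y=c_i|x;\theta)$; to predict, pick the class whose classifier outputs the largest value

Data from three classes that divide a circular plane into three equal 120-degree sectors will still contain misclassified points.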
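The one-vs-all prediction step can be sketched as an argmax over the per-class classifiers. The three parameter vectors below are hand-picked hypothetical values, not trained ones, and the class labels `"a"`, `"b"`, `"c"` are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_one_vs_all(classifiers, x):
    """Return the label whose classifier gives the largest h_theta(x)."""
    scores = {
        label: sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for label, theta in classifiers.items()
    }
    return max(scores, key=scores.get)


# Hypothetical parameters: "a" fires for large x1, "b" for large x2,
# "c" when both features are small. x[0] = 1 is the bias input.
classifiers = {
    "a": [-2.0, 4.0, 0.0],
    "b": [-2.0, 0.0, 4.0],
    "c": [2.0, -4.0, -4.0],
}

label_1 = predict_one_vs_all(classifiers, [1.0, 1.0, 0.0])
label_2 = predict_one_vs_all(classifiers, [1.0, 0.0, 1.0])
label_3 = predict_one_vs_all(classifiers, [1.0, 0.0, 0.0])
```

In practice each parameter vector would come from training a binary logistic regression with that class as the positive label.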

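4 Neural Networks: Representation

Non-linear hypotheses

Making the features non-linear with all pairwise products $x_ix_j$ gives $O(n^2)$ features, roughly ${n^2\over 2}$

Computer vision: deciding whether an image shows a car
50×50 pixels -> 2500 features if grayscale, 7500 if RGB

Neurons and the brain

Origin: algorithms that try to mimic the brain
Development: widely used in the 80s and 90s
Now: state of the art for many applications

The "one learning algorithm" hypothesis: the brain learns touch, hearing, and vision with one and the same algorithm. Example: human echolocation (sonar)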

Representation

Logistic unit

Sigmoid (logistic) activation function
$x_0$: bias unit

Neural Network

Every layer except the output layer has a bias unit
Input layer: 1
Hidden layers: n
Output layer: 1

$a_i^{(j)}=$ "activation" of unit i in layer j
$\Theta^{(j)}=$ matrix of weights controlling the function mapping from layer j to layer j+1; in $\Theta_{kl}^{(j)}$, k is the index of the node in layer j+1 and l is the index of the node in layer j

If layer j has $s_j$ nodes and layer j+1 has $s_{j+1}$ nodes, then $\Theta^{(j)}$ has dimension $s_{j+1}×(s_j+1)$

$a_i^{(j)}=g\left(\sum_{k=0}^{s_{j-1}}\Theta_{ik}^{(j-1)}a_k^{(j-1)}\right)$ for $j\ge 2$, $i\ge 1$; $a^{(j)}_0=1$; $a^{(1)}_i=x_i$

The neural network learns its own features; there can be more than one hidden layer

Examples

Non-linear classification example: XOR/XNOR

$x_1,x_2 \in \{0,1\}$

$y=x_1 \space AND\space x_2$: $h_\theta(x)=g(-30+20x_1+20x_2)$
$y=x_1 \space OR\space x_2$: $h_\theta(x)=g(-10+20x_1+20x_2)$
$y=NOT \space x_1$: $h_\theta(x)=g(10-20x_1)$
$y=((NOT\space x_1) \space AND\space(NOT\space x_2))$: $h_\theta(x)=g(10-20x_1-20x_2)$
$y=x_1 \space XNOR\space x_2$: $(x_1\space AND\space x_2)\space OR\space ((NOT\space x_1)\space AND\space(NOT\space x_2))$
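The XNOR construction can be checked by wiring the units listed above into a two-layer network. The weights are the ones given in these notes; the function names `g` and `xnor` are chosen for this sketch.

```python
import math


def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def xnor(x1, x2):
    """Hidden layer computes AND and (NOT x1) AND (NOT x2); output unit computes OR."""
    a1 = g(-30 + 20 * x1 + 20 * x2)    # x1 AND x2
    a2 = g(10 - 20 * x1 - 20 * x2)     # (NOT x1) AND (NOT x2)
    return g(-10 + 20 * a1 + 20 * a2)  # a1 OR a2


truth_table = {(x1, x2): round(xnor(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)}
```

Rounding the output gives 1 exactly when the two inputs are equal, which a single logistic unit cannot represent.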

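Multi-class classification

Multiple output units: one-vs-all
$h_\theta(x)\approx[1,0,..,0]^T$ for the first class, $h_\theta(x)\approx[0,1,..,0]^T$ for the second, and so on; y is a vector rather than a single discrete value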

5 Neural Networks: Learning

Cost function

$L$: total number of layers in the network
$s_l$: number of units (not counting the bias unit) in layer $l$
$h_\theta(x)\in \mathbb R^K,(h_\theta(x))_i=i^{th}\text{ output}$

$J(\theta)=-{1\over m}\left[\sum_{i=1}^m\sum_{k=1}^Ky_k^{(i)}\log((h_\theta(x^{(i)}))_k)+(1-y_k^{(i)})\log(1-(h_\theta(x^{(i)}))_k)\right]+{\lambda\over 2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\theta^{(l)}_{ji})^2$

Backpropagation algorithm

Forward propagation: $z^{(i+1)}=\theta^{(i)}a^{(i)};a^{(i+1)}=g(z^{(i+1)})\space(add\space a_0^{(i+1)})$
Intuition: $\delta_j^{(l)} =$ "error" of node j in layer l
For each output unit (layer L=4): $\delta_j^{(4)}=a_j^{(4)}-y_j$
Vectorized: $\delta^{(4)}=a^{(4)}-y$
$\delta^{(3)}=(\theta^{(3)})^T\delta^{(4)} .*g'(z^{(3)})$
$\delta^{(2)}=(\theta^{(2)})^T\delta^{(3)} .*g'(z^{(2)})$
For the sigmoid, $g'(z^{(l)})=a^{(l)}.*(1-a^{(l)})$
The algorithm:
Set $\Delta_{ij}^{(l)}=0$ (for all l, i, j)
For each training example i = 1 to m:
Set $a^{(1)}=x^{(i)}$
Perform forward propagation to compute $a^{(l)}$ for l = 2, 3, ..., L
Using $y^{(i)}$, compute $\delta^{(L)}=a^{(L)}-y^{(i)}$
Compute $\delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)}$
$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$

After the loop:
$D_{ij}^{(l)} := {1\over m}\Delta_{ij}^{(l)}+{\lambda\over m}\theta_{ij}^{(l)}\text{ if $j\ne 0$}$
$D_{ij}^{(l)} := {1\over m}\Delta_{ij}^{(l)}\text{ if $j= 0$}$
${\partial\over \partial\theta_{ij}^{(l)}}J(\theta)=D_{ij}^{(l)}$
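The steps above can be sketched for a network with one hidden layer (L = 3). This is an illustrative implementation under stated assumptions, not the course's code: the function names, the tiny XNOR data set, the starting weights, and `lam = 1.0` are invented, and the regularization term is taken as ${\lambda\over m}\theta$ so that it matches the ${\lambda\over 2m}$ term of the cost function.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(Theta1, Theta2, x):
    """Forward propagation; a bias unit of 1 is prepended to layers 1 and 2."""
    a1 = [1.0] + list(x)
    a2 = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in Theta1]
    a3 = [sigmoid(sum(w * a for w, a in zip(row, a2))) for row in Theta2]
    return a1, a2, a3


def cost(Theta1, Theta2, X, Y, lam):
    """Regularized cross-entropy cost J(theta); bias weights are not penalized."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, Y):
        a3 = forward(Theta1, Theta2, x)[2]
        total += sum(-t * math.log(a) - (1 - t) * math.log(1 - a) for a, t in zip(a3, y))
    reg = sum(w * w for Theta in (Theta1, Theta2) for row in Theta for w in row[1:])
    return total / m + lam / (2 * m) * reg


def backprop(Theta1, Theta2, X, Y, lam):
    """Return D1, D2, the partial derivatives of J with respect to Theta1, Theta2."""
    m = len(X)
    Delta1 = [[0.0] * len(row) for row in Theta1]
    Delta2 = [[0.0] * len(row) for row in Theta2]
    for x, y in zip(X, Y):
        a1, a2, a3 = forward(Theta1, Theta2, x)
        delta3 = [a - t for a, t in zip(a3, y)]
        # delta2 = (Theta2^T delta3) .* a2 .* (1 - a2), skipping the bias unit j = 0.
        delta2 = [
            sum(Theta2[k][j] * delta3[k] for k in range(len(delta3))) * a2[j] * (1 - a2[j])
            for j in range(1, len(a2))
        ]
        for i, d in enumerate(delta3):
            for j, a in enumerate(a2):
                Delta2[i][j] += a * d
        for i, d in enumerate(delta2):
            for j, a in enumerate(a1):
                Delta1[i][j] += a * d

    def average(Delta, Theta):
        # Regularize every column except the bias column j = 0.
        return [
            [Delta[i][j] / m + (lam / m * Theta[i][j] if j != 0 else 0.0)
             for j in range(len(Theta[i]))]
            for i in range(len(Theta))
        ]

    return average(Delta1, Theta1), average(Delta2, Theta2)


# Hypothetical setup: 2 inputs, 2 hidden units, 1 output, XNOR targets.
Theta1 = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.5]]
Theta2 = [[-0.3, 0.2, 0.6]]
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[1.0], [0.0], [0.0], [1.0]]
lam = 1.0
D1, D2 = backprop(Theta1, Theta2, X, Y, lam)
```

`D1` and `D2` have the same shapes as `Theta1` and `Theta2`, so they can be handed to gradient descent or to an advanced optimizer unchanged.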

unrolling parameters

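Unroll Theta1, Theta2, ... into one long parameter vector
Unroll D1, D2, ... into one gradient vector (Dvec) in the same order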

Gradient checking

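Approximate the derivative numerically with a small perturbation, e.g. $\epsilon=10^{-4}$

Two-sided difference (more accurate):
${\partial\over \partial\theta} J(\theta)\approx {J(\theta+\epsilon) - J(\theta-\epsilon)\over 2\epsilon}$

One-sided difference:
${\partial\over \partial\theta} J(\theta)\approx {J(\theta+\epsilon) - J(\theta)\over \epsilon}$

Perturb each $\theta_j$ separately to get gradApprox(j), then check that gradApprox is approximately equal to Dvec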
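A generic gradient checker following the two-sided formula might look like the sketch below. The cost function `J` and its hand-derived gradient `analytic_gradient` are hypothetical stand-ins for a real network's cost and its backpropagation output.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided difference, perturbing one component of theta at a time."""
    grad_approx = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += epsilon
        minus[j] -= epsilon
        grad_approx.append((J(plus) - J(minus)) / (2 * epsilon))
    return grad_approx


def J(theta):
    """Hypothetical cost used only to exercise the checker."""
    return theta[0] ** 2 + 3 * theta[0] * theta[1]


def analytic_gradient(theta):
    """Hand-derived gradient of J; plays the role of Dvec."""
    return [2 * theta[0] + 3 * theta[1], 3 * theta[0]]


theta = [1.0, -2.0]
grad_approx = numerical_gradient(J, theta)
d_vec = analytic_gradient(theta)
max_difference = max(abs(a - d) for a, d in zip(grad_approx, d_vec))
```

Each component needs two evaluations of `J`, so the check is expensive and is normally switched off once the analytic gradient has been verified.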

Random Initialization

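Zero initialization: every unit in a layer computes the same value and receives the same update, so the units never differentiate
Random initialization: symmetry breaking
Initialize each $\theta_{ij}^{(l)}$ to a random value in $[-\epsilon,\epsilon]$
Theta1 = rand(n,m)*(2*INIT_EPSILON)-INIT_EPSILON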
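For readers working in Python rather than Octave, the same initialization can be sketched with the standard library. The function name `rand_initialize_weights`, the default `init_epsilon=0.12`, and the `seed` parameter are assumptions made for this example.

```python
import random


def rand_initialize_weights(rows, cols, init_epsilon=0.12, seed=None):
    """Return a rows x cols matrix with entries drawn uniformly from [-eps, eps]."""
    rng = random.Random(seed)
    return [
        [rng.random() * 2 * init_epsilon - init_epsilon for _ in range(cols)]
        for _ in range(rows)
    ]


# Hypothetical layer: 3 units fed by 3 inputs plus a bias column.
Theta1 = rand_initialize_weights(3, 4, seed=0)
```

Because the entries differ from one another, the hidden units start from different functions and can learn different features.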

ALL

No. of input units: dimension of the features $x$
No. of output units: number of classes in $y$
Reasonable default: 1 hidden layer; if more than 1, use the same number of hidden units in every layer (usually the more the better)
