@betasy
2016-09-25T10:06:25.000000Z
Machine Learning, Andrew Ng
(quote) Machine learning is the science of getting computers to learn without being explicitly programmed.
Machine Learning at Stanford: visit ml-class.org to enroll
lecture slides:
Machine Learning
- Grew out of work in AI
- New capability for computers
Examples:
- Database mining
Large datasets from growth of automation/web.
E.g., web click data, medical records, biology, engineering
- Applications that can't be programmed by hand
E.g., autonomous helicopter, handwriting recognition, most of Natural Language Processing (NLP), Computer Vision
- Self-customizing programs
E.g., Amazon, Netflix product recommendations
- Understanding human learning (brain, real AI)
Unsupervised Learning
- the data comes with no labels
- given a dataset, can the algorithm find structure in it?
- clustering algorithms, e.g. Google News groups stories on similar topics
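The clustering idea can be sketched with a minimal k-means implementation (my own toy example in Python/NumPy, not code from the course; the data points are made up):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster is empty)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# two obvious groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(X, k=2)
print(labels)  # the first three points share one label, the last three the other
```

This is the same "find groups in unlabeled data" idea that powers the Google News example, just on six points instead of thousands of articles.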
applications:
- example: the cocktail party problem (separating overlapping audio sources)
- the course uses Octave for the programming exercises
Regression Problem
Predict real-valued output
training set <--> test set
(x, y): one training sample
(x^(i), y^(i)): the i-th training sample
Process of learning algorithms:
Training Set ------> Learning Algorithm ------> h (hypothesis): input x, output estimated y
h is a function that maps from x to y
How to represent h?
in this linear regression problem, h_θ(x) = θ_0 + θ_1 x (linear regression with one variable)
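The hypothesis for univariate linear regression, h_θ(x) = θ_0 + θ_1 x, is just a straight line, which translates directly to code (a minimal Python sketch; the parameter values are made up for illustration):

```python
def h(theta0, theta1, x):
    """Hypothesis for univariate linear regression: a straight line in x."""
    return theta0 + theta1 * x

# with theta0 = 1.0 and theta1 = 2.0 the hypothesis is the line y = 1 + 2x
print(h(1.0, 2.0, 3.0))  # 7.0
```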
Cost Function
the problem is to choose (calculate) the parameters θ_0, θ_1
idea: choose θ_0, θ_1 so that h_θ(x) is close to y for our training samples (x, y)
it's a minimization problem: minimize over θ_0, θ_1 the cost J(θ_0, θ_1) = 1/(2m) * Σ_{i=1..m} (h_θ(x^(i)) - y^(i))^2
Cost Function -- intuition I
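For intuition, the squared-error cost J(θ_0, θ_1) can be computed directly and compared for different parameter choices (a minimal sketch; the toy dataset is my own):

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# on data that lies exactly on the line y = x, the choice theta0=0, theta1=1
# fits perfectly, so the cost is zero; any other line costs more
xs, ys = [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]
print(cost(0.0, 1.0, xs, ys))  # 0.0
print(cost(0.0, 0.5, xs, ys))  # larger than 0: a worse fit
```

Plotting `cost` over a grid of (θ_0, θ_1) values gives the bowl-shaped surface and contour plots shown in the lecture.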
Gradient Descent
gradient descent is going to minimize the cost function
Have some function J(θ_0, θ_1)
want min over θ_0, θ_1 of J(θ_0, θ_1)
Gradient Descent Algorithm
repeat until convergence: θ_j := θ_j - α * ∂J(θ_0, θ_1)/∂θ_j (for j = 0 and j = 1, updating both simultaneously)
Gradient Descent -- intuition
gradient descent can converge to a local minimum even with the learning rate α fixed
as we approach a local minimum, the gradient shrinks, so gradient descent automatically takes smaller steps; there is no need to decrease α over time
Gradient Descent for linear regression
for the linear regression model, the partial derivatives of J(θ_0, θ_1) are:
∂J/∂θ_0 = 1/m * Σ_{i=1..m} (h_θ(x^(i)) - y^(i))
∂J/∂θ_1 = 1/m * Σ_{i=1..m} (h_θ(x^(i)) - y^(i)) * x^(i)
so put these derivatives back into the gradient descent algorithm:
repeat until convergence:
θ_0 := θ_0 - α * 1/m * Σ (h_θ(x^(i)) - y^(i))
θ_1 := θ_1 - α * 1/m * Σ (h_θ(x^(i)) - y^(i)) * x^(i)
(updating θ_0 and θ_1 simultaneously)
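Putting the derivatives into the update rule gives batch gradient descent for univariate linear regression. A minimal Python sketch (the toy data, the learning rate 0.1, and the iteration count are my own choices for illustration):

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1*x.

    Each step uses all m training examples (hence "batch"),
    and theta0, theta1 are updated simultaneously."""
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        # partial derivatives of J(theta0, theta1), averaged over all examples
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# data generated from y = 2x + 1; the fit should recover roughly those parameters
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # close to 1.0 and 2.0
```

Because the cost is convex, this converges to the single global optimum regardless of the starting point (here θ_0 = θ_1 = 0).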
it turns out that the cost function of the linear regression model is a convex function, which has no local optima other than the single global optimum
"Batch" Gradient Descent
"Batch": each step of gradient descent uses all training examples
Matrices and vectors
Matrix multiplication properties
Inverse and Transpose
singular and degenerate matrices
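These matrix facts are easy to check numerically. The course uses Octave; here is an equivalent sketch in Python/NumPy (the example matrices are my own):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])

# matrix multiplication is associative but generally NOT commutative
print(np.array_equal(A @ B, B @ A))  # False

# inverse: A @ A^{-1} = identity (exists only for non-singular square matrices)
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))  # True

# transpose swaps rows and columns
print(np.array_equal(A.T, np.array([[1.0, 3.0], [2.0, 4.0]])))  # True

# a singular (degenerate) matrix has determinant 0 and no inverse
S = np.array([[1.0, 2.0], [2.0, 4.0]])  # second row is twice the first
print(np.linalg.det(S))  # 0 (up to floating-point error)
```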