@zhouyy 2017-12-20T09:33:54.000000Z

O'Reilly Machine Learning

book

Definition and Scenarios

Machine Learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules.

To summarize, Machine Learning is great for:
• Problems for which existing solutions require a lot of hand-tuning (manual adjustment) or long lists of rules: one Machine Learning algorithm can often simplify code and perform better. (Tedious, time-consuming work)
• Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution. (Currently unsolved problems)
• Fluctuating environments: a Machine Learning system can adapt to new data. (Changing environments)
• Getting insights about complex problems and large amounts of data. (Complex problems)

Types

Machine Learning systems can be classified according to:
• Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)
• Whether or not they can learn incrementally on the fly (online versus batch learning)
• Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

Supervised/Unsupervised Learning

There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.

Supervised learning

The training data you feed to the algorithm includes the desired solutions, called labels. A typical supervised learning task is classification.
Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression.

Attribute vs. feature: an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.

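Note that some regression algorithms can be used for classification as well, and vice versa. For example, Logistic Regression is commonly used for classification, since it can output the probability of belonging to a given class.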
Some important algorithms:
• k-Nearest Neighbors
• Linear Regression
• Logistic Regression
• Support Vector Machines (SVMs)
• Decision Trees and Random Forests
• Neural networks
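
As a rough illustration of a regression-style algorithm used for classification, here is a minimal sketch of Logistic Regression trained by plain gradient descent. The dataset (number of suspicious words per email, with label 1 for spam) and the function names `train_logistic` and `predict` are made up for this example; a real project would normally rely on a library implementation.

```python
import math

def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Turn the predicted probability into a class label (1 = spam)."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical labeled training set: feature = suspicious word count, label = spam or not
xs = [0, 1, 2, 3, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic(xs, ys)
print(predict(w, b, 1), predict(w, b, 8))
```

The model itself outputs a number (a probability), which is why it counts as a regression technique; thresholding that number at 0.5 is what turns it into a classifier.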

Unsupervised learning

The training data is unlabeled. Some important algorithms:
• Clustering
— k-Means
— Hierarchical Cluster Analysis (HCA)
— Expectation Maximization
• Visualization and dimensionality reduction
— Principal Component Analysis (PCA)
— Kernel PCA
— Locally-Linear Embedding (LLE)
— t-distributed Stochastic Neighbor Embedding (t-SNE)
• Association rule learning
— Apriori
— Eclat
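
To make clustering concrete, here is a minimal k-Means sketch in pure Python. The points and the choice of initial centroids are invented for illustration, and the function name `kmeans` is just a label for this sketch; production code would usually pick initial centroids more carefully and stop on convergence.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data: two visually obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

No labels are given anywhere: the grouping emerges only from the distances between points.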

Tasks:
1. Dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.
2. Anomaly detection: for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm.
3. Association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes.
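
The anomaly detection task can be sketched with a simple robust statistic. The example below flags values whose modified z-score (based on the median and the median absolute deviation) exceeds a threshold; this particular technique, the 3.5 threshold, and the transaction amounts are assumptions chosen for illustration, not something prescribed by the notes above.

```python
from statistics import median

def find_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical credit card transaction amounts
amounts = [12, 15, 14, 13, 16, 15, 500]
print(find_anomalies(amounts))  # → [500]
```

The median is used instead of the mean because a single extreme value would otherwise inflate the very statistics used to detect it.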

Semisupervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning. Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms.

Reinforcement Learning

The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return. It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
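
The agent, reward, and policy vocabulary can be illustrated with a tiny tabular Q-learning sketch. Q-learning is only one possible algorithm (the notes do not name one), and the corridor environment, the reward of 1 at the last cell, and all parameter values are assumptions made up for this example.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
GOAL = N_STATES - 1
ACTIONS = [-1, +1]    # move left or move right

def step(state, action):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values q[state][action_index] from experience."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            a = rng.randrange(len(ACTIONS))   # explore with purely random actions
            next_state, reward = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = q_learning()
# The policy: in each non-terminal state, pick the action with the highest value
policy = [ACTIONS[row.index(max(row))] for row in q_table[:GOAL]]
print(policy)
```

Here the policy is just a lookup table from situation (cell) to action (left or right), learned from rewards alone.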

Batch and Online Learning

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.
If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.
This solution is simple and often works fine, but training using the full set of data can take many hours, so you would typically train a new system only every 24 hours or even just weekly. If your system needs to adapt to rapidly changing data (e.g., to predict stock prices), then you need a more reactive solution.

Online learning (incremental learning)

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches.
Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data.
One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (you don’t want a spam filter to flag only the latest kinds of spam it was shown).
Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points.
A big challenge with online learning is that if bad data is fed to the system, the system’s performance will gradually decline.
To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance. You may also want to monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm).
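
The effect of the learning rate can be sketched with a one-parameter model updated one instance at a time. The model `y_hat = w * x`, the data stream, and the single bad data point are all invented for this illustration.

```python
def online_update(w, x, y, learning_rate):
    """One stochastic gradient step on the squared error of y_hat = w * x."""
    error = w * x - y
    return w - learning_rate * error * x

def train_online(stream, learning_rate, w=0.0):
    """Feed instances sequentially; the model is updated after each one."""
    for x, y in stream:
        w = online_update(w, x, y, learning_rate)
    return w

# Hypothetical stream following y = 2x, then one bad data point
stream = [(1.0, 2.0), (0.5, 1.0), (1.0, 2.0), (0.8, 1.6)] * 50
bad_data = [(1.0, 10.0)]

w_fast = train_online(stream + bad_data, learning_rate=0.5)
w_slow = train_online(stream + bad_data, learning_rate=0.05)
print(w_fast, w_slow)
```

Both models reach roughly w = 2 on the clean stream, but the single bad instance drags the high learning rate model much further away (to about 6) than the low learning rate one (to about 2.4), which is the inertia described above.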

Instance-Based Versus Model-Based Learning

One more way to categorize Machine Learning systems is by how they generalize. Most Machine Learning tasks are about making predictions. This means that given a number of training examples, the system needs to be able to generalize to examples it has never seen before. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.
There are two main approaches to generalization: instance-based learning and model-based learning.

Instance-based learning

Instead of just flagging emails that are identical to known spam emails, your spam filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.
This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases using a similarity measure.
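
The word-counting similarity measure described above can be sketched in a few lines. The stored spam examples, the threshold of three shared words, and the function names are assumptions for this example only.

```python
def common_words(email_a, email_b):
    """Very basic similarity: number of distinct words the two emails share."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

def is_spam(email, known_spam, threshold=3):
    """Flag the email if it shares enough words with any memorized spam email."""
    return any(common_words(email, spam) >= threshold for spam in known_spam)

# The "training" step is simply storing the known examples
known_spam = ["win a free prize now", "cheap loans approved instantly"]

print(is_spam("claim your free prize now", known_spam))  # → True
print(is_spam("meeting moved to friday", known_spam))    # → False
```

There is no model to fit here: all the work happens at prediction time, by comparing the new email against the memorized instances.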

Model-based learning

Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning.
For example, suppose you plot life satisfaction against GDP per capita for a number of countries. Although the data is noisy (i.e., partly random), it looks like life satisfaction goes up more or less linearly as the country’s GDP per capita increases. So you decide to model life satisfaction as a linear function of GDP per capita. This step is called model selection: you selected a linear model of life satisfaction with just one attribute, GDP per capita.
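
A minimal sketch of this linear model, fitted with ordinary least squares, might look as follows. The GDP and life satisfaction numbers are hypothetical placeholders (not real statistics), and `theta0` and `theta1` are just names chosen here for the intercept and slope.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = theta0 + theta1 * x for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    theta1 = cov / var
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1

# Hypothetical training data: GDP per capita (USD) and life satisfaction score
gdp = [10000, 20000, 30000, 40000, 50000]
satisfaction = [5.0, 5.6, 5.9, 6.6, 6.9]

theta0, theta1 = fit_line(gdp, satisfaction)

# Use the trained model to predict for a country not in the training set
prediction = theta0 + theta1 * 25000
print(round(theta0, 2), theta1, round(prediction, 2))
```

Once the two parameters are learned, the training examples are no longer needed to make a prediction, which is the key difference from instance-based learning.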

Challenges

The system will not perform well if your training set is too small, or if the data is not representative, noisy, or polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit).

Insufficient Quantity of Training Data (feed the algorithm enough data)

As Banko and Brill put it in their 2001 paper on natural language disambiguation: “these results suggest that we may want to reconsider the tradeoff between spending time and money on algorithm development versus spending it on corpus development.”

Nonrepresentative Training Data

It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.

Poor-Quality Data (erroneous or incomplete data)

Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. Coming up with a good set of features to train on is called feature engineering, which involves:
• Feature selection: selecting the most useful features to train on among existing features.
• Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
• Creating new features by gathering new data.
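
As one simple heuristic for feature selection, features can be ranked by the absolute value of their Pearson correlation with the target. This is only a sketch of one possible approach (it only captures linear relationships), and the car data and feature names below are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (norm_x * norm_y)

def rank_features(features, target):
    """Sort features from most to least linearly related to the target."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical car data: mileage in thousands of km, an arbitrary color code, price
features = {
    "mileage": [10, 20, 30, 40, 50],
    "color_code": [3, 1, 2, 1, 3],
}
price = [50, 40, 30, 20, 10]

ranked = rank_features(features, price)
print(ranked)
```

In this made-up dataset, mileage is strongly (negatively) correlated with price while the color code is not, so mileage would be kept first.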

Overfitting the Training Data

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:
• To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
• To gather more training data
• To reduce the noise in the training data (e.g., fix data errors and remove outliers)
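
Constraining a model is usually done through regularization. The sketch below uses an L2 (ridge-style) penalty on the slope of a one-feature linear model; the penalty type, the hyperparameter name `alpha`, and the data are assumptions for illustration, since the notes do not name a specific technique.

```python
def fit_ridge_line(xs, ys, alpha=0.0):
    """Fit y = intercept + slope * x while penalizing alpha * slope**2.
    alpha = 0 gives ordinary least squares; larger alpha flattens the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / (var + alpha)           # the penalty shrinks the slope toward 0
    intercept = mean_y - slope * mean_x   # the intercept is not penalized
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

unconstrained = fit_ridge_line(xs, ys, alpha=0.0)
constrained = fit_ridge_line(xs, ys, alpha=5.0)
print(unconstrained)  # → (0.0, 2.0)
print(constrained)    # → (2.5, 1.0)
```

The amount of regularization (`alpha` here) is a hyperparameter: it is set before training rather than learned from the data. Too large a value makes the model too rigid and leads to underfitting.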

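In Machine Learning, the original dataset is often split into three parts: training data, validation data, and testing data. What is the validation data for? It is used to guard against overfitting: during training, it is typically used to choose hyperparameters. Why not do this directly on the testing data? Because if you did, then as training progressed the network would gradually overfit the testing data, and the final testing accuracy would no longer be a meaningful reference. In short, the training data is used to compute gradients and update the weights, the validation data is used as described above, and the testing data provides an accuracy figure for judging how good the network is.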
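
The three-way split can be sketched as follows. To keep the example short it uses a one-parameter model with a closed-form fit instead of a neural network trained by gradient updates; the split fractions, the candidate hyperparameter values, and the synthetic data are all assumptions made for this illustration.

```python
import random

def split_dataset(data, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle a copy of the data and cut it into train / validation / test."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def fit_slope(train, alpha):
    """Regularized fit of y_hat = w * x; alpha is the hyperparameter."""
    return sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + alpha)

def mse(w, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Synthetic data: y = 3x plus Gaussian noise
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 3.0 * x + rng.gauss(0, 1)))

train, val, test = split_dataset(data)

# Hyperparameters are chosen on the validation set only
candidates = [0.0, 1.0, 10.0, 100.0]
best_alpha = min(candidates, key=lambda a: mse(fit_slope(train, a), val))

# The test set is touched once, at the very end, to estimate generalization
final_w = fit_slope(train, best_alpha)
test_error = mse(final_w, test)
print(best_alpha, final_w, test_error)
```

Because the test set plays no part in choosing `best_alpha`, the final test error remains an honest estimate of performance on new data.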

Underfitting the Training Data

Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data.
The main options to fix this problem are:
• Selecting a more powerful model, with more parameters
• Feeding better features to the learning algorithm (feature engineering)
• Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)