@hanxiaoyang 2017-01-05T08:16:29.000000Z · 16,845 characters · 2,876 views

# An Overview of Machine Learning Algorithms: Application Advice and Problem-Solving Approaches


## 2. A Brief Overview of Machine Learning Algorithms

• Supervised learning algorithms

• Unsupervised learning

• Semi-supervised learning
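To make the first two paradigms concrete, here is a minimal sketch that runs a supervised classifier and an unsupervised clusterer on the same data (the dataset and models are illustrative choices, not prescribed by the text): the classifier is given the labels `y`, while the clusterer never sees them.

```python
# Supervised vs. unsupervised learning on the same data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

clf = LogisticRegression().fit(X, y)                          # supervised: uses y
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # unsupervised: ignores y

print("classifier accuracy:", clf.score(X, y))
print("cluster assignments for first 5 points:", km.labels_[:5])
```

On well-separated blobs both recover essentially the same grouping; the difference is only in what information each algorithm is allowed to use.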

### 2.2 Classification by Algorithm Function

#### 2.2.1 Regression Algorithms

• Ordinary Least Squares Regression (OLSR)
• Linear Regression
• Logistic Regression
• Stepwise Regression
• Locally Estimated Scatterplot Smoothing (LOESS)
• Multivariate Adaptive Regression Splines (MARS)
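A minimal sketch of two entries from this list, on synthetic data of my own choosing: ordinary least squares via sklearn's `LinearRegression`, and `LogisticRegression`, which despite its name is a classification algorithm.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y_reg = 3.0 * X[:, 0] + 0.5 + 0.05 * rng.randn(100)  # noisy line, slope 3
y_clf = (X[:, 0] > 0.5).astype(int)                  # binary labels

ols = LinearRegression().fit(X, y_reg)       # OLS: recovers the slope
logreg = LogisticRegression().fit(X, y_clf)  # logistic: a classifier

print("OLS slope (should be near 3):", ols.coef_[0])
print("logistic accuracy:", logreg.score(X, y_clf))
```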

#### 2.2.2 Instance-based Algorithms

• k-Nearest Neighbour (kNN)
• Learning Vector Quantization (LVQ)
• Self-Organizing Map (SOM)
• Locally Weighted Learning (LWL)

#### 2.2.3 Decision Tree Algorithms

• Classification and Regression Tree (CART)
• Iterative Dichotomiser 3 (ID3)
• C4.5 and C5.0 (different versions of a powerful approach)
• Chi-squared Automatic Interaction Detection (CHAID)
• M5
• Conditional Decision Trees
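As a quick illustration: sklearn's `DecisionTreeClassifier` implements an optimized CART-style tree, and is the easiest of these to try (the dataset and `max_depth` below are my own illustrative choices).

```python
# A CART-style decision tree on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

print("training accuracy:", tree.score(iris.data, iris.target))
print("tree depth:", tree.get_depth())
```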

#### 2.2.4 Bayesian Algorithms

• Naive Bayes
• Gaussian Naive Bayes
• Multinomial Naive Bayes
• Averaged One-Dependence Estimators (AODE)
• Bayesian Belief Network (BBN)
• Bayesian Network (BN)

#### 2.2.5 Clustering Algorithms

• k-Means
• Hierarchical Clustering
• Expectation Maximisation (EM)
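The first two entries are both available in sklearn; a minimal sketch running k-Means and agglomerative (hierarchical) clustering on the same synthetic blobs (the data here is an illustrative assumption):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print("k-Means cluster sizes:      ", np.bincount(km_labels))
print("agglomerative cluster sizes:", np.bincount(agg_labels))
```

On well-separated blobs the two methods find essentially the same partition; they diverge on elongated or nested cluster shapes.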

#### 2.2.6 Association Rule Learning Algorithms

• Apriori algorithm
• Eclat algorithm

#### 2.2.7 Artificial Neural Network Algorithms

• Perceptron
• Back-Propagation
• Radial Basis Function Network (RBFN)

#### 2.2.8 Deep Learning Algorithms

• Deep Boltzmann Machine (DBM)
• Deep Belief Networks (DBN)
• Convolutional Neural Network (CNN)
• Stacked Auto-Encoders

#### 2.2.9 Dimensionality Reduction Algorithms

• Principal Component Analysis (PCA)
• Principal Component Regression (PCR)
• Partial Least Squares Regression (PLSR)
• Sammon Mapping
• Multidimensional Scaling (MDS)
• Linear Discriminant Analysis (LDA)
• Mixture Discriminant Analysis (MDA)
• Flexible Discriminant Analysis (FDA)

#### 2.2.10 Ensemble Algorithms

• Random Forest
• Boosting
• Bootstrapped Aggregation (Bagging)
• Stacked Generalization (blending)
• Gradient Boosted Regression Trees (GBRT)
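A quick sketch of why ensembles earn their place on this list: comparing a single decision tree against Bagging and Random Forest by cross-validation (the dataset and `n_estimators=50` are illustrative assumptions, not tuned values).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = {}
for name, model in [
    ("single tree", DecisionTreeClassifier(random_state=0)),
    ("bagging", BaggingClassifier(n_estimators=50, random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]:
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, scores[name])
```

On most datasets of this shape the ensembles edge out the single tree, because averaging many high-variance trees reduces variance.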

### 2.3 A Machine Learning Algorithm Usage Map

scikit-learn is a rich Python machine learning library that implements the vast majority of machine learning algorithms and has a large user base, so I have shamelessly borrowed the machine learning cheat sheet for sklearn (you can find the original here). Since we are talking about machine learning, let's describe it in machine learning terms: the cheat sheet is a decision tree built over the algorithms implemented in scikit-learn, conditioned on the constraints of real application scenarios. Each combination of conditions corresponds to one path through the tree, leading to some relatively suitable solutions, as follows:
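Since the cheat sheet is itself a decision tree, its routing logic can be sketched as a plain Python function. This is a rough, simplified paraphrase for illustration only; the thresholds and branches below are my own assumptions, not the sklearn map's exact rules:

```python
def suggest_algorithm(n_samples, task, labeled=True, text_data=False):
    """Simplified sketch of cheat-sheet routing: each condition is
    one path down the decision tree described above."""
    if n_samples < 50:
        return "get more data"
    if task == "classification" and labeled:
        if text_data:
            return "Naive Bayes"
        return "LinearSVC" if n_samples < 100_000 else "SGDClassifier"
    if task == "clustering":
        return "KMeans (if number of clusters is known)"
    if task == "regression":
        return "Ridge/Lasso" if n_samples < 100_000 else "SGDRegressor"
    if task == "dimensionality reduction":
        return "PCA (randomized)"
    return "unknown task"

print(suggest_algorithm(1000, "classification"))  # LinearSVC
```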

## 3. Approaches to Solving Machine Learning Problems

• How to understand your data once you have it (visualization)
• How to choose the most appropriate machine learning algorithm
• How to diagnose the model's state (over-/under-fitting) and fix it
• Feature analysis and visualization for large-scale data
• The pros and cons of various loss functions, and how to choose one

### 3.1 Data and Visualization

```python
# numpy: scientific computing toolkit
import numpy as np
# use make_classification to generate 1000 samples, each with 20 features
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)
# store as a DataFrame
from pandas import DataFrame
df = DataFrame(np.hstack((X, y[:, None])), columns=list(range(20)) + ["class"])
```

```python
df[:6]
```

```python
import matplotlib.pyplot as plt
import seaborn as sns
# use pairplot to inspect the pairwise spatial distribution of selected features
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", height=1.5)
plt.show()
```

```python
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 10))
# sns.corrplot was removed from seaborn; a heatmap of the
# correlation matrix serves the same purpose
_ = sns.heatmap(df.corr(), annot=False)
plt.show()
```

### 3.2 Choosing a Machine Learning Algorithm

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import learning_curve  # sklearn.learning_curve in older versions

# plot a learning curve to diagnose the model's state
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=5,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """Plot the learning curve of the given model on the data.

    Parameters
    ----------
    estimator : the classifier you are using
    title : title of the plot
    X : input features, numpy array
    y : input target vector
    ylim : tuple (ymin, ymax) setting the y-axis limits of the plot
    cv : number of cross-validation folds; in each round one fold is held
         out for validation and the remaining folds are used for training
    """
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid(True)
    if ylim:
        plt.ylim(ylim)
    plt.title(title)
    plt.show()

# plot the learning curve in the small-sample regime
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)",
                    X, y, ylim=(0.8, 1.01),
                    train_sizes=np.linspace(.05, 0.2, 5))
```

#### 3.2.1 Diagnosing and Fixing Overfitting

• Increase the sample size

```python
# increase the sample size a bit
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0)",
                    X, y, ylim=(0.8, 1.1),
                    train_sizes=np.linspace(.1, 1.0, 5))
```

• Reduce the number of features (keep only the features we believe are informative)

```python
plot_learning_curve(LinearSVC(C=10.0), "LinearSVC(C=10.0) Features: 11&14",
                    X[:, [11, 14]], y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
```

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
# SelectKBest(f_classif, k=2) keeps the best k=2 features by ANOVA F-value
plot_learning_curve(Pipeline([("fs", SelectKBest(f_classif, k=2)),  # select two features
                              ("svc", LinearSVC(C=10.0))]),
                    "SelectKBest(f_classif, k=2) + LinearSVC(C=10.0)",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
```

• Strengthen regularization (here, by decreasing the C parameter of LinearSVC)
In my view, regularization is the most effective way to mitigate overfitting without discarding information.

```python
plot_learning_curve(LinearSVC(C=0.1), "LinearSVC(C=0.1)",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
```

```python
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
estm = GridSearchCV(LinearSVC(),
                    param_grid={"C": [0.001, 0.01, 0.1, 1.0, 10.0]})
plot_learning_curve(estm, "LinearSVC(C=AUTO)",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
print("Chosen parameter on 500 datapoints: %s" % estm.fit(X[:500], y[:500]).best_params_)
```

• l2 regularization tends to spread the weight across all feature dimensions, discouraging any single dimension from receiving a disproportionately large weight.
• l1 regularization tends to sparsify the feature weights: features with little influence on the result end up with no weight at all (exactly zero).

```python
plot_learning_curve(LinearSVC(C=0.1, penalty='l1', dual=False),
                    "LinearSVC(C=0.1, penalty='l1')",
                    X, y, ylim=(0.8, 1.0),
                    train_sizes=np.linspace(.05, 0.2, 5))
```

```python
estm = LinearSVC(C=0.1, penalty='l1', dual=False)
estm.fit(X[:450], y[:450])  # train on 450 points
print("Coefficients learned: %s" % estm.coef_)
print("Non-zero coefficients: %s" % np.nonzero(estm.coef_)[1])
```

```
Coefficients learned: [[ 0.          0.          0.          0.          0.          0.01857999
   0.          0.          0.          0.004135    0.          1.05241369
   0.01971419  0.          0.          0.          0.         -0.05665314
   0.14106505  0.        ]]
Non-zero coefficients: [5 9 11 12 17 18]
```

#### 3.2.2 Diagnosing and Fixing Underfitting

```python
# construct a ring-shaped (circles) dataset
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=1000, random_state=2)
# plot the learning curve
plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25)",
                    X, y, ylim=(0.5, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))
```

```python
df = DataFrame(np.hstack((X, y[:, None])), columns=list(range(2)) + ["class"])
_ = sns.pairplot(df, vars=[0, 1], hue="class", height=3.5)
```

• Engineer your features (find more informative ones!)
For example, after inspecting the current data distribution, we first apply a mapping to the data:

```python
# add the sum of squares of the original features as a new "distance" feature
X_extra = np.hstack((X, X[:, [0]]**2 + X[:, [1]]**2))
plot_learning_curve(LinearSVC(C=0.25), "LinearSVC(C=0.25) + distance feature",
                    X_extra, y, ylim=(0.5, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))
```

• Use a slightly more complex model (for example, a non-linear kernel function)
We tweak the model a little, switching to a more complex non-linear RBF kernel:

```python
from sklearn.svm import SVC
# note: we use the original X without the extra feature
plot_learning_curve(SVC(C=2.5, kernel="rbf", gamma=1.0),
                    "SVC(C=2.5, kernel='rbf', gamma=1.0)",
                    X, y, ylim=(0.5, 1.0),
                    train_sizes=np.linspace(.1, 1.0, 5))
```

### 3.3 Large Datasets and High-Dimensional Feature Spaces

#### 3.3.1 Model Selection and Learning Curves on Large Data

SGDClassifier trains on only a portion of the data (a mini-batch) at a time. In this setting, ordinary cross-validation is not a great fit, so we use the corresponding progressive validation instead. Briefly: before training on each incoming batch, the estimator is scored on that batch (treating it as held-out data), and after training on it, the estimator is scored on the same batch again, to see whether it improved.

```python
# generate a large, high-dimensional dataset
X, y = make_classification(200000, n_features=200, n_informative=25,
                           n_redundant=0, n_classes=10, class_sep=2,
                           random_state=0)
# train with SGDClassifier and plot each batch's score before and after training on it
from sklearn.linear_model import SGDClassifier
est = SGDClassifier(penalty="l2", alpha=0.001)
progressive_validation_score = []
train_score = []
for datapoint in range(0, 199000, 1000):
    X_batch = X[datapoint:datapoint + 1000]
    y_batch = y[datapoint:datapoint + 1000]
    if datapoint > 0:
        progressive_validation_score.append(est.score(X_batch, y_batch))
    est.partial_fit(X_batch, y_batch, classes=list(range(10)))
    if datapoint > 0:
        train_score.append(est.score(X_batch, y_batch))
plt.plot(train_score, label="train score")
plt.plot(progressive_validation_score, label="progressive validation score")
plt.xlabel("Mini-batch")
plt.ylabel("Score")
plt.legend(loc='best')
plt.show()
```

#### 3.3.2 Visualization with Large Data

```python
# load the dataset directly from sklearn
from sklearn.datasets import load_digits
digits = load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
print("Dataset consists of %d samples with %d features each" % (n_samples, n_features))

# plot a gallery of the digit images
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
_ = plt.title('A selection from the 8*8=64-dimensional digits dataset')
plt.show()
```

```python
# import the required packages
import time
from matplotlib import offsetbox
from sklearn import manifold, decomposition, random_projection

rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)

# define the plotting helper
def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    plt.figure(figsize=(10, 10))
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(digits.target[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 12})
    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(digits.data.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

# record the start time
start_time = time.time()
X_projected = rp.fit_transform(X)
plot_embedding(X_projected,
               "Random Projection of the digits (time: %.3fs)" % (time.time() - start_time))
plt.show()
```

Dimensionality reduction with PCA:

```python
from sklearn import decomposition
# TruncatedSVD is an implementation of PCA (applied without mean-centering)
# record the start time before fitting
start_time = time.time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
               "Principal Components projection of the digits (time: %.3fs)" % (time.time() - start_time))
plt.show()
```

```python
from sklearn import manifold
# reduce to two dimensions with t-SNE
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
start_time = time.time()
X_tsne = tsne.fit_transform(X)
# plot the embedding
plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time: %.3fs)" % (time.time() - start_time))
plt.show()
```

### 3.4 Choosing a Loss Function

```python
import numpy as np
import matplotlib.pyplot as plt
# adapted from http://scikit-learn.org/stable/auto_examples/linear_model/plot_sgd_loss_functions.html
xmin, xmax = -4, 4
xx = np.linspace(xmin, xmax, 100)
plt.plot([xmin, 0, 0, xmax], [1, 1, 0, 0], 'k-',
         label="Zero-one loss")
plt.plot(xx, np.where(xx < 1, 1 - xx, 0), 'g-',
         label="Hinge loss")
plt.plot(xx, np.log2(1 + np.exp(-xx)), 'r-',
         label="Log loss")
plt.plot(xx, np.exp(-xx), 'c-',
         label="Exponential loss")
plt.plot(xx, -np.minimum(xx, 0), 'm-',
         label="Perceptron loss")
plt.ylim((0, 8))
plt.legend(loc="upper right")
plt.xlabel(r"Decision function $f(x)$")
plt.ylabel("$L(y, f(x))$")
plt.show()
```

• The zero-one loss is easy to understand: it directly counts the number of misclassified examples. The awkward part is that it is non-convex, which means it is not very practical to optimize directly.
• Hinge loss (used in SVMs) is relatively robust (insensitive to outliers/noise), but it does not have as good a probabilistic interpretation.
• Log loss (log-loss) produces outputs that characterize a probability distribution very well. It is therefore a good fit in many scenarios, especially multi-class ones, where we need a confidence for the result belonging to each class. Its drawback is weaker robustness: compared with hinge loss, it is somewhat more sensitive to noise.
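The trade-off described above can be computed directly with `sklearn.metrics` (the toy labels, probabilities, and decision values below are illustrative assumptions): log loss is evaluated on predicted probabilities and penalizes over-confident mistakes, while hinge loss depends only on the signed margin of the decision function.

```python
from sklearn.metrics import log_loss, hinge_loss

y_true = [1, 1, 0, 0]
# predicted probabilities for class 1 (used by log loss)
proba = [0.9, 0.6, 0.3, 0.1]
# signed decision values (used by hinge loss; positive means class 1)
decision = [2.0, 0.4, -0.6, -2.2]

print("log loss:  ", log_loss(y_true, proba))
print("hinge loss:", hinge_loss(y_true, decision))
```

Increasing a wrong prediction's confidence (e.g. pushing a probability toward 1 for a true-0 example) blows up the log loss, while the hinge loss grows only linearly in the violated margin.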