@spiritnotes 2016-03-06

PyCon 2015 Scikit-learn Tutorial by Jake VanderPlas

sklearn


An Introduction to scikit-learn: Machine Learning in Python

Instructor: Jake VanderPlas
- email: jakevdp@uw.edu
- twitter: @jakevdp
- github: jakevdp

01-Preliminaries

Confirm that the required software is correctly installed.

02.1-Machine-Learning-Intro

What is Machine Learning?

SGD classification

    from sklearn.datasets.samples_generator import make_blobs
    from sklearn.linear_model import SGDClassifier
    # generate two separable blobs and fit a linear SGD classifier
    X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)
    clf = SGDClassifier(loss="hinge", alpha=0.01, n_iter=200, fit_intercept=True)
    clf.fit(X, y)

Representation of Data in Scikit-learn

Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, or in some cases scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features].
- n_samples: the number of samples. Each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or anything you can describe with a fixed set of quantitative traits.
- n_features: the number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases.

A Simple Example: the Iris Dataset

    from sklearn.datasets import load_iris
    iris = load_iris()
    print(iris.data.shape)    # (150, 4): [n_samples, n_features]
    print(iris.target.shape)  # (150,): one label per sample

Visualization

    import numpy as np
    import matplotlib.pyplot as plt

    x_index = 0
    y_index = 1

    # this formatter will label the colorbar with the correct target names
    formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

    plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
                c=iris.target, cmap=plt.cm.get_cmap('RdYlBu', 3))
    plt.colorbar(ticks=[0, 1, 2], format=formatter)
    plt.clim(-0.5, 2.5)
    plt.xlabel(iris.feature_names[x_index])
    plt.ylabel(iris.feature_names[y_index]);

Other Available Data
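
scikit-learn ships several families of example data: small packaged datasets via datasets.load_* (e.g. load_iris, load_digits), larger downloadable datasets via datasets.fetch_*, and synthetic data generators via datasets.make_* (e.g. make_blobs). A quick way to list them (a small sketch, not from the original notes):

    from sklearn import datasets
    # names of the packaged loaders, downloaders, and generators
    print([name for name in dir(datasets)
           if name.startswith(('load_', 'fetch_', 'make_'))])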

02.2-Basic-Principles

The Scikit-learn Estimator Object

Estimator:

    from sklearn.linear_model import LinearRegression

Estimator parameters: all the parameters of an estimator can be set when it is instantiated, and they have suitable default values.

    model = LinearRegression(normalize=True)
    print(model)
    print(model.normalize)

Estimated model parameters: once the model is fit, parameters estimated from the data are exposed with a trailing underscore, e.g. model.residues_.
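
A minimal end-to-end sketch of the fit workflow (toy data, not from the original notebook):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    x = np.arange(10)
    y = 2 * x + 1
    X = x[:, np.newaxis]   # samples must be a 2D [n_samples, n_features] array

    model = LinearRegression()
    model.fit(X, y)
    print(model.coef_)      # ~[2.]
    print(model.intercept_) # ~1.0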

Supervised Learning: Classification and Regression

    # classification: k-nearest neighbors
    from sklearn import neighbors
    knn = neighbors.KNeighborsClassifier(n_neighbors=5)

    # regression: predict on a dense grid with a fitted model
    X_fit = np.linspace(0, 1, 100)[:, np.newaxis]
    y_fit = model.predict(X_fit)

    # a random forest regressor as an alternative regression model
    from sklearn.ensemble import RandomForestRegressor
    model = RandomForestRegressor()

Unsupervised Learning: Dimensionality Reduction and Clustering

    # dimensionality reduction: PCA (X, y are the iris data and target from above)
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(X)
    X_reduced = pca.transform(X)
    plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='RdYlBu')

    # each principal component is a linear combination of the original features
    for component in pca.components_:
        print(" + ".join("%.3f x %s" % (value, name)
                         for value, name in zip(component, iris.feature_names)))

    # clustering: KMeans
    from sklearn.cluster import KMeans
    k_means = KMeans(n_clusters=3, random_state=0)  # fixing the RNG in kmeans for reproducibility
    k_means.fit(X)
    y_pred = k_means.predict(X)
    plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred, cmap='RdYlBu');

Recap: Scikit-learn's estimator interface

Scikit-learn strives to have a uniform interface across all methods, and we'll see examples of these below. Given a scikit-learn estimator object named model, the following methods are available:
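
- model.fit(X, y) (supervised) or model.fit(X) (unsupervised): fit the model to the training data
- model.predict(X_new): predict labels or values for new data
- model.predict_proba(X_new): class probabilities, for classifiers that support it
- model.transform(X_new): map data into the model's learned representation (e.g. PCA)
- model.fit_transform(X): fit and transform in one step
- model.score(X, y): a goodness-of-fit or accuracy measure (higher is better)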

Model Validation

The goal is to go from fitting the training data to generalizing to unseen data.

    # confusion matrix
    from sklearn.metrics import confusion_matrix
    print(confusion_matrix(y, y_pred))

    # train/test split
    from sklearn.cross_validation import train_test_split
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

Quick Application: Optical Character Recognition

    # load the data
    from sklearn import datasets
    digits = datasets.load_digits()

    # visualize the first 100 digits
    fig, axes = plt.subplots(10, 10, figsize=(8, 8))
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    for i, ax in enumerate(axes.flat):
        ax.imshow(digits.images[i], cmap='binary')
        ax.text(0.05, 0.05, str(digits.target[i]),
                transform=ax.transAxes, color='green')
        ax.set_xticks([])
        ax.set_yticks([])

Unsupervised Learning: Dimensionality Reduction

    # project the 64-dimensional digits down to 2 dimensions with Isomap
    from sklearn.manifold import Isomap
    iso = Isomap(n_components=2)
    data_projected = iso.fit_transform(digits.data)

    plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
                edgecolor='none', alpha=0.5, cmap=plt.cm.get_cmap('nipy_spectral', 10));
    plt.colorbar(label='digit label', ticks=range(10))
    plt.clim(-0.5, 9.5)

Classification on Digits

Using logistic regression.
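
A minimal sketch of what that looks like (the exact split and solver settings in the original notebook may differ):

    from sklearn.cross_validation import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # split the digits data, fit a logistic regression, and score the held-out set
    Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=2)
    clf = LogisticRegression(penalty='l2')
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    print(accuracy_score(ytest, ypred))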

03.1-Classification-SVMs

Support Vector Machines: Maximizing the Margin

    from sklearn.svm import SVC  # "Support Vector Classifier"

    # linear kernel: maximum-margin separating hyperplane
    clf = SVC(kernel='linear')
    clf.fit(X, y)

    # RBF (radial basis function) kernel for non-linear boundaries
    clf = SVC(kernel='rbf')
    clf.fit(X, y)

03.2-Regression-Forests

Motivating Random Forests: Decision Trees


    # generate data
    from sklearn.datasets import make_blobs
    X, y = make_blobs(n_samples=300, centers=4,
                      random_state=0, cluster_std=1.0)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');

Decision Trees and over-fitting

Ensembles of Estimators: Random Forests
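
A single decision tree over-fits the fine structure of the training data, while an ensemble of randomized trees averages that noise away. A minimal sketch on the blobs generated above (estimator counts are illustrative, not the notebook's exact values):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    # a single deep tree tends to memorize noise in the training data
    tree = DecisionTreeClassifier().fit(X, y)

    # averaging many randomized trees smooths out the over-fitting
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(tree.score(X, y), forest.score(X, y))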

04.1-Dimensionality-PCA

Dimensionality Reduction: Principal Component Analysis in-depth

    # generate data
    np.random.seed(1)
    X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T
    plt.plot(X[:, 0], X[:, 1], 'o')
    plt.axis('equal');

    # fit
    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)
    pca.fit(X)
    print(pca.explained_variance_)
    print(pca.components_)

    # visualize the principal axes
    plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)
    for length, vector in zip(pca.explained_variance_, pca.components_):
        v = vector * 3 * np.sqrt(length)
        plt.plot([0, v[0]], [0, v[1]], '-k', lw=3)
    plt.axis('equal');

    # dimensionality reduction: keep only enough components to explain 95% of the variance
    clf = PCA(0.95)  # keep 95% of variance
    X_trans = clf.fit_transform(X)
    print(X.shape)
    print(X_trans.shape)
    X_new = clf.inverse_transform(X_trans)
    plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.2)
    plt.plot(X_new[:, 0], X_new[:, 1], 'ob', alpha=0.8)
    plt.axis('equal');

04.2-Clustering-KMeans

    # generate data
    from sklearn.datasets.samples_generator import make_blobs
    X, y = make_blobs(n_samples=300, centers=4,
                      random_state=0, cluster_std=0.60)
    plt.scatter(X[:, 0], X[:, 1], s=50);

    # cluster
    from sklearn.cluster import KMeans
    est = KMeans(4)  # 4 clusters
    est.fit(X)
    y_kmeans = est.predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');

    # interactive visualization (helper from the tutorial repository)
    from fig_code import plot_kmeans_interactive
    plot_kmeans_interactive();

Example: KMeans for Color Compression

    # the "china" sample image ships with scikit-learn
    from sklearn.datasets import load_sample_image
    from sklearn.cluster import KMeans
    import seaborn as sns
    china = load_sample_image("china.jpg")

    # reduce the size of the image for speed
    image = china[::3, ::3]
    n_colors = 64

    # treat each pixel as a point in 3D RGB space and cluster into n_colors groups
    X = (image / 255.0).reshape(-1, 3)
    model = KMeans(n_colors)
    labels = model.fit_predict(X)

    # replace each pixel with its cluster center
    colors = model.cluster_centers_
    new_image = colors[labels].reshape(image.shape)
    new_image = (255 * new_image).astype(np.uint8)

    # create and plot the new image
    with sns.axes_style('white'):
        plt.figure()
        plt.imshow(image)
        plt.title('input')

        plt.figure()
        plt.imshow(new_image)
        plt.title('{0} colors'.format(n_colors))

04.3-Density-GMM

Here we'll explore Gaussian Mixture Models (GMMs), an unsupervised clustering and density-estimation technique.

    # generate a 1D mixture of three Gaussians
    np.random.seed(2)
    x = np.concatenate([np.random.normal(0, 2, 2000),
                        np.random.normal(5, 5, 2000),
                        np.random.normal(3, 0.5, 600)])
    plt.hist(x, 80, normed=True)
    plt.xlim(-10, 20);

    # fit a 4-component GMM and overlay its density on the histogram
    from sklearn.mixture import GMM
    clf = GMM(4, n_iter=500, random_state=3).fit(x)
    xpdf = np.linspace(-10, 20, 1000)
    density = np.exp(clf.score(xpdf))
    plt.hist(x, 80, normed=True, alpha=0.5)
    plt.plot(xpdf, density, '-r')
    plt.xlim(-10, 20);

    # the density is fit as a mixture of Gaussians, which we can examine
    # via the means_, covars_, and weights_ attributes
    clf.means_
    clf.covars_
    clf.weights_

    # plot each weighted Gaussian component separately
    from scipy import stats
    plt.hist(x, 80, normed=True, alpha=0.3)
    plt.plot(xpdf, density, '-r')
    for i in range(clf.n_components):
        pdf = clf.weights_[i] * stats.norm(clf.means_[i, 0],
                                           np.sqrt(clf.covars_[i, 0])).pdf(xpdf)
        plt.fill(xpdf, pdf, facecolor='gray',
                 edgecolor='none', alpha=0.3)
    plt.xlim(-10, 20);

How many Gaussians?

Given a model, we can use one of several means to evaluate how well it fits the data. For example, there are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

    print(clf.bic(x))
    print(clf.aic(x))

    # fit GMMs with 1-9 components and compare their information criteria
    n_estimators = np.arange(1, 10)
    clfs = [GMM(n, n_iter=1000).fit(x) for n in n_estimators]
    bics = [clf.bic(x) for clf in clfs]
    aics = [clf.aic(x) for clf in clfs]

    plt.plot(n_estimators, bics, label='BIC')
    plt.plot(n_estimators, aics, label='AIC')
    plt.legend();

Example: GMM For Outlier Detection

GMM is what's known as a Generative Model: it's a probabilistic model from which a dataset can be generated.
One thing that generative models can be useful for is outlier detection: we can simply evaluate the likelihood of each point under the generative model; the points with a suitably low likelihood (where "suitable" is up to your own bias/variance preference) can be labeled outliers.
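
The snippet below assumes y is the 1D data with a few injected outliers, true_outliers holds their indices, and clf has been re-fit on y; a minimal sketch of that setup (the sizes and seeds are illustrative, not necessarily the notebook's exact values):

    # inject 20 large perturbations into a copy of the data and refit the GMM
    np.random.seed(0)
    true_outliers = np.sort(np.random.randint(0, len(x), 20))
    y = x.copy()
    y[true_outliers] += 50 * np.random.randn(20)
    clf = GMM(4, n_iter=500, random_state=0).fit(y)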

    # score each point under the fitted model; low log-likelihood marks a candidate outlier
    # (in this sklearn version, score_samples returns (logprob, responsibilities))
    log_likelihood = clf.score_samples(y)[0]
    plt.plot(y, log_likelihood, '.k');

    detected_outliers = np.where(log_likelihood < -9)[0]
    print("true outliers:")
    print(true_outliers)
    print("\ndetected outliers:")
    print(detected_outliers)

    # outliers that were missed
    set(true_outliers) - set(detected_outliers)

Other Density Estimators

    from sklearn.neighbors import KernelDensity

    # kernel density estimate with bandwidth 0.15 (expects a 2D array)
    kde = KernelDensity(0.15).fit(x[:, None])
    density_kde = np.exp(kde.score_samples(xpdf[:, None]))

    plt.hist(x, 80, normed=True, alpha=0.5)
    plt.plot(xpdf, density, '-b', label='GMM')
    plt.plot(xpdf, density_kde, '-r', label='KDE')
    plt.xlim(-10, 20)
    plt.legend();

05-Validation

Validation Sets

    from sklearn.cross_validation import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # two equivalent ways to measure held-out accuracy (y_pred = knn.predict(X_test))
    from sklearn.metrics import accuracy_score
    knn.score(X_test, y_test)
    accuracy_score(y_test, y_pred)

Cross-Validation

    from sklearn.cross_validation import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    # 10-fold cross-validated accuracy of a 1-nearest-neighbor classifier
    cv = cross_val_score(KNeighborsClassifier(1), X, y, cv=10)
    cv.mean()

Detecting Over-fitting with Validation Curves
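
The snippets below use a PolynomialRegression helper defined earlier in the notebook; a minimal reconstruction of it (a pipeline of PolynomialFeatures followed by LinearRegression, inferred from the parameter name used below) is:

    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import make_pipeline

    def PolynomialRegression(degree=2, **kwargs):
        # the step is named 'polynomialfeatures', matching 'polynomialfeatures__degree' below
        return make_pipeline(PolynomialFeatures(degree),
                             LinearRegression(**kwargs))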

    from sklearn.learning_curve import validation_curve

    def rms_error(model, X, y):
        y_pred = model.predict(X)
        return np.sqrt(np.mean((y - y_pred) ** 2))

    # sweep the polynomial degree and record training/validation error
    degree = np.arange(0, 18)
    val_train, val_test = validation_curve(PolynomialRegression(), X, y,
                                           'polynomialfeatures__degree', degree, cv=7,
                                           scoring=rms_error)

    def plot_with_err(x, data, **kwargs):
        mu, std = data.mean(1), data.std(1)
        lines = plt.plot(x, mu, '-', **kwargs)
        plt.fill_between(x, mu - std, mu + std, edgecolor='none',
                         facecolor=lines[0].get_color(), alpha=0.2)

    plot_with_err(degree, val_train, label='training scores')
    plot_with_err(degree, val_test, label='validation scores')
    plt.xlabel('degree'); plt.ylabel('rms error')
    plt.legend();

Detecting Data Sufficiency with Learning Curves

    from sklearn.learning_curve import learning_curve

    def plot_learning_curve(degree=3):
        # train on an increasing fraction of the data and track both errors
        train_sizes = np.linspace(0.05, 1, 20)
        N_train, val_train, val_test = learning_curve(PolynomialRegression(degree),
                                                      X, y, train_sizes, cv=5,
                                                      scoring=rms_error)
        plot_with_err(N_train, val_train, label='training scores')
        plot_with_err(N_train, val_test, label='validation scores')
        plt.xlabel('Training Set Size'); plt.ylabel('rms error')
        plt.ylim(0, 3)
        plt.xlim(5, 80)
        plt.legend()

The y-axis shows that different models level off at different RMS errors; for each model (i.e. each degree setting) there is a plateau beyond which adding more training samples no longer reduces the RMS error.

Summary

We've gone over several useful tools for model validation.
