@spiritnotes · 2016-03-07

Machine Learning in Practice -- the digits Dataset



Getting the data

    from sklearn import datasets

    digits = datasets.load_digits()
    data, target, images = digits.data, digits.target, digits.images

The dataset contains handwritten digits from 0 through 9. It has 1797 samples in total, each described by 64 features: the 8×8 pixel grid of the handwritten digit, with each feature taking an intensity value from 0 to 16.

    In [20]: data.shape, target.shape, data.min(), data.max()
    Out[20]: ((1797, 64), (1797,), 0.0, 16.0)
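
The ten classes are roughly balanced; a quick count confirms this (a minimal check, not in the original post):

    import numpy as np

    print(np.bincount(target))  # roughly 180 samples per digit class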

The dataset also provides an images attribute, which holds the same values as data but reshaped into 8×8 two-dimensional arrays. A single image can be displayed like this:

    import matplotlib.pyplot as plt

    plt.imshow(images[0], cmap='binary')

[Figure 0.png: the first sample rendered as an 8×8 image]
The first 100 samples, with their labels, look like this:

    fig, axes = plt.subplots(10, 10, figsize=(8, 8))
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    for i, ax in enumerate(axes.flat):
        ax.imshow(images[i], cmap='binary')
        ax.text(0.05, 0.05, str(target[i]), transform=ax.transAxes, color='green')
        ax.set_xticks([])  # clear the axis ticks
        ax.set_yticks([])

[Figure 100.png: the first 100 samples with their labels]

Analysis

This is a classic classification problem. The dataset has 64 features, each of which is merely the intensity at one pixel position with no special physical meaning, so analyzing the data directly in 64 dimensions is impractical. We can start with dimensionality reduction.

PCA dimensionality reduction

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    data_pca = pca.fit_transform(data)
    plt.scatter(data_pca[:, 0], data_pca[:, 1], c=target, edgecolor='none',
                alpha=0.5, cmap=plt.cm.get_cmap('nipy_spectral', 10))
    plt.colorbar();

[Figure pca.png: 2-D PCA projection colored by digit]
The plot shows that after projection most points form reasonably tight clusters and the classes are largely distinguishable. Where points from different classes overlap, the cause is information lost in the projection: reconstructing the point (-10, -20), for instance, yields the image below, which sits on the boundary between the 0 and 6 clusters.

    data_ = pca.inverse_transform(np.array([[-10, -20]]))
    plt.imshow(data_[0].reshape((8, 8)), cmap='binary')

[Figure 06.png: reconstruction of the point (-10, -20)]
Reconstructing the first 100 images from their two-dimensional projections gives the following:

    data_repca = pca.inverse_transform(data_pca)
    images_repca = data_repca.copy()
    images_repca.shape = (1797, 8, 8)
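
The grid itself can be rendered with the same loop used for the raw images earlier (a sketch):

    fig, axes = plt.subplots(10, 10, figsize=(8, 8))
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    for i, ax in enumerate(axes.flat):
        ax.imshow(images_repca[i], cmap='binary')
        ax.set_xticks([])
        ax.set_yticks([])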

[Figure repaca.png: PCA reconstructions of the first 100 samples]
Clearly a lot of detail has been lost. The cumulative explained-variance curve for PCA is plotted below; two components retain only 28.5% of the variance, so the reconstruction fidelity is low.

    import seaborn as sb

    sb.set()
    pca_ = PCA().fit(data)
    plt.plot(np.cumsum(pca_.explained_variance_ratio_))
    plt.xlabel('number of components')
    plt.ylabel('cumulative explained variance');
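
The curve can also be queried numerically, e.g. for the component count needed to retain a given variance fraction; a sketch (the 90% target is an arbitrary illustration, not from the original post):

    cum = np.cumsum(pca_.explained_variance_ratio_)
    print(cum[1])                      # 2-component ratio, about 0.285
    print(np.argmax(cum >= 0.90) + 1)  # smallest component count reaching 90%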

IsoMap

    from sklearn.manifold import Isomap

    iso = Isomap(n_components=2)
    data_projected = iso.fit_transform(data)
    plt.scatter(data_projected[:, 0], data_projected[:, 1], c=target,
                edgecolor='none', alpha=0.5,
                cmap=plt.cm.get_cmap('nipy_spectral', 10))
    plt.colorbar(label='digit label', ticks=range(10))
    plt.clim(-0.5, 9.5)

[Figure iso.png: 2-D Isomap projection colored by digit]
The Isomap projection separates the digits more cleanly, and the regions that remain entangled correspond to genuinely similar shapes, such as 2 and 7, or 3 and 9.

Model selection

Judging from the projections, a linear classifier should suffice for this problem. Candidate models include KNN, logistic regression, SVM, decision trees, and random forests.
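
Each candidate below is tuned with GridSearchCV under 10-fold cross-validation. The same protocol, minus the grid, can be wrapped in a small helper for quick untuned baselines; a minimal sketch (cv_score is a hypothetical name, and sklearn.cross_validation is the legacy module matching the scikit-learn version used in this post):

    from sklearn.cross_validation import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def cv_score(clf):
        """Mean and std of 10-fold CV accuracy on the digits data."""
        scores = cross_val_score(clf, data, target, cv=10)
        return scores.mean(), scores.std()

    print(cv_score(KNeighborsClassifier()))  # untuned KNN baseline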

KNN

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer scikit-learn

    clf = KNeighborsClassifier()
    n_neighbors = [1, 2, 3, 5, 8, 10, 15, 20, 25, 30, 35, 40]
    weights = ['uniform', 'distance']
    param_grid = [{'n_neighbors': n_neighbors, 'weights': weights}]
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=10)
    grid_search.fit(data, target)

The results:

    In [56]: grid_search.best_score_, grid_search.best_estimator_, grid_search.best_params_
    Out[56]:
    (0.97829716193656091,
     KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_jobs=1, n_neighbors=3, p=2,
               weights='distance'),
     {'n_neighbors': 3, 'weights': 'distance'})
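
How the score varies with the number of neighbors can be read back from grid_scores_; a sketch for the 'distance' weighting (this assumes the legacy sklearn.grid_search API used above, whose entries follow the parameter grid order; newer releases expose cv_results_ instead):

    scores = [s.mean_validation_score for s in grid_search.grid_scores_
              if s.parameters['weights'] == 'distance']
    plt.plot(n_neighbors, scores, marker='o')
    plt.xlabel('n_neighbors')
    plt.ylabel('mean CV accuracy')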

Logistic regression

    from sklearn.grid_search import GridSearchCV
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression(penalty='l2')
    C = [0.1, 0.5, 1, 5, 10]
    param_grid = [{'C': C}]
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=10)
    grid_search.fit(data, target)

The results:

    In [63]: grid_search.best_score_, grid_search.best_estimator_, grid_search.best_params_
    Out[63]:
    (0.9360044518642181,
     LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
               intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
               penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
               verbose=0, warm_start=False),
     {'C': 0.1})

SVM

    from sklearn.grid_search import GridSearchCV
    from sklearn.svm import SVC

    clf = SVC()
    C = [0.1, 0.5, 1, 5, 10]
    kernel = ['linear', 'poly', 'rbf']
    param_grid = [{'C': C, 'kernel': kernel}]
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=10)
    grid_search.fit(data, target)

The results:

    In [73]: grid_search.best_score_, grid_search.best_estimator_, grid_search.best_params_
    Out[73]:
    (0.97885364496382865,
     SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
         decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
         max_iter=-1, probability=False, random_state=None, shrinking=True,
         tol=0.001, verbose=False),
     {'C': 0.1, 'kernel': 'poly'})

    In [74]: grid_search.grid_scores_
    Out[74]:
    [mean: 0.96105, std: 0.02191, params: {'kernel': 'linear', 'C': 0.1},
     mean: 0.97885, std: 0.01931, params: {'kernel': 'poly', 'C': 0.1},
     mean: 0.10184, std: 0.00153, params: {'kernel': 'rbf', 'C': 0.1},
     mean: 0.96105, std: 0.02191, params: {'kernel': 'linear', 'C': 0.5},
     mean: 0.97885, std: 0.01931, params: {'kernel': 'poly', 'C': 0.5},
     mean: 0.11130, std: 0.00709, params: {'kernel': 'rbf', 'C': 0.5},
     mean: 0.96105, std: 0.02191, params: {'kernel': 'linear', 'C': 1},
     mean: 0.97885, std: 0.01931, params: {'kernel': 'poly', 'C': 1},
     mean: 0.48692, std: 0.06936, params: {'kernel': 'rbf', 'C': 1},
     mean: 0.96105, std: 0.02191, params: {'kernel': 'linear', 'C': 5},
     mean: 0.97885, std: 0.01931, params: {'kernel': 'poly', 'C': 5},
     mean: 0.52031, std: 0.06315, params: {'kernel': 'rbf', 'C': 5},
     mean: 0.96105, std: 0.02191, params: {'kernel': 'linear', 'C': 10},
     mean: 0.97885, std: 0.01931, params: {'kernel': 'poly', 'C': 10},
     mean: 0.52031, std: 0.06315, params: {'kernel': 'rbf', 'C': 10}]
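
The near-chance rbf scores are a plausible symptom of unscaled features: with gamma='auto', the kernel operates on raw 0-16 pixel values. A hedged extra check, not part of the original run, is to rescale the pixels to [0, 1] and re-score the rbf kernel:

    from sklearn.cross_validation import cross_val_score
    from sklearn.svm import SVC

    # rbf on rescaled pixels; C=5 is an arbitrary choice for illustration
    scores = cross_val_score(SVC(kernel='rbf', C=5), data / 16.0, target, cv=10)
    print(scores.mean(), scores.std())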

Decision tree

    from sklearn.grid_search import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier()
    criterion = ['gini', 'entropy']
    max_depth = [10, 15, 20, 30, None]
    min_samples_split = [2, 3, 5, 8, 10]
    min_samples_leaf = [1, 2, 3, 5, 8]
    param_grid = [{'criterion': criterion, 'max_depth': max_depth,
                   'min_samples_split': min_samples_split,
                   'min_samples_leaf': min_samples_leaf}]
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=10)
    grid_search.fit(data, target)

The results:

    (0.83750695603784087,
     DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
               max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
               min_samples_split=2, min_weight_fraction_leaf=0.0,
               presort=False, random_state=None, splitter='best'),
     {'criterion': 'entropy',
      'max_depth': 10,
      'min_samples_leaf': 1,
      'min_samples_split': 2})

Random forest

    from sklearn.grid_search import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(random_state=0)
    n_estimators = [10, 20, 35, 50, 80, 100, 120, 150, 200]
    param_grid = [{'n_estimators': n_estimators}]
    grid_search = GridSearchCV(clf, param_grid=param_grid, cv=10)
    grid_search.fit(data, target)

The result:

    (0.95325542570951582,
     RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=120, n_jobs=1,
               oob_score=False, random_state=0, verbose=0, warm_start=False),
     {'n_estimators': 120})
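
A side benefit of the random forest is a per-pixel importance map: reshaping feature_importances_ of the refit best estimator back to 8×8 shows which pixels carry the signal (a sketch):

    # grid_search refits the best model on the full data by default
    best_rf = grid_search.best_estimator_
    plt.imshow(best_rf.feature_importances_.reshape(8, 8), cmap='viridis')
    plt.colorbar(label='feature importance')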

Summary

| Classifier             | score | std   |
| ---------------------- | ----- | ----- |
| KNN                    | 0.978 | 0.016 |
| LogisticRegression     | 0.936 | 0.032 |
| SVC                    | 0.979 | 0.019 |
| DecisionTreeClassifier | 0.837 |       |
| RandomForestClassifier | 0.953 | 0.019 |

Result analysis

Taking the SVC as an example, let us examine what the misclassified samples look like.

    from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer scikit-learn

    svc_ = SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0,
               decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
               max_iter=-1, probability=False, random_state=None, shrinking=True,
               tol=0.001, verbose=False)
    Xtrain, Xtest, ytrain, ytest = train_test_split(data, target, test_size=0.2,
                                                    random_state=2)
    svc_.fit(Xtrain, ytrain)
    svc_.score(Xtest, ytest)  ## 0.97499999999999998
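
Beyond the single accuracy figure, per-class precision and recall can be inspected with classification_report (a minimal sketch):

    from sklearn.metrics import classification_report

    print(classification_report(ytest, svc_.predict(Xtest)))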

Inspecting the confusion matrix:

    from sklearn.metrics import confusion_matrix

    print(confusion_matrix(ytest, svc_.predict(Xtest)))
    [[31  0  0  0  1  0  0  0  0  0]
     [ 0 44  0  0  0  0  0  0  0  0]
     [ 0  0 31  0  0  0  0  0  0  0]
     [ 0  0  0 35  0  0  0  0  1  0]
     [ 0  0  0  0 32  0  0  0  3  0]
     [ 0  0  0  0  0 42  0  0  0  1]
     [ 0  0  0  0  0  0 35  0  0  0]
     [ 0  0  0  0  0  0  0 40  0  0]
     [ 0  0  0  0  0  0  0  0 35  1]
     [ 0  0  0  0  0  1  0  0  1 26]]
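
The same matrix is easier to scan as a heatmap, reusing the seaborn import from the PCA section (a sketch):

    mat = confusion_matrix(ytest, svc_.predict(Xtest))
    sb.heatmap(mat, annot=True, fmt='d', cbar=False)
    plt.xlabel('predicted label')
    plt.ylabel('true label')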

Plotting the misclassified digits (green = true label, red = prediction):

    Ypre = svc_.predict(Xtest)
    Xerror, Yerror, Ypreerror = Xtest[Ypre != ytest], ytest[Ypre != ytest], Ypre[Ypre != ytest]
    Xerror_images = Xerror.reshape((len(Xerror), 8, 8))
    fig, axes = plt.subplots(3, 3, figsize=(8, 8))
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    for i, ax in enumerate(axes.flat):
        ax.imshow(Xerror_images[i], cmap='binary')
        ax.text(0.05, 0.05, str(Yerror[i]), transform=ax.transAxes, color='green')
        ax.text(0.05, 0.2, str(Ypreerror[i]), transform=ax.transAxes, color='red')
        ax.set_xticks([])  # clear the axis ticks
        ax.set_yticks([])

[Figure error.png: misclassified test digits, green = true label, red = prediction]
A different random train/test split produces the following errors:
[Figure error2.png: misclassified digits for an alternative split]

Clustering: K-Means

The dataset can also be treated as an unsupervised learning problem and clustered.

    from sklearn.cluster import KMeans

    est = KMeans(n_clusters=10)
    pres = est.fit_predict(data)

Inspecting the cluster centroids gives a sense of what each cluster has captured:

    fig = plt.figure(figsize=(8, 3))
    for i in range(10):
        ax = fig.add_subplot(2, 5, 1 + i, xticks=[], yticks=[])
        ax.imshow(est.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)

[Figure kmeansc.png: the ten K-Means cluster centroids]

    trans = [7, 0, 3, 6, 1, 4, 8, 2, 9, 5]  # cluster index -> digit, read off the centroid images above
    error_indexs = np.zeros(len(data))
    for i in range(len(error_indexs)):
        if target[i] != trans[est.labels_[i]]:
            error_indexs[i] = 1
    error_indexs = error_indexs != 0
    Xerror, Yerror, Ypreerror = data[error_indexs], target[error_indexs], est.labels_[error_indexs]
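
The trans list above was read off the centroid images by hand; the same cluster-to-digit mapping can be derived automatically by assigning each cluster the most common true label among its members (a sketch using scipy.stats.mode):

    from scipy.stats import mode

    labels = np.zeros_like(est.labels_)
    for c in range(10):
        mask = (est.labels_ == c)
        labels[mask] = mode(target[mask])[0]
    print((labels == target).mean())  # clustering accuracy after relabeling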

With this hand-built mapping, 375 samples end up in the wrong cluster, an error rate of about 20.9% (375/1797). For an unsupervised method, roughly 80% accuracy is quite respectable. The mis-clustered samples are shown below:

    fig, axes = plt.subplots(10, 10, figsize=(8, 8))
    fig.subplots_adjust(hspace=0.1, wspace=0.1)
    for i, ax in enumerate(axes.flat):
        ax.imshow(Xerror[i].reshape(8, 8), cmap='binary')
        ax.text(0.05, 0.05, str(Yerror[i]), transform=ax.transAxes, color='green')
        ax.text(0.05, 0.3, str(trans[Ypreerror[i]]), transform=ax.transAxes, color='red')
        ax.set_xticks([])  # clear the axis ticks
        ax.set_yticks([])

[Figures km1.png-km4.png: mis-clustered samples, green = true label, red = assigned cluster's digit]

Code

github: https://github.com/spiritwiki/codes/tree/master/data_digits
