@chanvee 2015-05-20T11:28:35.000000Z 字数 3113 阅读 4338

# Feature selection

Python data mining

## Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet a given threshold. By default it removes all zero-variance features, i.e. features that take the same value in every sample.

For boolean (Bernoulli) features the variance is Var[X] = p(1 - p), where p is the fraction of samples in which the feature is one. So to remove features that take the same value in more than 80% of samples, set the threshold to .8 * (1 - .8):

```python
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
```
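To see which columns survived, selectors expose `get_support()`, which returns a boolean mask over the input features. A minimal sketch on the same data (the first column is zero in 5 of 6 samples, so its variance 5/36 ≈ 0.14 falls below the 0.16 threshold and it is dropped):

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit(X)

# Boolean mask of retained features: column 0 is dropped
print(sel.get_support())
```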


## Univariate feature selection

- SelectKBest keeps the K highest-scoring features
- SelectPercentile keeps a user-specified top percentage of features
- Each feature can be tested with common univariate statistical tests: false positive rate (SelectFpr), false discovery rate (SelectFdr), or family-wise error (SelectFwe)
- GenericUnivariateSelect performs univariate feature selection with a configurable strategy
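As a sketch of the percentile-based variant, SelectPercentile with the ANOVA F-value can halve the iris feature set (the 50 here is an arbitrary illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Keep the top 50% of features ranked by the ANOVA F-value
selector = SelectPercentile(f_classif, percentile=50)
X_new = selector.fit_transform(X, y)
print(X.shape, X_new.shape)  # 4 features reduced to 2
```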

```python
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
```


- For regression: f_regression
- For classification: chi2 or f_classif
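For regression, f_regression scores each feature by its linear dependence on a continuous target. A minimal sketch on synthetic data (the shapes and coefficients here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
# The target depends only on the first two columns
y = 2 * X[:, 0] + X[:, 1]

# Keep the 2 features with the highest F-scores
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)
print(X_new.shape)
```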

## L1-based feature selection

### Selecting non-zero coefficients

Linear models penalized with the L1 norm produce sparse coefficient vectors; features whose coefficients are zero can be discarded.

```python
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = LinearSVC(C=0.01, penalty="l1", dual=False).fit_transform(X, y)
>>> X_new.shape
(150, 3)
```
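Newer scikit-learn versions (0.17+) deprecate calling transform on the estimator itself in favor of the SelectFromModel meta-transformer; a minimal sketch of the equivalent:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target

# Fit the sparse linear model, then wrap it for feature selection
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)  # keeps features with non-zero coefficients
print(X_new.shape)
```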


## Tree-based feature selection

Tree ensembles expose feature_importances_, which can be used to discard irrelevant features:

```python
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> X_new = clf.fit(X, y).transform(X)
>>> clf.feature_importances_
array([ 0.0574718 ,  0.06609667,  0.46177457,  0.41465696])
>>> X_new.shape
(150, 2)
```
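The importances can also be used directly to rank features by name; a minimal sketch (n_estimators and random_state are arbitrary choices for reproducibility):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier

iris = load_iris()
X, y = iris.data, iris.target

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Rank features by importance, highest first
order = np.argsort(clf.feature_importances_)[::-1]
print([iris.feature_names[i] for i in order])
```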


## Feature selection as part of a pipeline

Feature selection is typically used as a pre-processing step before the actual learning, which a Pipeline expresses directly (note that penalty="l1" requires dual=False in LinearSVC):

```python
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = Pipeline([
...     ('feature_selection', LinearSVC(penalty="l1", dual=False)),
...     ('classification', RandomForestClassifier())
... ])
>>> clf.fit(X, y)
```
