@spiritnotes 2016-03-14T10:05:09.000000Z

Machine Learning in Practice -- Boston Housing Prices

Machine Learning in Practice


Getting the Data

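    # load_boston was removed in scikit-learn 1.2; this call needs an older release.
    from sklearn.datasets import load_boston
    boston = load_boston()  # Bunch with .data (features) and .target (median prices)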

This dataset is intended for regression. Its overall description is as follows:

    Data Set Characteristics:
        :Number of Instances: 506
        :Number of Attributes: 13 numeric/categorical predictive
        :Median Value (attribute 14) is usually the target

The dataset contains no missing values.
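One way to check a claim like this is to count the NaN entries per column. The sketch below is only an illustration: `count_missing` is a hypothetical helper, and `demo` is a small made-up array standing in for `boston.data`, with one gap planted on purpose so the count is visible.

```python
import numpy as np

def count_missing(data):
    """Return the number of NaN entries in each column of a 2-D array."""
    data = np.asarray(data, dtype=float)
    return np.isnan(data).sum(axis=0)

# Hypothetical stand-in for boston.data: 5 samples, 3 features, one planted gap.
demo = np.array([
    [0.1, 18.0, 2.3],
    [0.2, np.nan, 7.1],
    [0.3, 0.0, 7.1],
    [0.1, 0.0, 2.2],
    [0.4, 12.5, 2.2],
])
missing_per_column = count_missing(demo)
print(missing_per_column)  # per-column NaN counts
```

On the real data, `count_missing(boston.data).sum() == 0` would be the equivalent check.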

Analysis

Data Analysis

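Plotting the house price against each variable on its own shows how the two are related; several of the variables have a visibly apparent relationship with the price.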
[Figures: 1.png, 2.png, 3.png, 4.png, plots of house price against the individual variables]
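The visual impression from such plots can be backed by a number, for instance the Pearson correlation between each feature and the target. This is a minimal sketch rather than the analysis used for the figures: `feature_target_correlations` is a hypothetical helper, and the data is synthetic (two informative columns and one noise column) so that it runs without the dataset.

```python
import numpy as np

def feature_target_correlations(data, target):
    """Pearson correlation between each column of `data` and `target`."""
    data = np.asarray(data, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.array([np.corrcoef(data[:, j], target)[0, 1]
                     for j in range(data.shape[1])])

# Synthetic stand-in for (boston.data, boston.target).
rng = np.random.RandomState(0)
x_up = rng.normal(size=200)      # pushes the target up
x_down = rng.normal(size=200)    # pushes the target down
x_noise = rng.normal(size=200)   # unrelated to the target
demo_data = np.column_stack([x_up, x_down, x_noise])
demo_target = 3.0 * x_up - 2.0 * x_down + rng.normal(scale=0.5, size=200)

corr = feature_target_correlations(demo_data, demo_target)
for name, value in zip(["x_up", "x_down", "x_noise"], corr):
    print(name, round(float(value), 2))
```

A correlation near zero only rules out a linear relationship, so the plots remain useful for spotting curved ones.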

Regression

Linear Regression

    import numpy as np
    from sklearn.linear_model import LinearRegression
    # sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives here.
    from sklearn.model_selection import KFold

    def get_rmse_of_regression(data, target, cv=10, fit_intercept=True):
        """Return (test RMSE, training RMSE) from shuffled k-fold cross-validation."""
        kf = KFold(n_splits=cv, shuffle=True)
        err_test_all, err_train_all = 0.0, 0.0
        for train, test in kf.split(data):
            lr = LinearRegression(fit_intercept=fit_intercept)
            lr.fit(data[train], target[train])
            # Accumulate squared residuals on the training folds.
            err_train = lr.predict(data[train]) - target[train]
            err_train_all += np.sum(err_train * err_train)
            # Accumulate squared residuals on the held-out fold.
            err_test = lr.predict(data[test]) - target[test]
            err_test_all += np.sum(err_test * err_test)
        # Every sample is tested exactly once ...
        rmse_test = np.sqrt(err_test_all / len(data))
        # ... and is used for training in cv - 1 of the splits.
        rmse_train = np.sqrt(err_train_all / (cv - 1.0) / len(data))
        return rmse_test, rmse_train
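
The two normalisations at the end of the function can be written out as follows. The notation is introduced here for illustration and is not from the original post: n is the number of samples, k the number of folds (`cv`), T_f the test indices of fold f, and ŷ^(f) the prediction of the model fitted without fold f. Each sample is in a test fold once and in a training set k - 1 times, which is where the two denominators come from.

```latex
\mathrm{RMSE}_{\text{test}}
  = \sqrt{\frac{1}{n}\sum_{f=1}^{k}\sum_{i \in T_f}\bigl(\hat{y}^{(f)}_i - y_i\bigr)^2},
\qquad
\mathrm{RMSE}_{\text{train}}
  = \sqrt{\frac{1}{(k-1)\,n}\sum_{f=1}^{k}\sum_{i \notin T_f}\bigl(\hat{y}^{(f)}_i - y_i\bigr)^2}
```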

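Using linear regression with 10-fold cross-validation, the training-set RMSE comes out at roughly 4.8 to 5.
Over 40 randomized runs, the RMSE results are as shown below. The test RMSE is slightly higher than the training RMSE; this is mild overfitting and is to be expected.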
[Figure: rmse_rl.png, RMSE results of the linear regression runs]
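
The repeated-run experiment can be sketched as below. This is not the author's script: `kfold_rmse_ols` is a hypothetical helper that uses NumPy's least-squares solver in place of `LinearRegression` (both fit ordinary least squares with an intercept), and the data is synthetic so the block runs without the Boston dataset. Only the 40 repetitions and the 10 folds are taken from the text.

```python
import numpy as np

def kfold_rmse_ols(data, target, cv=10, rng=None):
    """Shuffled k-fold (test RMSE, train RMSE) for least squares with intercept."""
    rng = np.random.RandomState() if rng is None else rng
    n = len(data)
    X = np.column_stack([np.ones(n), data])          # prepend intercept column
    folds = np.array_split(rng.permutation(n), cv)   # random, near-equal folds
    sq_test, sq_train = 0.0, 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        coef = np.linalg.lstsq(X[train], target[train], rcond=None)[0]
        sq_train += np.sum((X[train] @ coef - target[train]) ** 2)
        sq_test += np.sum((X[test] @ coef - target[test]) ** 2)
    # Same normalisation as in the post: n test residuals, (cv - 1) * n train residuals.
    return np.sqrt(sq_test / n), np.sqrt(sq_train / ((cv - 1.0) * n))

# Synthetic stand-in for (boston.data, boston.target); noise standard deviation 2.
rng = np.random.RandomState(42)
demo_data = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
demo_target = demo_data @ true_coef + rng.normal(scale=2.0, size=200)

# 40 repetitions, each with a fresh shuffle of the folds.
runs = np.array([kfold_rmse_ols(demo_data, demo_target, cv=10, rng=rng)
                 for _ in range(40)])
mean_test, mean_train = runs.mean(axis=0)
print("mean test RMSE:", mean_test, "mean train RMSE:", mean_train)
```

For least squares, the held-out error of a fold cannot be smaller than the error the full-data fit makes on the same points, so the gap between the two RMSE values has the same sign in every run; only its size varies with the shuffle.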

Elastic Net

We then repeat the test with an elastic net. The results are shown below; the RMSE increases somewhat.

    import numpy as np
    from sklearn.linear_model import ElasticNetCV
    # sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives here.
    from sklearn.model_selection import KFold

    def get_rmse_of_ElasticNetCV(data, target, cv=10, fit_intercept=True):
        """Return (test RMSE, training RMSE) from shuffled k-fold cross-validation."""
        kf = KFold(n_splits=cv, shuffle=True)
        err_test_all, err_train_all = 0.0, 0.0
        for train, test in kf.split(data):
            # ElasticNetCV picks its regularization strength by an inner cross-validation.
            en = ElasticNetCV(fit_intercept=fit_intercept)
            en.fit(data[train], target[train])
            # Accumulate squared residuals on the training folds.
            err_train = en.predict(data[train]) - target[train]
            err_train_all += np.sum(err_train * err_train)
            # Accumulate squared residuals on the held-out fold.
            err_test = en.predict(data[test]) - target[test]
            err_test_all += np.sum(err_test * err_test)
        # Every sample is tested exactly once ...
        rmse_test = np.sqrt(err_test_all / len(data))
        # ... and is used for training in cv - 1 of the splits.
        rmse_train = np.sqrt(err_train_all / (cv - 1.0) / len(data))
        return rmse_test, rmse_train

[Figure: rmse_EN.png, RMSE results of the elastic net runs]
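
For reference, the objective minimised by scikit-learn's elastic net can be written as follows. The symbols are chosen here for illustration: w is the coefficient vector, n the number of training samples, α the regularization strength (selected by `ElasticNetCV` through its inner cross-validation) and ρ the `l1_ratio`, which defaults to 0.5.

```latex
\min_{w}\;
\frac{1}{2n}\,\lVert y - Xw \rVert_2^2
\;+\; \alpha\,\rho\,\lVert w \rVert_1
\;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With α = 0 this reduces to ordinary least squares. For any α > 0 the penalty shrinks the coefficients, so the training RMSE cannot fall below that of plain linear regression. One possible reading of the result above is that the added bias is not offset by a reduction in variance on this dataset.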
