@spiritnotes 2016-02-26T08:16:11.000000Z

Blog post notes: An example machine learning notebook

Tags: blog post notes, sklearn


Original post

Step 1: Answering the question

Step 2: Checking the data

Generally, we're looking to answer the following questions:

summary statistics

    import pandas as pd

    iris_data = pd.read_csv('iris-data.csv', na_values=['NA'])  # treat 'NA' as missing
    iris_data.describe()  # summary statistics for the numeric columns

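A summary table like this is rarely useful unless you already know what the normal distribution of the data looks like.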
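As a rough illustration of how summary statistics become useful once a plausible range is known, the sketch below compares each column's minimum and maximum against an expected range. The ranges in `EXPECTED_RANGES_CM`, the helper `out_of_range_columns`, and the sample values are assumptions made up for this example; they are not from the original post.

```python
import pandas as pd

# Hypothetical plausible ranges in centimeters (assumed for illustration only)
EXPECTED_RANGES_CM = {
    'sepal_length_cm': (1.0, 10.0),
    'sepal_width_cm': (1.0, 10.0),
}

def out_of_range_columns(df, expected):
    """Return the columns whose observed min or max falls outside the expected range."""
    stats = df.describe()  # rows include 'min' and 'max' for each numeric column
    flagged = []
    for column, (low, high) in expected.items():
        if stats.loc['min', column] < low or stats.loc['max', column] > high:
            flagged.append(column)
    return flagged

# Tiny made-up sample; the last sepal length looks like a unit error
sample = pd.DataFrame({
    'sepal_length_cm': [5.1, 4.9, 0.055],
    'sepal_width_cm': [3.5, 3.0, 2.9],
})

flagged = out_of_range_columns(sample, EXPECTED_RANGES_CM)
print(flagged)
```

A check like this only flags columns for a closer look; deciding whether a value is an error still needs domain knowledge.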

    import seaborn as sb
    sb.pairplot(iris_data.dropna(), hue='class')  # drop rows with missing values; color by class

From the scatterplot matrix, we can already see some issues with the data set:

Step 3: Tidying the data

    # Fix the misspelled class label, then confirm the remaining labels
    iris_data.loc[iris_data['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'
    iris_data['class'].unique()

    # Drop Iris-setosa rows with a sepal width below 2.5 cm, then re-check the distribution
    iris_data = iris_data.loc[(iris_data['class'] != 'Iris-setosa') | (iris_data['sepal_width_cm'] >= 2.5)]
    iris_data.loc[iris_data['class'] == 'Iris-setosa', 'sepal_width_cm'].hist()
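The row filter above can be read as "keep a row if it is not Iris-setosa, or if its sepal width is at least 2.5 cm", so only narrow Iris-setosa entries are dropped. A minimal sketch on a made-up three-row frame (the values are assumptions for illustration) shows the effect:

```python
import pandas as pd

# Made-up sample: one suspicious Iris-setosa row, one normal one, one narrow Iris-versicolor row
sample = pd.DataFrame({
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor'],
    'sepal_width_cm': [2.3, 3.4, 2.2],
})

# Keep a row if it is NOT Iris-setosa, or if its sepal width is at least 2.5 cm
keep = (sample['class'] != 'Iris-setosa') | (sample['sepal_width_cm'] >= 2.5)
filtered = sample.loc[keep]
print(filtered)
```

The narrow Iris-versicolor row survives because the width condition is only decisive for Iris-setosa rows.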

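A few anomalous data points remain as outliers. On inspection, their units are wrong: they were recorded in meters instead of centimeters.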

    # Inspect Iris-versicolor rows whose sepal length is implausibly small
    iris_data.loc[(iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] < 1.0)]
    # Convert those entries from meters to centimeters
    iris_data.loc[(iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] < 1.0), 'sepal_length_cm'] *= 100.0

    # Show every row with a missing value in any measurement column
    iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
                  (iris_data['sepal_width_cm'].isnull()) |
                  (iris_data['petal_length_cm'].isnull()) |
                  (iris_data['petal_width_cm'].isnull())]
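The four-way `isnull()` check above can also be written more compactly with `isnull().any(axis=1)`. The sketch below is an alternative formulation, not the original post's code; the sample frame and the `measurement_columns` name are made up for illustration.

```python
import pandas as pd

measurement_columns = ['sepal_length_cm', 'sepal_width_cm',
                       'petal_length_cm', 'petal_width_cm']

# Made-up sample with missing values in rows 1 and 2
sample = pd.DataFrame({
    'sepal_length_cm': [5.1, None, 6.3],
    'sepal_width_cm': [3.5, 3.0, 2.9],
    'petal_length_cm': [1.4, 1.4, 5.6],
    'petal_width_cm': [0.2, 0.2, None],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
})

# True for each row with at least one missing measurement
has_missing = sample[measurement_columns].isnull().any(axis=1)
rows_with_missing = sample.loc[has_missing]
print(rows_with_missing)
```

Listing the columns once keeps the check in sync if measurement columns are added later.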

The general takeaways here should be:

Bonus: Testing our data

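    # iris_data_clean is assumed to hold the tidied data from Step 3
    assert len(iris_data_clean['class'].unique()) == 3  # exactly three classes remain
    assert iris_data_clean.loc[iris_data_clean['class'] == 'Iris-versicolor', 'sepal_length_cm'].min() >= 2.5  # no leftover unit errors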
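One way to make checks like these reusable is to wrap them in a function and run it whenever the data is reloaded. This is a minimal sketch; the function name `check_iris_data` and both sample frames are hypothetical and not part of the original post.

```python
import pandas as pd

def check_iris_data(df):
    """Raise AssertionError if the tidied data violates the expectations above."""
    # Exactly three class labels should remain after fixing typos
    assert len(df['class'].unique()) == 3
    # An Iris-versicolor sepal length below 2.5 cm suggests a leftover unit error
    versicolor = df.loc[df['class'] == 'Iris-versicolor', 'sepal_length_cm']
    assert versicolor.min() >= 2.5

# Made-up clean sample: passes silently
good = pd.DataFrame({
    'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
    'sepal_length_cm': [5.1, 6.0, 6.5],
})
check_iris_data(good)

# Same sample with a value still in meters: the check should fail
bad = good.copy()
bad.loc[bad['class'] == 'Iris-versicolor', 'sepal_length_cm'] = 0.055
try:
    check_iris_data(bad)
    caught = False
except AssertionError:
    caught = True
print(caught)
```

Failing loudly on bad data means a later analysis step cannot quietly run on a data set that has regressed.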

Step 4: Exploratory analysis

Exploratory analysis is the step where we start delving deeper into the data set beyond the outliers and errors. We'll be looking to answer questions such as:
