@spiritnotes 2016-02-26T08:16:11.000000Z

Blog post notes: An example machine learning notebook

Blog post notes, sklearn

Original article

Step 1: Answering the question

Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?
Did you define the metric for success before beginning?
Did you understand the context for the question and the scientific or business application?
Did you record the experimental design?
Did you consider whether the question could be answered with the available data?

Step 2: Checking the data

Generally, we're looking to answer the following questions:

Is there anything wrong with the data?
Are there any quirks with the data?
Do I need to fix or remove any of the data?

summary statistics

iris_data = pd.read_csv('iris-data.csv', na_values=['NA'])
iris_data.describe()

Summary statistics are rarely useful unless you already know what range the data should normally fall in.

sb.pairplot(iris_data.dropna(), hue='class')

From the scatterplot matrix, we can already see some issues with the data set:

There are five classes when there should only be three, meaning there were some coding errors.
There are some clear outliers in the measurements that may be erroneous: one sepal_width_cm entry for Iris-setosa falls well outside its normal range, and several sepal_length_cm entries for Iris-versicolor are near-zero for some reason.
We had to drop those rows with missing values.
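The same issues can also be checked without a plot. A minimal sketch, assuming a toy table whose values are illustrative and not taken from iris-data.csv:

```python
import pandas as pd

# Toy stand-in for the iris table; values and the typo are illustrative only.
df = pd.DataFrame({
    "sepal_length_cm": [5.1, 4.9, 0.067, 6.3, None],
    "class": ["Iris-setosa", "Iris-setossa", "Iris-versicolor",
              "Iris-virginica", "Iris-setosa"],
})

# Count rows per label: unexpected labels show up as extra entries.
class_counts = df["class"].value_counts()
print(class_counts)

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Label counts expose coding errors, and per-column null counts show how many rows dropna() would discard.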

Step 3: Tidying the data

Extra classes
These are caused by mislabeled class names in the data.

iris_data.loc[iris_data['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'
iris_data['class'].unique()
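When several labels need fixing, a mapping can keep the corrections in one place. A small sketch, assuming a toy Series; only the 'Iris-setossa' typo comes from these notes, and any further keys would have to be read off the output of unique():

```python
import pandas as pd

# Toy label column; illustrative only.
labels = pd.Series(["Iris-setosa", "Iris-setossa", "Iris-versicolor"], name="class")

# Hypothetical mapping from misspelled label to correct label.
label_fixes = {"Iris-setossa": "Iris-setosa"}

# Series.replace swaps every value that matches a key in the mapping.
cleaned = labels.replace(label_fixes)
print(cleaned.unique())
```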

Erroneous point
This measurement is impossible, so the row is simply dropped.

iris_data = iris_data.loc[(iris_data['class'] != 'Iris-setosa') | (iris_data['sepal_width_cm'] >= 2.5)]
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'sepal_width_cm'].hist()

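The few remaining anomalous points are outliers; the cause turns out to be a unit mix-up, with values recorded in m rather than cm.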

iris_data.loc[(iris_data['class'] == 'Iris-versicolor') &
              (iris_data['sepal_length_cm'] < 1.0)]
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') &
              (iris_data['sepal_length_cm'] < 1.0),
              'sepal_length_cm'] *= 100.0
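The unit fix can be tried on a toy table first to confirm that only the flagged rows change. A sketch with made-up values; the name in_metres is an assumption introduced here:

```python
import pandas as pd

# Toy rows; the first one mimics a length recorded in metres.
df = pd.DataFrame({
    "class": ["Iris-versicolor", "Iris-versicolor", "Iris-setosa"],
    "sepal_length_cm": [0.055, 5.7, 5.0],
})

# Boolean mask for the rows assumed to be in the wrong unit.
in_metres = (df["class"] == "Iris-versicolor") & (df["sepal_length_cm"] < 1.0)

# Convert only the masked rows from m to cm.
df.loc[in_metres, "sepal_length_cm"] *= 100.0
print(df)
```

Naming the mask makes it easy to inspect how many rows will be touched before the values are modified.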

Rows with missing data
These all belong to the same class, so the gaps are filled with the mean of the other points in that class.

iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
              (iris_data['sepal_width_cm'].isnull()) |
              (iris_data['petal_length_cm'].isnull()) |
              (iris_data['petal_width_cm'].isnull())]
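The mean-fill step described above is not shown in these notes. One possible sketch uses a per-class mean via groupby; the toy values and the choice of petal_width_cm as the column with gaps are assumptions for illustration:

```python
import pandas as pd

# Toy table with one missing measurement; illustrative values only.
df = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"],
    "petal_width_cm": [0.2, 0.4, None, 2.0],
})

# Mean of each class (NaN is skipped), broadcast back to every row of that class.
class_means = df.groupby("class")["petal_width_cm"].transform("mean")

# Fill only the missing entries with the mean of their own class.
df["petal_width_cm"] = df["petal_width_cm"].fillna(class_means)
print(df)
```

Filling with the class mean rather than the global mean keeps the imputed value inside the range typical for that species.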

The general takeaways here should be:

Make sure your data is encoded properly
Make sure your data falls within the expected range, and use domain knowledge whenever possible to define that expected range
Deal with missing data in one way or another: replace it if you can or drop it
Never tidy your data manually because that is not easily reproducible
Use code as a record of how you tidied your data
Plot everything you can about the data at this stage of the analysis so you can visually confirm everything looks correct
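The "expected range" takeaway can be written down as code. A minimal sketch, assuming hypothetical bounds; real limits should come from domain knowledge, and out_of_range_rows is a helper name invented here:

```python
import pandas as pd

# Hypothetical plausible ranges in cm; not taken from the original article.
expected_ranges = {
    "sepal_length_cm": (2.5, 10.0),
    "sepal_width_cm": (1.0, 6.0),
}

# Toy measurements; the second row has an implausibly small length.
df = pd.DataFrame({
    "sepal_length_cm": [5.1, 0.06, 6.4],
    "sepal_width_cm": [3.5, 2.9, 3.1],
})

def out_of_range_rows(frame, ranges):
    """Return rows where any listed column falls outside its (low, high) range.

    Missing values also count as out of range, because between() is False for NaN.
    """
    bad = pd.Series(False, index=frame.index)
    for column, (low, high) in ranges.items():
        bad |= ~frame[column].between(low, high)
    return frame[bad]

suspects = out_of_range_rows(df, expected_ranges)
print(suspects)
```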

Bonus: Testing our data

# iris_data_clean is the tidied DataFrame (not defined earlier in these notes)
assert len(iris_data_clean['class'].unique()) == 3
assert iris_data_clean.loc[iris_data_clean['class'] == 'Iris-versicolor', 'sepal_length_cm'].min() >= 2.5
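Such assertions can be grouped into one function so they run every time the data is reloaded. A sketch on a toy table; check_clean is a hypothetical helper, and the "no missing values" check is an extra assumption beyond the two assertions above:

```python
import pandas as pd

def check_clean(frame):
    """Raise AssertionError if the tidied table violates basic expectations."""
    assert frame["class"].nunique() == 3, "expected exactly three classes"
    assert frame.isnull().sum().sum() == 0, "expected no missing values"

# Toy tidied table; illustrative values only.
clean = pd.DataFrame({
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    "sepal_length_cm": [5.0, 5.1, 5.9, 6.5],
})
check_clean(clean)  # passes silently

# Reintroduce a label typo to show the check failing.
dirty = clean.copy()
dirty.loc[0, "class"] = "Iris-setossa"
try:
    check_clean(dirty)
    raised = False
except AssertionError:
    raised = True
print("typo detected:", raised)
```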

Step 4: Exploratory analysis

Exploratory analysis is the step where we start delving deeper into the data set beyond the outliers and errors. We'll be looking to answer questions such as:

How is my data distributed?
Are there any correlations in my data?
Are there any confounding factors that explain these correlations?
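The correlation and confounding questions can get a first numeric look before any plotting. A sketch on made-up measurements, where class stands in as a possible confounder:

```python
import pandas as pd

# Toy measurements; illustrative values only.
df = pd.DataFrame({
    "petal_length_cm": [1.4, 1.5, 4.5, 4.7, 5.8, 6.0],
    "petal_width_cm": [0.2, 0.2, 1.5, 1.4, 2.2, 2.5],
    "class": ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 2,
})

# Pairwise Pearson correlation between the numeric columns.
correlations = df.drop(columns="class").corr()
print(correlations)

# Per-class means: large gaps between classes hint that class membership
# may account for much of the overall correlation.
per_class_mean = df.groupby("class").mean()
print(per_class_mean)
```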