Blog notes: An example machine learning notebook
Step 1: Answering the question
- Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?
- Did you define the metric for success before beginning?
- Did you understand the context for the question and the scientific or business application?
- Did you record the experimental design?
- Did you consider whether the question could be answered with the available data?
Step 2: Checking the data
Generally, we're looking to answer the following questions:
- Is there anything wrong with the data?
- Are there any quirks with the data?
- Do I need to fix or remove any of the data?
Summary statistics
import pandas as pd
# Treat the string 'NA' as a missing value when loading the file.
iris_data = pd.read_csv('iris-data.csv', na_values=['NA'])
iris_data.describe()
A table like this is rarely useful unless you already know what range the data should normally fall in.
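Since the summary table alone says little, one possible complement is to count missing values per column and to look at ranges per class. A minimal sketch, using a tiny made-up DataFrame as a stand-in for `iris-data.csv` (all values below are illustrative assumptions, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris-data.csv; values are made up for illustration.
sample = pd.DataFrame({
    'sepal_length_cm': [5.1, 4.9, 0.055, 6.4],
    'petal_width_cm': [0.2, np.nan, 1.3, 1.5],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor'],
})

# Number of missing entries in each column.
missing_per_column = sample.isnull().sum()

# Per-class min and max make out-of-range values easier to spot
# than a single table for the whole data set.
ranges_by_class = sample.groupby('class')['sepal_length_cm'].agg(['min', 'max'])

print(missing_per_column)
print(ranges_by_class)
```

Grouping by class is a design choice here: a value that looks plausible across the whole data set can still be far outside the range of its own class.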
import seaborn as sb
sb.pairplot(iris_data.dropna(), hue='class')
From the scatterplot matrix, we can already see some issues with the data set:
- There are five classes when there should only be three, meaning there were some coding errors.
- There are some clear outliers in the measurements that may be erroneous: one sepal_width_cm entry for Iris-setosa falls well outside its normal range, and several sepal_length_cm entries for Iris-versicolor are near-zero for some reason.
- We had to drop those rows with missing values.
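The class-label problem can also be confirmed without a plot by listing the distinct labels and comparing them with the expected ones. A sketch on made-up labels (only the misspelling `'Iris-setossa'` comes from these notes; the rest of the sample is an assumption for illustration):

```python
import pandas as pd

# Hypothetical labels; only 'Iris-setossa' is taken from the notes.
labels = pd.Series(
    ['Iris-setosa', 'Iris-setossa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'],
    name='class',
)

expected = {'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'}

# Count each distinct label, then keep the ones that are not expected.
counts = labels.value_counts()
unexpected = sorted(set(counts.index) - expected)

print(counts)
print(unexpected)
```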
Step 3: Tidying the data
iris_data.loc[iris_data['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'
iris_data['class'].unique()
iris_data = iris_data.loc[(iris_data['class'] != 'Iris-setosa') | (iris_data['sepal_width_cm'] >= 2.5)]
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'sepal_width_cm'].hist()
A few anomalous points remain as outliers. They turn out to be recorded in the wrong unit: meters instead of centimeters.
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] < 1.0)]
iris_data.loc[(iris_data['class'] == 'Iris-versicolor') & (iris_data['sepal_length_cm'] < 1.0), 'sepal_length_cm'] *= 100.0
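- Rows with missing values
They all turn out to belong to the same class, so the missing values are filled with the mean of the other points in that class.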
iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
              (iris_data['sepal_width_cm'].isnull()) |
              (iris_data['petal_length_cm'].isnull()) |
              (iris_data['petal_width_cm'].isnull())]
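The notes mention filling the gaps with the class mean, but only the row selection is shown. One way the fill could be written, as a sketch on a made-up frame (the values are illustrative, and `petal_width_cm` is picked arbitrarily as the column with a gap):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one class with a missing measurement.
df = pd.DataFrame({
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-virginica'],
    'petal_width_cm': [0.2, 0.4, np.nan, 2.0],
})

# Mean of the observed values within each class, aligned to the original rows.
class_means = df.groupby('class')['petal_width_cm'].transform('mean')

# Fill only the missing entries, each with the mean of its own class.
df['petal_width_cm'] = df['petal_width_cm'].fillna(class_means)

print(df)
```

Using `transform('mean')` keeps the result the same length as the frame, so the fill works even if several classes have gaps.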
The general takeaways here should be:
- Make sure your data is encoded properly
- Make sure your data falls within the expected range, and use domain knowledge whenever possible to define that expected range
- Deal with missing data in one way or another: replace it if you can or drop it
- Never tidy your data manually because that is not easily reproducible
- Use code as a record of how you tidied your data
- Plot everything you can about the data at this stage of the analysis so you can visually confirm everything looks correct
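To keep the tidying reproducible, the individual steps could be collected in a single function that is rerun whenever the raw file changes. A sketch under assumptions: `tidy_iris` is a hypothetical helper name, it covers only the three fixes shown in these notes, and the input frame is made up:

```python
import pandas as pd

def tidy_iris(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the tidying steps from these notes to a copy of `raw`."""
    df = raw.copy()
    # Fix the misspelled class label.
    df.loc[df['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'
    # Drop Iris-setosa rows with a sepal width below 2.5 cm.
    keep = (df['class'] != 'Iris-setosa') | (df['sepal_width_cm'] >= 2.5)
    df = df.loc[keep].copy()
    # Convert Iris-versicolor sepal lengths recorded in meters to centimeters.
    in_meters = (df['class'] == 'Iris-versicolor') & (df['sepal_length_cm'] < 1.0)
    df.loc[in_meters, 'sepal_length_cm'] *= 100.0
    return df

# Hypothetical raw data, made up for illustration.
raw = pd.DataFrame({
    'sepal_length_cm': [5.1, 4.8, 0.055, 6.0],
    'sepal_width_cm': [3.5, 2.0, 2.4, 2.9],
    'class': ['Iris-setossa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor'],
})
clean = tidy_iris(raw)
print(clean)
```

Working on a copy leaves the raw frame untouched, so the original values stay available for comparison.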
Bonus: Testing our data
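# iris_data_clean is the tidied DataFrame (its creation is not shown in these notes).
assert len(iris_data_clean['class'].unique()) == 3
# Iris-versicolor sepal lengths should never fall below 2.5 cm.
assert iris_data_clean.loc[iris_data_clean['class'] == 'Iris-versicolor', 'sepal_length_cm'].min() >= 2.5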
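In the same spirit, a check that no missing values remain could sit next to these assertions. A sketch with a hypothetical helper `check_clean`, run on a made-up frame rather than the real data:

```python
import pandas as pd

def check_clean(df: pd.DataFrame) -> None:
    """Raise AssertionError if the frame violates the expectations below."""
    # Exactly three classes should remain after fixing the labels.
    assert df['class'].nunique() == 3, 'unexpected number of classes'
    # No measurement should be missing after tidying.
    assert not df.isnull().any().any(), 'missing values remain'

# Hypothetical tidy data, made up for illustration.
good = pd.DataFrame({
    'sepal_length_cm': [5.1, 6.0, 6.5],
    'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})
check_clean(good)  # passes silently when every expectation holds
```

Wrapping the assertions in a function makes it easy to rerun them whenever the data file is updated.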
Step 4: Exploratory analysis
Exploratory analysis is the step where we start delving deeper into the data set beyond the outliers and errors. We'll be looking to answer questions such as:
- How is my data distributed?
- Are there any correlations in my data?
- Are there any confounding factors that explain these correlations?
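The first two questions can be approached numerically as well as with plots. A minimal sketch, assuming a small made-up frame with the same column naming as the notes (the measurements are illustrative, not real iris data):

```python
import pandas as pd

# Hypothetical measurements, made up for illustration.
df = pd.DataFrame({
    'petal_length_cm': [1.4, 1.5, 4.5, 4.7, 6.0, 5.9],
    'petal_width_cm': [0.2, 0.2, 1.5, 1.4, 2.5, 2.1],
    'class': ['Iris-setosa'] * 2 + ['Iris-versicolor'] * 2 + ['Iris-virginica'] * 2,
})

# Distribution: summary statistics of one measurement within each class.
per_class = df.groupby('class')['petal_length_cm'].describe()

# Correlation: pairwise Pearson correlation of the numeric columns.
correlations = df[['petal_length_cm', 'petal_width_cm']].corr()

print(per_class)
print(correlations)
```

Comparing the overall correlation with the correlation inside each class is one simple way to see whether the class itself accounts for an apparent relationship.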