Similarity
Numerical measure of how alike two data objects are.
Often falls in the range [0, 1]
The larger the value, the more alike the objects are.
Dissimilarity
Numerical measure of how different two data objects are.
The larger the value, the more different the objects are.
Proximity refers to either a similarity or a dissimilarity.
Data matrix
Dissimilarity matrix
A triangular matrix.
The matrix is symmetric, so only one triangle needs to be stored.
Proximity measure for nominal attributes
Simple matching
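$d(i, j) = \frac{p - m}{p}$, where $p$ is the total number of attributes and $m$ is the number of attributes on which the two objects match.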
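A minimal sketch of simple matching, assuming the usual definition d(i, j) = (p - m) / p with p the total number of attributes and m the number of matches. The function name and the sample attribute values are made up for illustration.

```python
def simple_matching_dissimilarity(a, b):
    """Return (p - m) / p for two equal-length sequences of nominal values."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("objects must have the same non-zero number of attributes")
    p = len(a)                                  # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)  # attributes with the same value
    return (p - m) / p


# Hypothetical objects described by (colour, shape, size).
obj_i = ["red", "round", "small"]
obj_j = ["red", "square", "small"]
print(simple_matching_dissimilarity(obj_i, obj_j))  # 1 mismatch out of 3
```

The corresponding similarity is simply 1 minus this value.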
Proximity measure for binary attributes
A contingency table for binary data, comparing two objects i and j
| object i \ object j | 1 | 0 |
| --- | --- | --- |
| 1 | q | r |
| 0 | s | t |
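* Distance measure for symmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s + t}$
* Distance measure for asymmetric binary variables: $d(i, j) = \frac{r + s}{q + r + s}$
* Jaccard coefficient (similarity measure for asymmetric binary variables): $sim(i, j) = \frac{q}{q + r + s}$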
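The three binary measures can be sketched as follows, assuming the standard textbook definitions in terms of the contingency counts q, r, s, t (q = both 1, r = first 1 and second 0, s = first 0 and second 1, t = both 0). Function names and the sample vectors are illustrative only.

```python
def binary_counts(a, b):
    """Count (q, r, s, t) for two equal-length 0/1 vectors."""
    q = r = s = t = 0
    for x, y in zip(a, b):
        if x == 1 and y == 1:
            q += 1
        elif x == 1 and y == 0:
            r += 1
        elif x == 0 and y == 1:
            s += 1
        else:  # inputs are assumed to be 0/1, so this is the (0, 0) case
            t += 1
    return q, r, s, t


def symmetric_distance(a, b):
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_distance(a, b):
    # Ignores t: joint absences carry no information for asymmetric attributes.
    # Undefined (division by zero) when q + r + s == 0.
    q, r, s, _ = binary_counts(a, b)
    return (r + s) / (q + r + s)


def jaccard_similarity(a, b):
    q, r, s, _ = binary_counts(a, b)
    return q / (q + r + s)


x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]
print(binary_counts(x, y))        # (2, 0, 1, 3)
print(asymmetric_distance(x, y))  # 1/3
print(jaccard_similarity(x, y))   # 2/3
```

Note that the Jaccard similarity and the asymmetric distance always sum to 1.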
Standardizing Numeric data
Z-score
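Subtract the mean from the raw value and divide by the standard deviation: $z = \frac{x - \mu}{\sigma}$. After standardization the data have mean 0 and standard deviation 1.
An alternative: compute the mean absolute deviation and use it in place of the standard deviation.
Minkowski distance
It is the norm of the difference between two objects' attribute vectors.
Properties: non-negativity, symmetry, and the triangle inequality.
A distance that satisfies these three properties can serve as a metric.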
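A small sketch of both standardizations, using only the Python standard library. It assumes the population standard deviation (`statistics.pstdev`); the function names and sample data are illustrative.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize with the mean and the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # zero for constant data, which would divide by zero
    return [(v - mu) / sigma for v in values]


def mad_scores(values):
    """Standardize with the mean absolute deviation instead of sigma."""
    mu = mean(values)
    s = mean(abs(v - mu) for v in values)
    return [(v - mu) / s for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std dev 2
print(z_scores(data))
print(mad_scores(data))
```

Because the deviations are not squared, the mean absolute deviation is less affected by outliers than the standard deviation.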
Special cases
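Manhattan distance (L1 norm)
Sum of the absolute differences.
Euclidean distance (L2 norm)
The straight-line distance of Euclidean geometry.
"Supremum" distance (L∞ norm)
The largest single-attribute difference dominates.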
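The special cases can be checked with one function, assuming the usual form of the Minkowski distance, d(x, y) = (Σ |x_f − y_f|^h)^(1/h), where the order h is a name introduced here for illustration. h = 1 gives Manhattan, h = 2 gives Euclidean, and h → ∞ gives the supremum distance.

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == math.inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))         # Manhattan: 2 + 3
print(minkowski(x1, x2, 2))         # Euclidean: sqrt(4 + 9)
print(minkowski(x1, x2, math.inf))  # supremum: max(2, 3)
```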
Ordinal Variables
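Order is important.
Replace each ordered value with its rank, map the ranks onto [0, 1], and then compute the dissimilarity as for numeric attributes.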
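A sketch of the rank mapping, assuming the common choice z = (r − 1) / (M − 1), where r is the 1-based rank of a value and M is the number of ordered levels. The level names are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordinal value to [0, 1] via its rank among ordered_levels."""
    rank = ordered_levels.index(value) + 1  # ranks run from 1 to M
    M = len(ordered_levels)                 # needs at least two levels
    return (rank - 1) / (M - 1)


def ordinal_dissimilarity(a, b, ordered_levels):
    """Absolute difference of the mapped values (a numeric-style distance)."""
    za = ordinal_to_unit_interval(a, ordered_levels)
    zb = ordinal_to_unit_interval(b, ordered_levels)
    return abs(za - zb)


levels = ["fair", "good", "excellent"]
print(ordinal_to_unit_interval("good", levels))        # 0.5
print(ordinal_dissimilarity("fair", "excellent", levels))  # 1.0
```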
Attributes of Mixed Type
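An object may mix attribute types (binary, numeric, ordinal). Convert each type separately, then combine the results into a single dissimilarity.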
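One simple way to combine the per-attribute results, shown here as an assumption rather than the notes' exact formula: scale every per-attribute dissimilarity to [0, 1] and average over the attributes that are present in both objects. The function name and input layout are illustrative.

```python
def mixed_dissimilarity(per_attribute):
    """Average per-attribute dissimilarities, skipping missing ones.

    per_attribute: list of values in [0, 1], or None where the attribute
    is missing for either object (such attributes get zero weight).
    """
    present = [d for d in per_attribute if d is not None]
    if not present:
        raise ValueError("no comparable attributes")
    return sum(present) / len(present)


# Hypothetical object pair: nominal mismatch (1.0), identical binary (0.0),
# one missing numeric attribute, and an ordinal difference of 0.5.
print(mixed_dissimilarity([1.0, 0.0, None, 0.5]))  # 0.5
```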
Cosine similarity
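For reference, a minimal sketch assuming the standard definition cos(x, y) = (x · y) / (‖x‖ ‖y‖), applied here to two made-up term-frequency vectors.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)  # undefined if either vector is all zeros


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

A value near 1 means the vectors point in almost the same direction, regardless of their lengths.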
Summary
03. Data Preprocessing
Overview
Why?
Major Tasks
Data cleaning
Data integration
Data reduction
Data transformation and data discretization
Data Cleaning
Dirty data
Incomplete, noisy, inconsistent, intentional
Incomplete (missing) data
Some common ways to handle it
Noisy data
Binning, regression, clustering, manual inspection
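As an illustration of the first technique, here is a sketch of smoothing by bin means with equal-frequency bins. The bin size and the price list are assumptions chosen for the example.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split into equal-frequency bins, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]  # last bin may be shorter
        bin_mean = sum(current_bin) / len(current_bin)
        smoothed.extend([bin_mean] * len(current_bin))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```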
Data cleaning as a process
Data Integration
Handling Redundancy
Correlation Analysis (Nominal data)
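The larger the $\chi^2$ value, the more likely the two attributes are related.
Correlation coefficient
Covariance
Simplified computation: $Cov(A, B) = E(A \cdot B) - \bar{A}\bar{B}$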
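A sketch of the χ² statistic for two nominal attributes, assuming the usual Pearson form χ² = Σ (observed − expected)² / expected, with expected counts computed from the row and column totals. The contingency counts below are illustrative.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count if independent
            chi2 += (o - e) ** 2 / e
    return chi2


# Rows: one attribute's two values; columns: the other attribute's two values.
table = [[250, 200],
         [50, 1000]]
print(round(chi_square(table), 2))  # 507.94
```

The statistic only measures how far the counts are from independence; whether that is significant depends on the degrees of freedom and the chosen significance level.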
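The shortcut can be sketched as below, assuming the population form Cov(A, B) = E(A·B) − mean(A)·mean(B) and the correlation coefficient r = Cov(A, B) / (σ_A σ_B). The two series are made-up numbers.

```python
import math
from statistics import mean


def covariance(a, b):
    """Population covariance via E(A*B) - mean(A) * mean(B)."""
    return mean(x * y for x, y in zip(a, b)) - mean(a) * mean(b)


def correlation(a, b):
    """Pearson correlation coefficient; Cov(A, A) is the variance of A."""
    sigma_a = math.sqrt(covariance(a, a))
    sigma_b = math.sqrt(covariance(b, b))
    return covariance(a, b) / (sigma_a * sigma_b)


A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
print(covariance(A, B))   # about 4: the two series tend to rise together
print(correlation(A, B))  # about 0.94
```

A positive covariance means the attributes tend to move together; the correlation coefficient rescales it to [-1, 1].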
Data Reduction
Dimensionality Reduction
wavelet transform
PCA
Numerosity Reduction
Regression Analysis
Linear regression
Multiple regression
Log-linear models
Histogram analysis
Clustering
Sampling
Simple random sampling
Sampling without replacement
Sampling with replacement
Stratified sampling
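The sampling schemes can be sketched with Python's `random` module. The helper names, the `key` function, and the per-stratum rounding rule (at least one item per stratum) are assumptions made for this example.

```python
import random


def sample_without_replacement(data, n, rng):
    """Each item can be drawn at most once."""
    return rng.sample(data, n)


def sample_with_replacement(data, n, rng):
    """The same item may be drawn several times."""
    return [rng.choice(data) for _ in range(n)]


def stratified_sample(data, key, fraction, rng):
    """Draw roughly the same fraction from every stratum defined by key(item)."""
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # keep small strata represented
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
picked = stratified_sample(customers, key=lambda c: c[0], fraction=0.1, rng=rng)
print(len(picked))  # 9 young + 1 senior
```

Stratified sampling guarantees that a small group such as "senior" still appears in the sample, which simple random sampling does not.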
Data cube aggregation
Data compression
Data transformation
Normalization
Min-max normalization to a new range $[new\_min, new\_max]$
Z-score normalization
Normalization by decimal scaling
$v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$.
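A sketch of min-max normalization and decimal scaling, assuming the usual definitions: v' = (v − min) / (max − min) · (new_max − new_min) + new_min, and v' = v / 10^j with j the smallest integer such that max(|v'|) < 1. Parameter names and sample values are illustrative.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |v'| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(decimal_scaling([-986, 917]))            # ([-0.986, 0.917], 3)
```

Min-max normalization needs the true minimum and maximum, so a later value outside that range falls outside the target interval.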
06. Mining Frequent Patterns, Associations and Correlations
Basic concepts
Frequent patterns
Itemset
k-itemset: an itemset that contains k items.
Support count (absolute support): the number of transactions in which the itemset occurs.
An itemset X is frequent if X's support is no less than a minsup threshold.
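A small sketch of these definitions over a toy transaction database. Here minsup is taken as a relative threshold (a fraction of all transactions), which is an assumption; the transactions are made up.

```python
def support_count(itemset, transactions):
    """Absolute support: number of transactions that contain every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def is_frequent(itemset, transactions, minsup):
    """True if the relative support is no less than minsup (a fraction)."""
    return support_count(itemset, transactions) / len(transactions) >= minsup


transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
print(support_count({"Beer", "Diaper"}, transactions))     # 3
print(is_frequent({"Beer", "Diaper"}, transactions, 0.5))  # True
```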
Association rules
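Find the relationship between two patterns X and Y, written X → Y.
support: the probability that a transaction contains both X and Y (that is, X ∪ Y).
confidence: the conditional probability that a transaction containing X also contains Y.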
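Support and confidence of a rule X → Y can be sketched as below, assuming support = P(X ∪ Y) and confidence = P(Y | X) = support_count(X ∪ Y) / support_count(X). The transactions and function names are illustrative.

```python
def count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= set(t))


def rule_support(x, y, transactions):
    """Fraction of transactions that contain all items of X and Y."""
    return count(set(x) | set(y), transactions) / len(transactions)


def rule_confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
print(rule_support({"Beer"}, {"Diaper"}, transactions))     # 0.6
print(rule_confidence({"Beer"}, {"Diaper"}, transactions))  # 1.0
print(rule_confidence({"Diaper"}, {"Beer"}, transactions))  # 0.75
```

The two rules share the same support but differ in confidence, because confidence depends on which side is the antecedent.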
Closed patterns and max-pattern
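Closed: no proper superset of the itemset has the same support count.
Closed pattern X: X is not a proper subset of any other itemset that has the same support as X.
Max-pattern X: X is frequent and is not a proper subset of any other frequent itemset.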
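To make the two definitions concrete, the sketch below enumerates all frequent itemsets of a tiny database by brute force and then filters them. This is only suitable for toy data; the function names, the absolute threshold `min_count`, and the transactions are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Brute force: map every frequent itemset to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            c = sum(1 for t in transactions if set(candidate) <= t)
            if c >= min_count:
                result[frozenset(candidate)] = c
    return result


def closed_and_max(freq):
    """Split frequent itemsets into closed patterns and max-patterns."""
    # Closed: no proper superset with the same support count.
    closed = {x for x, c in freq.items()
              if not any(x < y and freq[y] == c for y in freq)}
    # Max: no proper superset that is frequent at all.
    maximal = {x for x in freq if not any(x < y for y in freq)}
    return closed, maximal


db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a"}]
closed, maximal = closed_and_max(frequent_itemsets(db, min_count=2))
print(sorted(sorted(x) for x in closed))   # [['a'], ['a', 'b'], ['a', 'b', 'c']]
print(sorted(sorted(x) for x in maximal))  # [['a', 'b', 'c']]
```

Every max-pattern is also closed, but not the other way round: {a} and {a, b} are closed because their supersets have strictly smaller support, yet they are not maximal.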