@HaomingJiang 2016-06-03T03:30:44.000000Z 字数 3755 阅读 2894

Chp8 Clustering: Basic Concept & Algorithm

数据挖掘导论 笔记

Chp8 Clustering: Basic Concept & Algorithm

8.1 Overview

很多时候，聚类只是其他问题的起点。
组内相似，组间差别大
Clustering Categories：partitional&hierarchical、overlapping(fuzzy clusting)&exclusive、complete&partial(exclude nosies or focus on some interesting part)
Cluster Categories:
completely seperated
center-based
graph-based(e.g. contiguity-based):容易受到噪声的影响，比如一个小桥
density-based 可以消除噪声，不过也就会消掉一些不显著的族
concept cluster： too complicated

8.2 K-means

Basic algorithm：...
PS:
1. 选择合适的距离，有时候可以避免大量的相似度计算(e.g. 二分K均值)
2. 给好相似性度量和目标函数，质心可以被确定。
见Table 8-2
3. 初始质心的选择很影响效果。
solution1: 先层次聚类得到K个族，用他们的质心，有效情况：1.样本小的时候，最多上千(层次聚类开销大)2.K相对样本较小
solution2: 随机选第一个点，然后选一个离他最远的，以此类推。不过可能选到离群点。计算开销也大

空间复杂度：O((m+K)n) n,属性数，m点数
时间复杂度：O(I*K*m*n) I为迭代次数

处理空族：需要某个策略产生替补质心。可以选一个离当前所有质心最远的点。也可以从SSE最大的一个组选择一个质心，把它分裂开。

离群点：可以找到并删除，不过有时候确是感兴趣的部分。

用后处理降低SSE：
分裂一个簇：e.g.有最大SSE的
引进新质心：选择里质心们最远的
拆散一个簇：选择一个使总SSE增加最少的
合并两个簇：如质心最近的

增量的更新质心：
可以避免空簇。可以调整每个点的权值，尽管很难设。可能会导致，对点更新次序的依赖。

二分K-means
先分成两个，然后再选一个分成两个。
不太受初值影响。

不好处理的簇的类型：
非球形不规则，尺寸不成比例的，密度不成比例的。

8.3 Hierarchical Clustering

自上而下or自下而上
用dendrogram or nested cluster diagram表示

8.3.1 自下而上

不断合并最近的两个簇
定义簇的邻近度：
single link，complete link，group average，Ward，质心 etc.
空间复杂度 $O(m^2)$
时间复杂度 $O(m^2log(m))$

single linkage: 处理非椭圆的簇，对噪声和离群点敏感

complete linkage: 对噪声和离群点不敏感，不过会使大的簇破裂

group average: 折中办法

Ward：从导致SSE增加来度量相似性。相似度用距离平方的时候与group average相似。

Lance-Williams公式，
用来更新簇的相似度：
$A+B\rightarrow R$
$p(R,Q)=\alpha_Ap(A,Q)+\alpha_Bp(B,Q)+\beta p(A,B)+\gamma |p(A,Q)-p(B,Q)|$
对应系数见Table8-5

特点：
1. 无全局目标函数，没有局部极小，初值点的问题。不过复杂度高
2. 处理大小不同的簇，在LW公式中系数对大小不同的簇的调整
3. 合并了就不能撤销了，有的会去修复，e.g. 移动树的分支，另一种是使用划分聚类技术来创建很多小簇，再出发进行层次聚类。

8.4 DBSCAN

要选择合适的半径Eps————DBSCAN
core point：在密集处的内部。即个数超过某个阈值MinPts
border point：非core point，但在某个core point的领域内
noise point：其他

DBSCAN：
任意两个够近的core point在一个cluster中。
1. 标记点
2. 删掉noise points
3. 距离为Eps中的核心点连一条边
4. 找连通分支
5. 每个边界点指派到一个与它相连的簇中

时间复杂度：最坏 $O(m^2)$ ，可以用kdtree之类的优化。
空间复杂度：O(m)

参数设定：
观察k-distance，对于簇内的点会比较小，对noise point 会比较大。然后排序，画出来，在拐角处选择Eps，k即为MinPts

密度变化很大的话，会bug

advantages：相对抗噪声，可以发现边界任意大小的簇。
drawbacks：密度变化太大的话，就会有麻烦。对高维数据的密度定义也有问题，因为体积膨胀的特别快。计算邻近度的时候开销很大。

8.5 Cluster Validation

important questions：
1. clustering tendency，识别数据中是否真的有非随机成分
2. 确定簇的个数
3. 不引用附加的信息，评估聚类分析结果对数据拟合的情况。
4. 与已知的客观结果比较。
5. 比较两个簇集，确定哪个好

indexes：
1. unsupervised index or internal index：cluster cohesion & cluster separation
2. supervised or external index
3. relative index

8.5.1 unsupervised approach

$overall validity=\sum_{i=1}^Kw_i validaity(C_i)$
the setting of the weights depends on the type of validaity function.(Details are in Table8-6)
cluster cohesion: the higher the better
cluster separation: the lower the better

Based on Graph:
$cohesion(C_i)=\sum_{x,y \in C_i}proximity(x,y)$
$separation(C_i,C_j)=\sum_{x \in C_i, y \in C_j}proximity(x,y)$

Based on Prototype:
$cohesion=\sum_{x \in C_i} proximity(x,c_i)$
$separation(C_i,C_j)=proximity(c_i,c_j)$
$separation(C_i)=proximity(c_i,c)$

indexes which based on graph somtimes are equivalant to those based on protoype under some proximity

TSS=SSE(cohesion)+SSB(separation) (under Euclidian distance)

Silhouette coefficient
1. $a_i$ is the mean distance between $x_i$ and other $x_j$ in the same cluster.
2. $b_i$ is the minimum mean distance between $x_i$ and other $x_j$ in the same cluster other than the one that $x_i$ in
3. silhouette coefficient: $s_i=\frac{b_i-a_i}{max(a_i,b_i)}$
$-1<s<1$ ,越大越好

基于邻近度矩阵
实际和理想的相似度矩阵的近似程度
理想的：簇内距离为1，其他为0。看实际的和理想的相关性

可视化方法：
按每个类排列好属性，画出灰度图。看是否有对角块的pattern

之前的讨论都是基于划分的方法

基于层次聚类的：
cophenetic distance.即首次将两个项放在一起的时候的距离。
cophenetic correlation coefficient，CPCC是该矩阵和原来的相异度矩阵的项之间的相关性。

8.5.2 Determine K

用非监督度量来确定，作图
寻找其中的拐点，峰值，下降点

8.5.3 Clustering Tendency

用来判断是不是真的有簇
Hopkins统计量，随机生成p个点，从原数据抽p个点。分别找到他们每个点的最近距离 $u_i$ & $w_i$ 。 $H=\frac{sum(w_i)}{sum(w_i)+sum(u_i)}$

8.5.4 Efficiency of Cluster

classification-oriented:熵，纯度，精度，召回率，F度量
similarity-oriented:根据实际和理想的相似度矩阵，Rand & Jaccard系数
对于层次聚类：too hard，只有一个根据类标号集。

有效性度量的显著性
可以用统计的办法，看是否是统计显著的。