@mynameiszhangxu 2015-01-06T07:39:13.000000Z 字数 5678 阅读 2530

数据挖掘算法认知与实践

数据挖掘

数据挖掘算法认知与实践

鸢尾花数据集

数据集包含了50个样本，都属于鸢尾属下的三个亚属，分别是山鸢尾、变色鸢尾和维吉尼亚鸢尾（英语：Virginia Iris）。四个特征被用作样本的定量分析，它们分别是花萼和花瓣的长度和宽度。
鸢尾花数据集

Kmeans聚类算法

算法内容

Kmeans作为数据挖掘十大经典算法，对其进行必要的了解对数据挖掘的学习有重要意义。
作为最经典的基于划分的聚类方法，其基本思想极为简单：以空间中k个点为中心进行聚类，对最靠近他们的对象归类。通过迭代的方法，逐次更新各聚类中心的值，直至得到最好的聚类结果。
具体做法如下（假设将样本集分为 n 类）:

适当的选择 n 个类的初始中心；
在第k次迭代中，对任意一个样本，求其到 n 各中心的距离，将该样本归到距离最短的中心所在的类;
利用均值等方法更新该类的中心值
对于所有的 n 个聚类中心，如果利用步骤2和3的迭代法更新后，值保持不变，则迭代结束，否则继续迭代。

测试（以安德森鸢尾花数据集为例）

在传入分类数时，设定值为 3
KM.setNumClusters(3);
代码如下

import java.io.File;
import weka.clusterers.SimpleKMeans;
import weka.core.DistanceFunction;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
public class kmwan {
    /**
     * @param args
     */
    public static void main(String[] args) {
       // TODO Auto-generated method stub
       Instances ins = null;
       Instances tempIns = null;
       SimpleKMeans KM = null;
       DistanceFunction disFun = null;
       try{
           /*
            * 1.读入样本
            */
           File file= new File("E:/__tools__/weka/Weka-3-6/data/iris.arff");
           ArffLoader loader = new ArffLoader();
           loader.setFile(file);
           ins = loader.getDataSet();
           /*
            * 2.初始化聚类器
            */
           KM = new SimpleKMeans();       
           //设置聚类要得到的类别数量
           KM.setNumClusters(3);
           /*
            * 3.使用聚类算法对样本进行聚类
            */
           KM.buildClusterer(ins);
           /*
            * 4.打印聚类结果
            */
           tempIns = KM.getClusterCentroids();
           System.out.println("CentroIds: " + tempIns);
       }catch(Exception e){
           e.printStackTrace();
       }
    }
}

此时挖掘结果为

5.936,2.77,4.26,1.326,Iris-versicolor
5.006,3.418,1.464,0.244,Iris-setosa
6.588,2.974,5.552,2.026,Iris-virginica

可以看出 Kmeans 方法在类别数量为 3 的时候，很确切的通过线性聚类的方法，将数据集中的数据分为了三类。
当我们将类别数的值定为4的时候KM.setNumClusters(4);
结果显示为如下数据：

6.329167,2.979167,4.6,1.4625,Iris-versicolor
5.573077,2.576923,3.946154,1.2,Iris-versicolor
6.588,2.974,5.552,2.026,Iris-virginica
5.006,3.418,1.464,0.244,Iris-setosa

从中我们可以简单的发现：类别为“versicolor”的鸢尾花可能在整个种群中占更多的数量。这就是我们通过数据挖掘获得的情报。
当我们将类别数的值定为4的时候KM.setNumClusters(2);
数据显示如下：

6.262,2.872,4.906,1.676,Iris-versicolor
5.006,3.418,1.464,0.244,Iris-setosa

这时候算法将其分成了两个类别“versicolor”和“setosa”，说明只能用两种类型的鸢尾花表示整个族群的话，“virginica”这个种类将被会其他两种代表，也就是说这个种类在整个族群中不如其他两种具有代表性。这也是我们通过简单的数据挖掘发现的情报。

ChiMerge 算法

算法内容

ChiMerge 是监督的、自底向上的(即基于合并的)数据离散化方法。它依赖于卡方分析：具有最小卡方值的相邻区间合并在一起，直到满足确定的停止准则。
基本思想：对于精确的离散化，相对类频率在一个区间内应当完全一致。因此，如果两个相邻的区间具有非常类似的类分布，则这两个区间可以合并；否则，它们应当保持分开。而低卡方值表明它们具有相似的类分布。
方法如下：
一、初始化
根据要离散的属性对实例进行排序：每个实例属于一个区间；
二、合并区间，又包括两步骤
(1) 计算每一对相邻区间的卡方值
(2) 将卡方值最小的一对区间合并
卡方计算公式：
$卡方计算公式$

参数说明

m=2（每次比较的区间数是2个）
k=类别数量
Aij=第i区间第j类的实例的数量
Ri=第i区间的实例数量
$Ri代表的内容$
Cj=第j类的实例数量
$Ci代表的内容$
N=总的实例数量
$N代表的内容$
Eij= Aij的期望频率

测试（以安德森鸢尾花数据集为例）

取鸢尾花数据集作为待离散化的数据集合，使用ChiMerge算法，对四个数值属性分别进行离散化，令停机准则为max_interval=6；主要有两个步骤：

第一步，整理数据
读入鸢尾花数据集，构造可以在其上使用ChiMerge的数据结构，即, 形如 [('4.3', [1, 0, 0]), ('4.4', [3, 0, 0]),...]的列表，每一个元素是一个元组，元组的第一项是字符串，表示区间左端点，元组的第二项是一个列表，表示在此区间各个类别的实例数目；
第二步，离散化
使用ChiMerge方法对具有最小卡方值的相邻区间进行合并，直到满足最大区间数(max_interval)为6
程序最终返回区间的分裂点

from time import ctime
def read(file):
    '''read raw date from a file '''
    Instances=[]
    fp=open(file,'r')
    for line in fp:
        line=line.strip('\n') #discard '\n'
        if line!='':
            Instances.append(line.split(','))
    fp.close()
    return(Instances)
def split(Instances,i):
    ''' Split the 4 attibutes, collect the data of the ith attributs, i=0,1,2,3
        Return a list like [['0.2', 'Iris-setosa'], ['0.2', 'Iris-setosa'],...]'''
    log=[]
    for r in Instances:
        log.append([r[i],r[4]])
    return(log)
def count(log):
    '''Count the number of the same record
       Return a list like [['4.3', 'Iris-setosa', 1], ['4.4', 'Iris-setosa', 3],...]'''
    log_cnt=[]
    log.sort(key=lambda log:log[0])
    i=0
    while(i<len(log)):
        cnt=log.count(log[i])#count the number of the same record
        record=log[i][:]
        record.append(cnt) # the return value of append is None
        log_cnt.append(record)
        i+=cnt#count the next diferent item 
    return(log_cnt)
def build(log_cnt):
    '''Build a structure (a list of truples) that ChiMerge algorithm works properly on it '''
    log_dic={}
    for record in log_cnt:
        if record[0] not in log_dic.keys():
            log_dic[record[0]]=[0,0,0]
        if record[1]=='Iris-setosa':
            log_dic[record[0]][0]=record[2]
        elif record[1]=='Iris-versicolor':
            log_dic[record[0]][1]=record[2]
        elif record[1]=='Iris-virginica':
            log_dic[record[0]][2]=record[2]
        else:
            raise TypeError("Data Exception")
    log_truple=sorted(log_dic.items())
    return(log_truple)
def collect(Instances,i):
    ''' collect data for discretization '''
    log=split(Instances,i)
    log_cnt=count(log)
    log_tuple=build(log_cnt)
    return(log_tuple)
def combine(a,b):
    '''  a=('4.4', [3, 1, 0]), b=('4.5', [1, 0, 2])
         combine(a,b)=('4.4', [4, 1, 2])  '''
    c=a[:] # c[0]=a[0]
    for i in range(len(a[1])):
        c[1][i]+=b[1][i]
    return(c)
def chi2(A):
    ''' Compute the Chi-Square value '''   
    m=len(A);
    k=len(A[0])
    R=[]
    for i in range(m):
        sum=0
        for j in range(k):
            sum+=A[i][j]
        R.append(sum)
    C=[]
    for j in range(k):
        sum=0
        for i in range(m):
            sum+=A[i][j]
        C.append(sum)
    N=0
    for ele in C:
        N+=ele
    res=0
    for i in range(m):
        for j in range(k):
            Eij=R[i]*C[j]/N
            if Eij!=0:
                res=res+(A[i][j]-Eij)**2/Eij
    return res
def ChiMerge(log_tuple,max_interval):
    ''' ChiMerge algorithm  '''
    ''' Return split points '''    
    num_interval=len(log_tuple)
    while(num_interval>max_interval):               
        num_pair=num_interval-1
        chi_values=[]
        for i in range(num_pair):
            arr=[log_tuple[i][1],log_tuple[i+1][1]]
            chi_values.append(chi2(arr))
        min_chi=min(chi_values) # get the minimum chi value 
        for i in range(num_pair-1,-1,-1): # treat from the last one
            if chi_values[i]==min_chi:
                log_tuple[i]=combine(log_tuple[i],log_tuple[i+1]) # combine the two adjacent intervals
                log_tuple[i+1]='Merged'
        while('Merged' in log_tuple): # remove the merged record
            log_tuple.remove('Merged')
        num_interval=len(log_tuple)
    split_points=[record[0] for record in log_tuple]
    return(split_points)
def discrete(path):
    ''' ChiMerege discretization of the Iris plants database '''
    Instances=read(path)
    max_interval=6
    num_log=4
    for i in range(num_log):
        log_tuple=collect(Instances,i) # collect data for discretization
        split_points=ChiMerge(log_tuple,max_interval) # discretize data using ChiMerge algorithm 
        print(split_points)
if __name__=='__main__':
    print('Start: ' + ctime())
    discrete('c:\\Python33\\iris.data')
    print('End: ' + ctime())

获得的数据如下：

['4.3', '4.9', '5.0', '5.5', '5.8', '7.1']
['2.0', '2.3', '2.5', '2.9', '3.0', '3.4']
['1.0', '3.0', '4.5', '4.8', '5.0', '5.2']
['0.1', '1.0', '1.4', '1.7', '1.8', '1.9']

数据挖掘算法认知与实践

鸢尾花数据集

Kmeans聚类算法

算法内容

测试（以安德森鸢尾花数据集为例）

ChiMerge 算法

算法内容

测试（以安德森鸢尾花数据集为例）

内容目录

选择主题