
University of Electronic Science and Technology of China

Lab Report

Student: 唐智崴
Student ID: 2017110801028
Advisor: 邵俊明
Data Mining Experiments

Contents

  Experiment 1
  Experiment 2
  Experiment 3
  Experiment 4
  Appendix
   Experiment 1 Code
    Data Normalization
    Missing-Value Handling
    Feature Selection
   Experiment 2 Code
    Association Rule Mining
   Experiment 3 Code
    KNN Classification
    Decision Tree Classification
   Experiment 4 Code
    K-Means Clustering


Experiment 1

1. Project Title:

Understanding Data and Data Preprocessing

2. Principles:

  1. Attribute normalization: map the observed attribute values onto a fixed interval (e.g. [0,1]) according to a given rule
  2. Missing-value handling: fill missing values with the attribute mean
  3. Feature selection: compute each attribute's information gain, filter out the low-gain features, and keep the high-gain ones
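
In formula form (the standard definitions, which the code below implements): min-max normalization maps an observed value x of attribute A to

$$x' = \frac{x - \min_A}{\max_A - \min_A}$$

and mean imputation fills each missing cell with the mean $\bar{x} = \frac{1}{|S|} \sum_{i \in S} x_i$ over the set S of non-missing observations. The information gain formulas for step 3 are given in the Procedure section.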

3. Objectives:

Get familiar with the Weka toolkit and the Eclipse platform; understand the data and be able to perform simple preprocessing on it

4. Content:

I chose to implement this experiment in two ways: with Weka, and with my own Python code.
First I used Weka's built-in functionality to carry out the operations above directly, then re-implemented them by hand in Python to gain a deeper understanding of the process.

5. Procedure:

Weka:

  1. Data normalization:
    Weka GUI -> Explorer -> Open file -> select the data file data1.xlsx
    Then under Filter choose weka -> filters -> unsupervised -> attribute -> Normalize
    Finally store the result in another column of data1.xlsx
  2. Missing-value handling:
    The dataset is data2.xlsx. Proceed as above, but switch the Filter to ReplaceMissingValues, the missing-value imputation filter
  3. Feature selection:
    The dataset is data3.xlsx. Proceed as above, but switch the Filter to AttributeSelection, with the evaluator parameter set to InfoGainAttributeEval and the Ranker search set to 6

Python:

1 - Data normalization:

Full code in the appendix: jump there
The workflow:

  • First, get_data() reads the dataset from data1.xlsx and stores it in the Data class's ori_data
  • Then GetMinAndMax() finds the maximum and minimum of ori_data and stores them in the Data class
  • Finally, Normalize() maps the data onto [0,1], and RecordNormalizedData() writes the normalized data back into the original file

These functions are given by the following code:

    class Data():
        def __init__(self, sheet_data):
            self.maxA = 0
            self.minA = 0
            self.ori_data = self.get_data(sheet_data)
            self.normalized_data = list(self.ori_data)  # copy, so the raw column is preserved
            self.length = len(self.ori_data)
            self.GetMinAndMax()

        def get_data(self, sheet_data):
            # Read the data stored in the Excel sheet
            data = []
            i = 0
            while True:
                try:
                    data.append(sheet_data.cell(i + 1, 0).value)
                    i += 1
                except IndexError:
                    break
            return data

        def GetMinAndMax(self):
            # Find the minimum and maximum of the data
            temp_min = self.ori_data[0]
            temp_max = self.ori_data[0]
            for i in range(self.length):
                if self.ori_data[i] <= temp_min:
                    temp_min = self.ori_data[i]
                if self.ori_data[i] >= temp_max:
                    temp_max = self.ori_data[i]
            self.maxA = temp_max
            self.minA = temp_min

        def Normalize(self):
            # Min-max normalization
            for i in range(self.length):
                self.normalized_data[i] = (self.ori_data[i] - self.minA) / (self.maxA - self.minA)

        def RecordNormalizedData(self, sheet):
            # Write the normalized data back
            sheet.write(0, 1, 'Normalized Data')
            for i in range(self.length):
                sheet.write(i + 1, 1, round(self.normalized_data[i], 2))

2 - Missing-value handling:

Full code in the appendix: jump there
The core of this step is FillBlank(), which computes the mean of the data and fills it into the missing cells. It is defined as:

    def FillBlank(self):
        count = 0
        data_sum = 0
        blank_index = []
        for i in range(self.length):
            if self.ori_data[i] != '':  # xlrd returns '' for an empty cell
                count += 1
                data_sum += self.ori_data[i]
            else:
                blank_index.append(i)
        aver = data_sum / count
        for i in blank_index:
            self.filled_data[i] = round(aver, 2)

3 - Feature selection:

Full code in the appendix: jump there
The key of this experiment is to tally each attribute's values against the labels and then compute the information entropy and information gain.
The information entropy is given by:

$$Ent(D) = -\sum_{k} p_k \log_2 p_k$$

where $p_k$ is the proportion of class k in dataset D. The code:

    def get_Ent_D(self):
        # Information entropy Ent_D at the root node
        count_true = 0
        count_false = 0
        for i in self.table[self.ncols-1]:
            if i == '是':        # '是'/'否' are the yes/no labels in the dataset
                count_true += 1
            elif i == '否':
                count_false += 1
        p_true = count_true / (count_true + count_false)
        p_false = count_false / (count_true + count_false)
        Ent_D = 0 - (p_true * math.log(p_true, 2) + p_false * math.log(p_false, 2))
        return Ent_D

Then compute each attribute's information gain:

$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \, Ent(D^v)$$

where $D^v$ is the subset of D taking the v-th value of attribute a. The code:

    def get_information_gain(self):
        # Information gain of this attribute
        Ent_D_i = []  # entropy of each value of the attribute
        p_D_i = []    # proportion of each value of the attribute
        sample_quantity = 0
        result = 0
        for i in self.dict_True:
            sample_quantity += self.dict_True[i]
        for i in self.dict_False:
            sample_quantity += self.dict_False[i]
        for i in self.status:
            total = self.dict_True[i] + self.dict_False[i]
            if self.dict_True[i] != 0:
                if self.dict_False[i] != 0:
                    temp_Ent_D_i = 0 - ((self.dict_True[i]/total) * math.log(self.dict_True[i]/total, 2) + (self.dict_False[i]/total) * math.log(self.dict_False[i]/total, 2))
                else:
                    temp_Ent_D_i = 0 - self.dict_True[i]/total * math.log(self.dict_True[i]/total, 2)
            else:
                temp_Ent_D_i = 0 - self.dict_False[i]/total * math.log(self.dict_False[i]/total, 2)
            temp_p = total / sample_quantity
            Ent_D_i.append(temp_Ent_D_i)
            p_D_i.append(temp_p)
        for i in range(len(Ent_D_i)):
            result += Ent_D_i[i] * p_D_i[i]
        result = self.Ent_D - result
        return result

6. Data and Results:

Normalization result:

[figure: data1.xlsx with the normalized column added]

The original data lies in [0,100]; after normalization it is mapped into [0,1]. The result matches expectations

Missing-value handling result:

[figure: data2.xlsx with the missing cells filled]

The missing cells were filled with the mean rounded to two decimal places. The result matches expectations

Feature selection result:

The original dataset:

[figure: original dataset]

The result:

[figure: information gain of each attribute]

The information gain of every discrete attribute was computed successfully. The result matches expectations

7. Conclusion:

The results match expectations; see the previous section for details

8. Summary and Takeaways:

The first and most important step of data mining is preprocessing. Through hands-on work I learned the preprocessing methods of data normalization, missing-value handling, and feature selection, laying a foundation for the data mining topics that follow

9. Suggested Improvements:

In the feature selection part, since no dataset was imposed, I only did feature selection on discrete attributes. Given time, information gain computation and feature selection for continuous attributes could be added later

Report grade:          

Advisor's signature:          


Experiment 2

1. Project Title:

Association Rule Mining

2. Principles:

First generate the frequent itemsets, then extract the high-confidence rules, i.e. the strong rules, from them. The Apriori property is used for pruning along the way
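
The two standard measures involved are, for itemsets X and Y over a transaction database D,

$$sup(X) = \frac{|\{t \in D : X \subseteq t\}|}{|D|}, \qquad conf(X \Rightarrow Y) = \frac{sup(X \cup Y)}{sup(X)}$$

and a rule counts as strong when both its support and its confidence clear the chosen thresholds.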

3. Objectives:

Master the basic concepts, principles, and general methods of association rule mining; master the Apriori algorithm; get to know the FP-Growth algorithm

4. Content:

Mine the association rules over the following dataset and return the strong rules

[figure: transaction dataset]

5. Procedure:

Full code in the appendix: jump there

Reading the dataset

First, ReadData() reads the dataset from data.xlsx into layer1:

    layer1 = ReadData(sheet)

    def ReadData(sheet):
        temp_layer = []
        for i in range(1, sheet.ncols):
            temp_node = Node(sheet.nrows - 1)
            temp_node.content.append(sheet.cell(0, i).value)
            count = 0
            for j in range(1, sheet.nrows):
                if sheet.cell(j, i).value == 1:
                    count += 1
                temp_node.data.append(int(sheet.cell(j, i).value))
            temp_node.sup = count
            temp_layer.append(temp_node)
        return temp_layer

Pruning

Then, by the Apriori property, any superset of an infrequent itemset must itself be infrequent, so all infrequent itemsets are pruned and the layer's nodes are updated. This is implemented by UpdateLayer():

    # Update the current layer: prune infrequent itemsets
    def UpdateLayer(ori_list, threshold):
        temp_list = []
        for i in ori_list:
            if i.sup > threshold:
                temp_list.append(i)
        return temp_list

Joining

After pruning, the remaining nodes of the layer are joined pairwise, and GenNewLayer() builds the next layer from them:

    # Join the previous layer to generate the next layer
    def GenNewLayer(ori_list):
        new_list = []
        # Pair each node only with the nodes after it,
        # so each candidate itemset is generated once
        for i in range(len(ori_list)):
            for j in range(i + 1, len(ori_list)):
                new_list.append(Merge(ori_list[i], ori_list[j]))
        return new_list

    # Merge nodes a and b
    def Merge(a, b):
        temp = Node(a.total)
        # Merge the contents of self.content, removing duplicates
        for i in a.content:
            temp.content.append(i)
        for i in b.content:
            if i not in a.content:
                temp.content.append(i)
        # Tally data and sup
        count = 0
        for i in range(a.total):
            if a.data[i] == b.data[i] and a.data[i] == 1:
                count += 1
                temp.data.append(1)
            else:
                temp.data.append(0)
        temp.sup = count
        return temp

Finally, it only remains to print all the strong rules; a confidence-filtering sketch is given below.
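
The appendix code prints each frequent pair together with sup/total. To filter rules by an explicit confidence threshold instead, here is a minimal sketch reusing Node, layer1, and layer2 from the code above (min_conf and strong_rules are assumed names introduced here, not part of the original code):

    def strong_rules(layer1, layer2, min_conf=0.6):
        # Confidence of A => B is sup({A, B}) / sup({A})
        sup1 = {node.content[0]: node.sup for node in layer1}  # single-item supports
        rules = []
        for pair in layer2:
            a, b = pair.content[0], pair.content[1]
            for ante, cons in ((a, b), (b, a)):
                conf = pair.sup / sup1[ante]
                if conf >= min_conf:
                    rules.append((ante, cons, conf))
        return rules

    for ante, cons, conf in strong_rules(layer1, layer2):
        print(ante, '=>', cons, '; confidence =', round(conf, 2))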

6. Data and Results:

[figure: mined rules, their transaction vectors, and scores]

The first field is the association rule; the second is the corresponding transaction vector, where 1 means the rule holds in that transaction; the third is the rule's confidence.

7. Conclusion:

With the Apriori algorithm, association rules can be mined without traversing all of the data. The strong rules in the output were verified to be correct; the experiment succeeded.

8. Summary and Takeaways:

The key to the Apriori algorithm is its contrapositive use: every superset of an infrequent itemset is infrequent. With it, after each join forms a new layer, the low-support infrequent itemsets can be pruned away, greatly reducing the amount of computation; it is a widely used association rule mining algorithm

9. Suggested Improvements:

Since I typed the data in by hand, the dataset used here is small and the computational savings are not yet noticeable. Testing on a larger dataset would make the performance advantage of Apriori obvious

Report grade:          

Advisor's signature:          


Experiment 3

1. Project Title:

Classification

2. Principles:

KNN, the k-nearest-neighbor algorithm: a test sample is placed among the training set, the k closest training samples are taken as its neighbors, and its class is predicted as the majority class among those k neighbors. It is a supervised but lazy learning method: no model is trained on the training samples beforehand.
The decision tree algorithm computes the information entropy and information gain of each attribute of the data, selects the attribute with the largest gain as the parent node, and repeats this process until most of the data can be classified correctly by the tree.
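
Concretely, the neighbor distance used here on the two iris features is the plain Euclidean distance, exactly what GetDist() implements later:

$$d(a, b) = \sqrt{(l_a - l_b)^2 + (w_a - w_b)^2}$$

where l and w denote petallength and petalwidth.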

3. Objectives:

Understand the basic concepts, principles, and general methods of classification; master the basic classification algorithms; implement the KNN and decision tree algorithms

4. Content:

KNN

First load the training data; then iterate over the test set, find the k nearest neighbors of each test sample, predict by the class most common among those neighbors, and compare the prediction against the sample's actual class

Decision tree:

First load the training data and compute the information gain of each attribute, then build the decision tree from the gains.
Then iterate over the test data, classify each test sample with the tree, and compare against the actual label

The two experiments share one dataset: the training set is iris.2D.train.arff and the test set is iris.2D.test.arff

5. Procedure:

KNN in detail

Full code in the appendix: jump there

Defining the data type

First define a Node class to hold each row of data:

    class Node():
        def __init__(self):
            self.petallength = 0
            self.petalwidth = 0
            self.classes = ''

After reading the data with GetData(), the prediction process described above is carried out by Predict():

    def GetDist(node_a, node_b):
        # Euclidean distance between two points
        distance = math.sqrt((node_a.petallength - node_b.petallength)**2 + (node_a.petalwidth - node_b.petalwidth)**2)
        return distance

    def Predict(train_list, node_a, k):
        # Predict the class of node_a
        dist_list = []   # distance to each training point
        index_list = []  # indices of the k nearest points
        for i in train_list:
            dist_list.append(GetDist(node_a, i))
        dist_copy = list(dist_list)  # copy, so the original distances are not clobbered
        for i in range(k):  # find the k nearest indices and store them in index_list
            temp_min = float('inf')
            temp_index = 0
            for j in range(len(train_list)):
                if dist_copy[j] <= temp_min:
                    temp_min = dist_copy[j]
                    temp_index = j
            dist_copy[temp_index] = float('inf')  # exclude this point from later rounds
            index_list.append(temp_index)
        classes = {}
        # Tally the classes of the k neighbors into the dict classes
        for i in index_list:
            if train_list[i].classes not in classes:
                classes[train_list[i].classes] = 1
            else:
                classes[train_list[i].classes] += 1
        result = ''
        max_times = 0
        # Take the most frequent class as the result
        for i in classes:
            if classes[i] >= max_times:
                max_times = classes[i]
                result = i
        return result

Last comes the driver RunTest(), which additionally tallies and prints the prediction accuracy:

    def RunTest(train_list, test_list, k):
        # Run the test
        count = 0
        for i in test_list:
            result = Predict(train_list, i, k)
            if i.classes == result:  # does the prediction match the label?
                count += 1
                print('sample', test_list.index(i), 'is correctly classified as', i.classes, '!')
        rate = float(count / len(test_list))
        return rate

Decision tree in detail

Full code in the appendix: jump there

First define a Node class as above, then define a Tree class for the decision tree's nodes:

    class Tree():
        def __init__(self):
            self.name = ''
            self.standard = 0
            self.left = None
            self.right = None

Then GetEnt() computes the information entropy and GetGain() the information gain of an attribute:

    def GetGain(train_list, name):
        # Information gain of attribute `name` under a binary split at its median
        if name == 'petallength':
            data_list = [i.petallength for i in train_list]
        else:  # 'petalwidth'
            data_list = [i.petalwidth for i in train_list]
        median = np.median(data_list)
        ent_d = GetEnt(train_list)
        # Both halves of the split contribute to the gain
        ge_list = [train_list[i] for i in range(len(train_list)) if data_list[i] >= median]
        lt_list = [train_list[i] for i in range(len(train_list)) if data_list[i] < median]
        gain = ent_d
        for part in (ge_list, lt_list):
            gain = gain - len(part) * GetEnt(part) / len(train_list)
        return gain

    def GetEnt(train_list):
        # Information entropy of the class distribution of train_list
        count = {}
        for i in train_list:
            if i.classes not in count:
                count[i.classes] = 1
            else:
                count[i.classes] += 1
        total = 0
        for i in count:
            total += count[i]
        ent = 0
        for i in count:
            ent = ent - count[i] / total * math.log(count[i] / total, 2)  # log base 2
        return ent

The above is the computation for discrete attributes; for a continuous attribute, the median is used for a binary split to convert it into a discrete attribute before the computation. The median is found by GetMedian(); the resulting gain formula is spelled out after the code.

    def GetMedian(train_list, name):
        if name == 'petallength':
            data_list = [i.petallength for i in train_list]
        else:  # 'petalwidth'
            data_list = [i.petalwidth for i in train_list]
        return np.median(data_list)
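
With t the median of attribute a, the binary-split gain that GetGain() computes is the Experiment 1 gain formula specialized to two subsets:

$$Gain(D, a, t) = Ent(D) - \frac{|D_{\ge t}|}{|D|} Ent(D_{\ge t}) - \frac{|D_{< t}|}{|D|} Ent(D_{< t})$$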

Finally, GenTree() generates the decision tree and returns its root node:

    def GenTree(train_list):
        root = Tree()
        gain = []
        gain.append(GetGain(train_list, 'petallength'))
        gain.append(GetGain(train_list, 'petalwidth'))
        if gain[0] >= gain[1]:
            root.name = 'petallength'
            root.standard = GetMedian(train_list, 'petallength')
            root.left = Tree()
            root.left.name = 'petalwidth'
            root.left.standard = GetMedian(train_list, 'petalwidth')
            root.left.left = Tree()
            root.left.left.name = '1'
            root.left.right = Tree()
            root.left.right.name = '2'
            root.right = Tree()
            root.right.name = 'petalwidth'
            root.right.standard = GetMedian(train_list, 'petalwidth')
            root.right.left = Tree()
            root.right.left.name = '3'
            root.right.right = Tree()
            root.right.right.name = '4'
        else:
            root.name = 'petalwidth'
            root.standard = GetMedian(train_list, 'petalwidth')
            root.left = Tree()
            root.left.name = 'petallength'
            root.left.standard = GetMedian(train_list, 'petallength')
            root.left.left = Tree()
            root.left.left.name = '1'
            root.left.right = Tree()
            root.left.right.name = '2'
            root.right = Tree()
            root.right.name = 'petallength'
            root.right.standard = GetMedian(train_list, 'petallength')
            root.right.left = Tree()
            root.right.left.name = '3'
            root.right.right = Tree()
            root.right.right.name = '4'
        return root

6. Data and Results:

KNN result:

[figure: KNN console output]

As shown, KNN's accuracy is quite high, reaching 96%

Decision tree result:

[figure: decision tree console output]

The decision tree's accuracy is lower, only 72%. Presumably the model of discretizing the continuous data by simply taking the median is too crude; still, the accuracy is basically acceptable.

7. Conclusion:

The results above show 96% accuracy for KNN, which is fairly high, while the decision tree reaches only 72%, so something likely went wrong in discretizing the continuous data. I then tried a discrete dataset and got 92% accuracy, which confirms the guess. Overall the experiment basically meets the requirements

8. Summary and Takeaways:

Classification is the most common kind of data processing in data mining: it predicts the class of new data from an existing dataset and is widely applied. In pattern recognition, for instance, the most basic underlying idea is classification. Having learned the basic methods of classification, I understand data mining a step more deeply

9. Suggested Improvements:

Rethink the discretization of continuous data and refine the decision tree algorithm

Report grade:          

Advisor's signature:          


Experiment 4

1. Project Title:

Clustering

2. Principles:

K-Means: first pick k random points in the data as cluster centers, then compute each point's distance to the k centers and assign the point to the cluster of the nearest center.
Then update each cluster's center and repeat, until the centers no longer move, or move less than a threshold, at which point iteration stops and the final clusters are obtained.
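
In equations, the standard K-Means steps assign each point to its nearest center and then recompute each center as its cluster's mean:

$$c(i) = \arg\min_{j} \lVert x_i - \mu_j \rVert, \qquad \mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$$

One detail of the implementation below: UpdateCenters() replaces the mean with the in-cluster point whose summed distance to the rest of the cluster is smallest, a medoid-style update that keeps every center on an actual data point.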

3. Objectives:

Understand the basic concepts, principles, and general methods of clustering; master the basic clustering algorithms; learn to handle k-means clustering with the WEKA package; and implement K-Means, DBSCAN, and similar algorithms myself

4. Content:

  1. Learn to handle k-means clustering with the WEKA package
  2. Implement the K-Means algorithm myself

Part of the raw data:

[figure: sample of the raw data]

5. Procedure:

Full code in the appendix: jump there

Initializing the class

First define a KMeans class to hold the clustering result:

    class KMeans():
        # Holds the k-means result
        def __init__(self, data_set, k):
            self.data_set = data_set               # raw dataset
            self.num = k                           # number of clusters
            self.data_result = self.InitResult(k)  # clustering result
        def InitResult(self, k):
            result_list = []
            for i in range(k):
                temp_list = [[], []]  # one [xs, ys] pair per cluster
                result_list.append(temp_list)
            return result_list

Assigning points and updating centers

GetData() then reads the required data. In RunTest(), k random cluster centers are initialized first; then GetCluster() assigns each point to its nearest center's cluster and UpdateCenters() recomputes each cluster's center:

    def RunTest(data_set, k, colors, threshold):
        new_centers = []
        centers = []
        randlist = random.sample(range(0, len(data_set[0])), k)
        # Initialize k cluster centers at random data points
        for i in randlist:
            temp_center = [data_set[0][i], data_set[1][i]]
            new_centers.append(temp_center)
        result = KMeans(data_set, k)
        # Stop iterating once the updated centers no longer change
        while new_centers != centers:
            centers = new_centers
            result = GetCluster(centers, data_set)
            new_centers = UpdateCenters(centers, result.data_result, threshold)
        DrawScatter(result, colors, centers)

The two functions, GetCluster() for assigning points to clusters and UpdateCenters() for updating each cluster's center, are given below:

    def GetCluster(centers, data_set):
        # Partition the points by the current centers; return the clustering
        result = KMeans(data_set, len(centers))
        for i in range(len(data_set[0])):
            # For each point, find the nearest center
            min_dist = float('inf')
            min_index = 0
            for j in range(len(centers)):
                temp_dist = math.sqrt((data_set[0][i] - centers[j][0])**2 + (data_set[1][i] - centers[j][1])**2)
                if temp_dist <= min_dist:
                    min_dist = temp_dist
                    min_index = j
            result.data_result[min_index][0].append(data_set[0][i])
            result.data_result[min_index][1].append(data_set[1][i])
        return result

    def UpdateCenters(centers, cluster, threshold):
        # Find the new cluster centers
        new_centers = []
        for i in range(len(centers)):
            # For each existing center...
            distance = []
            for j in range(len(cluster[i][0])):
                # ...treat each point of cluster i as a candidate center
                temp_center = [cluster[i][0][j], cluster[i][1][j]]
                dists = 0
                for m in range(len(cluster[i][0])):
                    # Sum the distances from the candidate to every point in the cluster
                    temp_dist = math.sqrt((temp_center[0] - cluster[i][0][m])**2 + (temp_center[1] - cluster[i][1][m])**2)
                    dists += temp_dist
                distance.append(dists)
            min_dist = distance[0]
            for n in distance:
                # Find the candidate with the smallest distance sum
                if n <= min_dist:
                    min_dist = n
            min_index = distance.index(min_dist)  # index of that candidate
            temp_center = [cluster[i][0][min_index], cluster[i][1][min_index]]
            # If old and new centers are closer than the threshold, keep the old one (we are done)
            delta = (centers[i][0] - temp_center[0])**2 + (centers[i][1] - temp_center[1])**2
            if delta >= threshold:
                new_centers.append(temp_center)
            else:
                new_centers.append(centers[i])
        return new_centers

Visualization

Finally, DrawScatter() plots each cluster and the centers, marked in different colors:

    def DrawScatter(result, colors, centers):
        # Reshape centers for plotting
        centers_trans = []
        temp_centersx = []
        temp_centersy = []
        for i in centers:
            temp_centersx.append(i[0])
            temp_centersy.append(i[1])
        centers_trans = [temp_centersx, temp_centersy]
        for i in range(len(result.data_result)):
            plt.scatter(result.data_result[i][0], result.data_result[i][1], c=colors[i], marker='.')
        plt.scatter(centers_trans[0], centers_trans[1], c=colors[len(result.data_result) + 1], marker='o')
        plt.show()

Parameters

RunTest() takes four parameters: data_set, k, colors, and threshold. data_set is the raw dataset, k the number of cluster centers, colors a list of color codes, and threshold the iteration cutoff; the smaller it is, the more precise the clustering.
Since k must be given in advance, several values of k have to be tried to find the best one; a sketch follows.
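
A simple way to compare candidate values of k is to run the clustering once per k and record the total within-cluster distance, then pick the k where the curve stops dropping sharply (the elbow heuristic). A minimal sketch reusing GetCluster() and UpdateCenters() from above (scan_k and total_within_dist are helper names introduced here, not part of the original code):

    def total_within_dist(result, centers):
        # Sum of point-to-center distances over all clusters
        s = 0
        for i in range(len(centers)):
            xs, ys = result.data_result[i]
            for x, y in zip(xs, ys):
                s += math.sqrt((x - centers[i][0])**2 + (y - centers[i][1])**2)
        return s

    def scan_k(data_set, k_values, threshold):
        costs = []
        for k in k_values:
            # Random initial centers, then iterate to convergence as in RunTest()
            new_centers = [[data_set[0][i], data_set[1][i]]
                           for i in random.sample(range(len(data_set[0])), k)]
            centers = []
            while new_centers != centers:
                centers = new_centers
                result = GetCluster(centers, data_set)
                new_centers = UpdateCenters(centers, result.data_result, threshold)
            costs.append(total_within_dist(result, centers))
        return costs  # plot against k_values and look for the elbow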

6. Data and Results:

Since the number of centers k and the iteration threshold have to be supplied by hand, different parameter combinations were tried first:

k=2, threshold=1

The result:

[figure: clustering with k=2]

Clearly there should be 3 clusters, so k=3 is more reasonable

k=3, threshold=1

The result:

[figure: clustering with k=3, threshold=1]

The clustering is distinct: the yellow and red centers look good, while the blue cluster's center is slightly off, so threshold can be reduced a little

k=3, threshold=0.01

The result:

[figure: clustering with k=3, threshold=0.01]

The clusters and their centers are clearly much better now, basically matching the expected result

Error analysis

Because the k initial centers are random and the clustering result depends on them, the situation shown below occasionally occurs. Running the algorithm several times and averaging avoids this kind of chance error:

[figure: a degenerate clustering caused by unlucky initialization]

7. Conclusion:

K-Means clusters unlabeled data effectively on its own and is a very practical unsupervised learning method. The results above basically match expectations

8. Summary and Takeaways:

K-Means is an algorithm you love and hate at once: it is very handy and produces clusters at little computational cost, but because the result depends on the randomly chosen initial centers, several runs are needed to wash out the random error, which quietly adds computation again

9. Suggested Improvements:

The overall implementation is fine; a next step could be automatic tuning, so that the two parameters k and threshold are derived automatically

Report grade:          

Advisor's signature:          


Appendix

Experiment 1 Code

Data Normalization

Normalization.py

    #!/usr/bin/python3
    #coding=utf-8
    import xlwt
    import xlrd
    from xlutils.copy import copy

    class Data():
        def __init__(self, sheet_data):
            self.maxA = 0
            self.minA = 0
            self.ori_data = self.get_data(sheet_data)
            self.normalized_data = list(self.ori_data)  # copy of the raw column
            self.length = len(self.ori_data)
            self.GetMinAndMax()

        def get_data(self, sheet_data):
            # Read the data stored in the Excel sheet
            data = []
            i = 0
            while True:
                try:
                    data.append(sheet_data.cell(i + 1, 0).value)
                    i += 1
                except IndexError:
                    break
            return data

        def GetMinAndMax(self):
            # Find the minimum and maximum of the data
            temp_min = self.ori_data[0]
            temp_max = self.ori_data[0]
            for i in range(self.length):
                if self.ori_data[i] <= temp_min:
                    temp_min = self.ori_data[i]
                if self.ori_data[i] >= temp_max:
                    temp_max = self.ori_data[i]
            self.maxA = temp_max
            self.minA = temp_min

        def Normalize(self):
            # Min-max normalization
            for i in range(self.length):
                self.normalized_data[i] = (self.ori_data[i] - self.minA) / (self.maxA - self.minA)

        def RecordNormalizedData(self, sheet):
            # Write the normalized data back
            sheet.write(0, 1, 'Normalized Data')
            for i in range(self.length):
                sheet.write(i + 1, 1, round(self.normalized_data[i], 2))

    book = xlrd.open_workbook('data1.xlsx')
    sheet_data = book.sheet_by_index(0)
    book_wt = copy(book)
    sheet_wt = book_wt.get_sheet(0)
    test_1 = Data(sheet_data)
    # Normalize and write the result back
    test_1.Normalize()
    test_1.RecordNormalizedData(sheet_wt)
    book_wt.save('data1.xlsx')

Back to Experiment 1

Missing-Value Handling

fill_blank.py

    #!/usr/bin/python3
    #coding=utf-8
    import xlrd
    import xlwt
    from xlutils.copy import copy

    class Data():
        def __init__(self, sheet):
            self.ori_data = self.get_data(sheet)
            self.length = len(self.ori_data)
            self.filled_data = list(self.ori_data)  # copy of the raw column

        def get_data(self, sheet):
            # Read the data stored in the Excel sheet
            data = []
            i = 0
            while True:
                try:
                    data.append(sheet.cell(i + 1, 0).value)
                    i += 1
                except IndexError:
                    break
            return data

        def FillBlank(self):
            count = 0
            data_sum = 0
            blank_index = []
            for i in range(self.length):
                if self.ori_data[i] != '':  # xlrd returns '' for an empty cell
                    count += 1
                    data_sum += self.ori_data[i]
                else:
                    blank_index.append(i)
            aver = data_sum / count
            for i in blank_index:
                self.filled_data[i] = round(aver, 2)

        def RcordFilledData(self, sheet):
            sheet.write(0, 1, 'Filled Data')
            for i in range(self.length):
                sheet.write(i + 1, 1, self.filled_data[i])

    data_excel = xlrd.open_workbook('data2.xlsx')
    data_sheet = data_excel.sheet_by_index(0)
    data_wt = copy(data_excel)
    sheet_wt = data_wt.get_sheet(0)
    data = Data(data_sheet)
    data.FillBlank()
    data.RcordFilledData(sheet_wt)
    data_wt.save('data2.xlsx')

Back to Experiment 1

Feature Selection

feature_select.py

    #!/usr/bin/python3
    #coding=utf-8
    import xlrd
    import xlwt
    from xlutils.copy import copy
    import math

    class Data():
        def __init__(self, sheet):
            self.nrows = sheet.nrows
            self.ncols = sheet.ncols
            self.table = self.get_data(sheet)
            self.Ent_D = self.get_Ent_D()

        def get_data(self, sheet):
            temp_table = []
            for i in range(self.ncols):
                temp_col = sheet.col_values(i)[0:self.nrows]
                temp_table.append(temp_col)
            return temp_table

        def get_feature_information(self, n):
            # Collect the statistics for column n
            temp_feature = Feature(self.table[n][0], self.Ent_D)
            dict_True = {}
            dict_False = {}
            # Tally the values against the labels in the last column
            # ('是'/'否' are the yes/no labels used in the dataset)
            for i in range(1, self.nrows):
                if self.table[self.ncols-1][i] == '是':
                    if self.table[n][i] in dict_True:
                        dict_True[self.table[n][i]] += 1
                    else:
                        dict_True.update({self.table[n][i]: 1})
                elif self.table[self.ncols-1][i] == '否':
                    if self.table[n][i] in dict_False:
                        dict_False[self.table[n][i]] += 1
                    else:
                        dict_False.update({self.table[n][i]: 1})
            # Make sure both dicts share the same keys
            for i in dict_True:
                if i not in dict_False:
                    dict_False.update({i: 0})
            for i in dict_False:
                if i not in dict_True:
                    dict_True.update({i: 0})
            temp_feature.dict_True = dict_True
            temp_feature.dict_False = dict_False
            # Store each value of the attribute in the list status
            for i in dict_True:
                if i not in temp_feature.status:
                    temp_feature.status.append(i)
            for i in dict_False:
                if i not in temp_feature.status:
                    temp_feature.status.append(i)
            temp_feature.information_gain = temp_feature.get_information_gain()
            return temp_feature

        def get_Ent_D(self):
            # Information entropy Ent_D at the root node
            count_true = 0
            count_false = 0
            for i in self.table[self.ncols-1]:
                if i == '是':
                    count_true += 1
                elif i == '否':
                    count_false += 1
            p_true = count_true / (count_true + count_false)
            p_false = count_false / (count_true + count_false)
            Ent_D = 0 - (p_true * math.log(p_true, 2) + p_false * math.log(p_false, 2))
            return Ent_D

    class Feature():
        # One attribute of the samples, e.g. '色泽'
        def __init__(self, name, Ent_D):
            self.name = name
            self.dict_True = {}
            self.dict_False = {}
            self.status = []
            self.Ent_D = Ent_D
            self.information_gain = 0

        def get_information_gain(self):
            # Information gain of this attribute
            Ent_D_i = []  # entropy of each value of the attribute
            p_D_i = []    # proportion of each value of the attribute
            sample_quantity = 0
            result = 0
            for i in self.dict_True:
                sample_quantity += self.dict_True[i]
            for i in self.dict_False:
                sample_quantity += self.dict_False[i]
            for i in self.status:
                total = self.dict_True[i] + self.dict_False[i]
                if self.dict_True[i] != 0:
                    if self.dict_False[i] != 0:
                        temp_Ent_D_i = 0 - ((self.dict_True[i]/total) * math.log(self.dict_True[i]/total, 2) + (self.dict_False[i]/total) * math.log(self.dict_False[i]/total, 2))
                    else:
                        temp_Ent_D_i = 0 - self.dict_True[i]/total * math.log(self.dict_True[i]/total, 2)
                else:
                    temp_Ent_D_i = 0 - self.dict_False[i]/total * math.log(self.dict_False[i]/total, 2)
                temp_p = total / sample_quantity
                Ent_D_i.append(temp_Ent_D_i)
                p_D_i.append(temp_p)
            for i in range(len(Ent_D_i)):
                result += Ent_D_i[i] * p_D_i[i]
            result = self.Ent_D - result
            return result

    book = xlrd.open_workbook('data3.xlsx')
    sheet_rd = book.sheet_by_index(0)
    book_wt = copy(book)
    sheet_wt = book_wt.get_sheet(0)
    data = Data(sheet_rd)
    features = []
    for i in range(1, data.ncols-3):
        temp_feature = data.get_feature_information(i)
        features.append(temp_feature)
    sheet_wt2 = book_wt.get_sheet(1)  # results go on the second sheet
    sheet_wt2.write(1, 0, '信息增益')
    for i in range(1, data.ncols-3):
        sheet_wt2.write(0, i, features[i-1].name)
        sheet_wt2.write(1, i, round(features[i-1].information_gain, 3))
    book_wt.save('data3.xlsx')
    print('Ent_D of the root node is:', round(data.Ent_D, 3))
    for i in features:
        print('the information gain of', i.name, 'is:', round(i.information_gain, 3))

Back to Experiment 1

Experiment 2 Code

Association Rule Mining

dig_rules.py

    #!/usr/bin/python3
    #coding=utf-8
    import xlrd
    import xlwt

    class Node():
        def __init__(self, total):
            self.content = []   # item names in this itemset
            self.data = []      # 0/1 vector over the transactions
            self.sup = 0        # support count
            self.total = total  # number of transactions

    # Update the current layer: prune infrequent itemsets
    def UpdateLayer(ori_list, threshold):
        temp_list = []
        for i in ori_list:
            if i.sup > threshold:
                temp_list.append(i)
        return temp_list

    # Join the previous layer to generate the next layer
    def GenNewLayer(ori_list):
        new_list = []
        # Pair each node only with the nodes after it,
        # so each candidate itemset is generated once
        for i in range(len(ori_list)):
            for j in range(i + 1, len(ori_list)):
                new_list.append(Merge(ori_list[i], ori_list[j]))
        return new_list

    # Merge nodes a and b
    def Merge(a, b):
        temp = Node(a.total)
        # Merge the contents, removing duplicates
        for i in a.content:
            temp.content.append(i)
        for i in b.content:
            if i not in a.content:
                temp.content.append(i)
        # Tally data and sup
        count = 0
        for i in range(a.total):
            if a.data[i] == b.data[i] and a.data[i] == 1:
                count += 1
                temp.data.append(1)
            else:
                temp.data.append(0)
        temp.sup = count
        return temp

    def ReadData(sheet):
        temp_layer = []
        for i in range(1, sheet.ncols):
            temp_node = Node(sheet.nrows - 1)
            temp_node.content.append(sheet.cell(0, i).value)
            count = 0
            for j in range(1, sheet.nrows):
                if sheet.cell(j, i).value == 1:
                    count += 1
                temp_node.data.append(int(sheet.cell(j, i).value))
            temp_node.sup = count
            temp_layer.append(temp_node)
        return temp_layer

    book = xlrd.open_workbook('data.xlsx')
    sheet = book.sheet_by_index(0)
    layer1 = ReadData(sheet)
    layer1 = UpdateLayer(layer1, 2)
    layer2 = GenNewLayer(layer1)
    layer2 = UpdateLayer(layer2, 2)
    for i in layer2:
        print(i.content, ';', i.data, ';', i.sup / i.total)

Back to Experiment 2

Experiment 3 Code


KNN Classification

knn_train&test.py

    #!/usr/bin/python3
    #coding=utf-8
    # KNN
    from scipy.io import arff
    import pandas as pd
    import math

    class Node():
        def __init__(self):
            self.petallength = 0
            self.petalwidth = 0
            self.classes = ''

    def GetData(df):
        # Read the arff rows into Node objects collected in data_list
        data_list = []
        ptl = df['petallength'].values
        ptw = df['petalwidth'].values
        classes = df['class'].values
        num = len(ptl)
        for i in range(num):
            temp_node = Node()
            temp_node.petallength = ptl[i]
            temp_node.petalwidth = ptw[i]
            temp_node.classes = classes[i]
            data_list.append(temp_node)
        return data_list

    def GetDist(node_a, node_b):
        # Euclidean distance between two points
        distance = math.sqrt((node_a.petallength - node_b.petallength)**2 + (node_a.petalwidth - node_b.petalwidth)**2)
        return distance

    def Predict(train_list, node_a, k):
        # Predict the class of node_a
        dist_list = []   # distance to each training point
        index_list = []  # indices of the k nearest points
        for i in train_list:
            dist_list.append(GetDist(node_a, i))
        dist_copy = list(dist_list)  # copy, so the original list is not clobbered
        for i in range(k):  # find the k nearest indices
            temp_min = float('inf')
            temp_index = 0
            for j in range(len(train_list)):
                if dist_copy[j] <= temp_min:
                    temp_min = dist_copy[j]
                    temp_index = j
            dist_copy[temp_index] = float('inf')  # exclude this point from later rounds
            index_list.append(temp_index)
        classes = {}
        # Tally the classes of the k neighbors
        for i in index_list:
            if train_list[i].classes not in classes:
                classes[train_list[i].classes] = 1
            else:
                classes[train_list[i].classes] += 1
        result = ''
        max_times = 0
        # Take the most frequent class as the result
        for i in classes:
            if classes[i] >= max_times:
                max_times = classes[i]
                result = i
        return result

    def RunTest(train_list, test_list, k):
        # Run the test
        count = 0
        for i in test_list:
            result = Predict(train_list, i, k)
            if i.classes == result:  # does the prediction match the label?
                count += 1
                print('sample', test_list.index(i), 'is correctly classified as', i.classes, '!')
        rate = float(count / len(test_list))
        return rate

    data_train = arff.loadarff('iris.2D.train.arff')
    df_train = pd.DataFrame(data_train[0])
    train_list = GetData(df_train)
    data_test = arff.loadarff('iris.2D.test.arff')
    df_test = pd.DataFrame(data_test[0])
    test_list = GetData(df_test)
    rate = RunTest(train_list, test_list, 3)
    # Print the final accuracy
    print('the accuracy of classification is:', rate)

Back to Experiment 3

Decision Tree Classification

decision_tree.py

    #!/usr/bin/python3
    #coding=utf-8
    # Decision tree
    from scipy.io import arff
    import pandas as pd
    import numpy as np
    import math

    class Node():
        def __init__(self):
            self.petallength = 0
            self.petalwidth = 0
            self.classes = ''

    class Tree():
        def __init__(self):
            self.name = ''
            self.standard = 0
            self.left = None
            self.right = None

    def GetData(df):
        # Read the arff rows into Node objects collected in data_list
        data_list = []
        ptl = df['petallength'].values
        ptw = df['petalwidth'].values
        classes = df['class'].values
        num = len(ptl)
        for i in range(num):
            temp_node = Node()
            temp_node.petallength = ptl[i]
            temp_node.petalwidth = ptw[i]
            temp_node.classes = classes[i]
            data_list.append(temp_node)
        return data_list

    def GenTree(train_list):
        root = Tree()
        gain = []
        gain.append(GetGain(train_list, 'petallength'))
        gain.append(GetGain(train_list, 'petalwidth'))
        if gain[0] >= gain[1]:
            root.name = 'petallength'
            root.standard = GetMedian(train_list, 'petallength')
            root.left = Tree()
            root.left.name = 'petalwidth'
            root.left.standard = GetMedian(train_list, 'petalwidth')
            root.left.left = Tree()
            root.left.left.name = '1'
            root.left.right = Tree()
            root.left.right.name = '2'
            root.right = Tree()
            root.right.name = 'petalwidth'
            root.right.standard = GetMedian(train_list, 'petalwidth')
            root.right.left = Tree()
            root.right.left.name = '3'
            root.right.right = Tree()
            root.right.right.name = '4'
        else:
            root.name = 'petalwidth'
            root.standard = GetMedian(train_list, 'petalwidth')
            root.left = Tree()
            root.left.name = 'petallength'
            root.left.standard = GetMedian(train_list, 'petallength')
            root.left.left = Tree()
            root.left.left.name = '1'
            root.left.right = Tree()
            root.left.right.name = '2'
            root.right = Tree()
            root.right.name = 'petallength'
            root.right.standard = GetMedian(train_list, 'petallength')
            root.right.left = Tree()
            root.right.left.name = '3'
            root.right.right = Tree()
            root.right.right.name = '4'
        return root

    def GetGain(train_list, name):
        # Information gain of attribute `name` under a binary split at its median
        if name == 'petallength':
            data_list = [i.petallength for i in train_list]
        else:  # 'petalwidth'
            data_list = [i.petalwidth for i in train_list]
        median = np.median(data_list)
        ent_d = GetEnt(train_list)
        # Both halves of the split contribute to the gain
        ge_list = [train_list[i] for i in range(len(train_list)) if data_list[i] >= median]
        lt_list = [train_list[i] for i in range(len(train_list)) if data_list[i] < median]
        gain = ent_d
        for part in (ge_list, lt_list):
            gain = gain - len(part) * GetEnt(part) / len(train_list)
        return gain

    def GetEnt(train_list):
        # Information entropy of the class distribution of train_list
        count = {}
        for i in train_list:
            if i.classes not in count:
                count[i.classes] = 1
            else:
                count[i.classes] += 1
        total = 0
        for i in count:
            total += count[i]
        ent = 0
        for i in count:
            ent = ent - count[i] / total * math.log(count[i] / total, 2)  # log base 2
        return ent

    def GetMedian(train_list, name):
        if name == 'petallength':
            data_list = [i.petallength for i in train_list]
        else:  # 'petalwidth'
            data_list = [i.petalwidth for i in train_list]
        return np.median(data_list)

    def print_all(root):
        if root.left is not None:
            print('name:', root.name, 'standard:', root.standard)
            print_all(root.left)
            print_all(root.right)
        if root.left is None:
            print('type:', root.name)

    def RunTest(root, test_list, name):
        count = 0
        for i in test_list:
            if root.name == 'petallength':
                # petallength first, then petalwidth
                if i.petallength >= root.standard:
                    # left subtree
                    if i.petalwidth >= root.left.standard:
                        pred = root.left.left.name
                    else:
                        pred = root.left.right.name
                else:
                    # right subtree
                    if i.petalwidth >= root.right.standard:
                        pred = root.right.left.name
                    else:
                        pred = root.right.right.name
            else:
                # petalwidth first, then petallength
                if i.petalwidth >= root.standard:
                    if i.petallength >= root.left.standard:
                        pred = root.left.left.name
                    else:
                        pred = root.left.right.name
                else:
                    if i.petallength >= root.right.standard:
                        pred = root.right.left.name
                    else:
                        pred = root.right.right.name
            result = ''
            # Map the leaf number to a class name
            if pred == '1':
                result = name[2]
            elif pred == '2':
                result = name[1]
            elif pred == '3' or pred == '4':
                result = name[0]
            if result == i.classes:
                count += 1
                print('case', test_list.index(i), 'is correctly predicted as', i.classes)
        rate = float(count / len(test_list))
        return rate

    data_train = arff.loadarff('iris.2D.train.arff')
    df_train = pd.DataFrame(data_train[0])
    train_list = GetData(df_train)
    data_test = arff.loadarff('iris.2D.test.arff')
    df_test = pd.DataFrame(data_test[0])
    test_list = GetData(df_test)
    decision_tree = GenTree(train_list)
    name = []
    for i in train_list:
        if i.classes not in name:
            name.append(i.classes)
    rate = RunTest(decision_tree, test_list, name)
    print('the accuracy of prediction is:', rate)

Back to Experiment 3

Experiment 4 Code


K-Means Clustering

k_means.py

    #!/usr/bin/python3
    #coding=utf-8
    import xlrd
    import xlwt
    import matplotlib.pyplot as plt
    import math
    import random

    class KMeans():
        # Holds the k-means result
        def __init__(self, data_set, k):
            self.data_set = data_set               # raw dataset
            self.num = k                           # number of clusters
            self.data_result = self.InitResult(k)  # clustering result
        def InitResult(self, k):
            result_list = []
            for i in range(k):
                temp_list = [[], []]  # one [xs, ys] pair per cluster
                result_list.append(temp_list)
            return result_list

    def GetData(sheet):
        data_set = []
        col_1 = sheet.col_values(0)[1:]
        col_2 = sheet.col_values(1)[1:]
        data_set.append(col_1)
        data_set.append(col_2)
        return data_set

    def DrawScatter(result, colors, centers):
        # Reshape centers for plotting
        centers_trans = []
        temp_centersx = []
        temp_centersy = []
        for i in centers:
            temp_centersx.append(i[0])
            temp_centersy.append(i[1])
        centers_trans = [temp_centersx, temp_centersy]
        for i in range(len(result.data_result)):
            plt.scatter(result.data_result[i][0], result.data_result[i][1], c=colors[i], marker='.')
        plt.scatter(centers_trans[0], centers_trans[1], c=colors[len(result.data_result) + 1], marker='o')
        plt.show()

    def GetCluster(centers, data_set):
        # Partition the points by the current centers; return the clustering
        result = KMeans(data_set, len(centers))
        for i in range(len(data_set[0])):
            # For each point, find the nearest center
            min_dist = float('inf')
            min_index = 0
            for j in range(len(centers)):
                temp_dist = math.sqrt((data_set[0][i] - centers[j][0])**2 + (data_set[1][i] - centers[j][1])**2)
                if temp_dist <= min_dist:
                    min_dist = temp_dist
                    min_index = j
            result.data_result[min_index][0].append(data_set[0][i])
            result.data_result[min_index][1].append(data_set[1][i])
        return result

    def UpdateCenters(centers, cluster, threshold):
        # Find the new cluster centers
        new_centers = []
        for i in range(len(centers)):
            distance = []
            for j in range(len(cluster[i][0])):
                # Treat each point of cluster i as a candidate center
                temp_center = [cluster[i][0][j], cluster[i][1][j]]
                dists = 0
                for m in range(len(cluster[i][0])):
                    # Sum the distances from the candidate to every point in the cluster
                    temp_dist = math.sqrt((temp_center[0] - cluster[i][0][m])**2 + (temp_center[1] - cluster[i][1][m])**2)
                    dists += temp_dist
                distance.append(dists)
            min_dist = distance[0]
            for n in distance:
                # Candidate with the smallest distance sum
                if n <= min_dist:
                    min_dist = n
            min_index = distance.index(min_dist)  # index of that candidate
            temp_center = [cluster[i][0][min_index], cluster[i][1][min_index]]
            # If the center moved less than the threshold, keep the old one
            delta = (centers[i][0] - temp_center[0])**2 + (centers[i][1] - temp_center[1])**2
            if delta >= threshold:
                new_centers.append(temp_center)
            else:
                new_centers.append(centers[i])
        return new_centers

    def RunTest(data_set, k, colors, threshold):
        new_centers = []
        centers = []
        randlist = random.sample(range(0, len(data_set[0])), k)
        # Initialize k cluster centers at random data points
        for i in randlist:
            temp_center = [data_set[0][i], data_set[1][i]]
            new_centers.append(temp_center)
        result = KMeans(data_set, k)
        # Stop iterating once the updated centers no longer change
        while new_centers != centers:
            centers = new_centers
            result = GetCluster(centers, data_set)
            new_centers = UpdateCenters(centers, result.data_result, threshold)
        DrawScatter(result, colors, centers)

    book = xlrd.open_workbook('dataset.xlsx')
    sheet = book.sheet_by_index(0)
    data_set = GetData(sheet)
    colors = ['b', 'r', 'y', 'k', 'g', 'm', 'c', 'w']
    RunTest(data_set, 3, colors, 0.01)

Back to Experiment 4
