@EVA001 2017-12-09T13:29:10.000000Z  Word count: 9217  Views: 546

date: 2017-11-12
categories: Paper reading
[Python, Dataset]
comments: true

title: Parsing the DBLP dataset with Python

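Using DBLP

Overall, DBLP records carry few fields: only the basics, such as paper title, year, authors, publication type, and journal or conference name. Tags and keywords, which many people would want, are absent. Even so, a lot can be mined from these basic fields; the statistics published on the official site alone suggest many directions.
The keywords that come to mind when I think of DBLP: classic complex networks, small-world, scale-free, co-authorship networks, relationship recommendation, clustering, link prediction, random walks, central-author analysis, author influence analysis, the evolution of research hot spots, and many more. DBLP is therefore a rich and valuable resource.
Quoted from: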
http://blog.csdn.net/frontend922/article/details/18552077

Downloading DBLP

dblp.dtd        2017-08-29 16:23    13K  
dblp.xml.gz     2017-11-10 20:26    393M
XML download link     http://dblp.uni-trier.de/xml/
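
The download is a gzip archive, so it has to be unpacked before a parser can read it as dblp.xml. A minimal sketch using only the standard library follows; the function name `gunzip` and the file names in the usage comment are assumptions made for illustration, not part of the original workflow.

```python
import gzip
import shutil


def gunzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress src_path (a .gz file) into dst_path."""
    # Copying in chunks avoids holding the whole uncompressed XML in memory.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# Usage with hypothetical local paths:
#     gunzip("dblp.xml.gz", "dblp.xml")
```

Keeping dblp.dtd in the same directory as the unpacked dblp.xml is a sensible default, because the XML file refers to it by a relative name.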

Sample of the raw DBLP dataset

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE dblp SYSTEM "dblp.dtd">
    <dblp>
    <article mdate="2017-05-28" key="journals/acta/Saxena96">
    <author>Sanjeev Saxena</author>
    <title>Parallel IntegerSimulation Amongst CRCW Models.</title>
    <pages>607-619</pages>
    <year>1996</year>
    <volume>33</volume>
    <journal>Acta Inf.</journal>
    <number>7</number>
    <url>db/journals/acta/acta33.html#Saxena96</url>
    <ee>https://doi.org/10.1007/BF03036466</ee>
    </article>
    <article mdate="2017-05-28" key="journals/acta/Simon83">
    <author>Hans Ulrich Simon</author>
    <title>Pattern Matching in Trees and Nets.</title>
    <pages>227-248</pages>
    <year>1983</year>
    <volume>20</volume>
    <journal>Acta Inf.</journal>
    <url>db/journals/acta/acta20.html#Simon83</url>
    <ee>https://doi.org/10.1007/BF01257084</ee>
    </article>
    </dblp>
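
To make the record layout concrete, here is a small sketch that loads one record of the sample with `xml.etree.ElementTree` and turns it into a dictionary. This is only an illustration of the structure (attributes on the record tag, one child element per field, fields optional and possibly repeated); the helper name `record_to_dict` is an assumption, and loading a whole tree like this is only reasonable for a small excerpt, not for the full file.

```python
import xml.etree.ElementTree as ET

# One record copied from the sample; declaration and DOCTYPE are left out.
SAMPLE = """<dblp>
<article mdate="2017-05-28" key="journals/acta/Simon83">
<author>Hans Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>https://doi.org/10.1007/BF01257084</ee>
</article>
</dblp>"""


def record_to_dict(elem):
    """Turn one record element into a dict; every field maps to a list."""
    record = {"type": elem.tag, "mdate": elem.get("mdate"), "key": elem.get("key")}
    for child in elem:
        # itertext() also picks up text nested inside inline markup
        record.setdefault(child.tag, []).append("".join(child.itertext()))
    return record


records = [record_to_dict(e) for e in ET.fromstring(SAMPLE)]
print(records[0]["key"], records[0]["author"], records[0]["year"])
```

Storing every field as a list keeps single and repeated tags uniform; this record has no `<number>` child, so that key is simply absent from the dictionary.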

CREATE TABLE statement for the DBLP dataset

    /*
    Navicat MySQL Data Transfer
    Source Server         : localmysql
    Source Server Version : 50540
    Source Host           : localhost:3306
    Source Database       : visual_dataset
    Target Server Type    : MYSQL
    Target Server Version : 50540
    File Encoding         : 65001
    Date: 2017-11-11 17:44:38
    */
    SET FOREIGN_KEY_CHECKS=0;
    -- ----------------------------
    -- Table structure for dblp
    -- ----------------------------
    DROP TABLE IF EXISTS `dblp`;
    CREATE TABLE `dblp` (
      `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
      `article_mdate` varchar(255) DEFAULT NULL,
      `article_key` varchar(255) DEFAULT NULL,
      `author` varchar(255) DEFAULT NULL,
      `title` varchar(255) DEFAULT NULL,
      `pages` varchar(255) DEFAULT NULL,
      `year` varchar(255) DEFAULT NULL,
      `volume` varchar(255) DEFAULT NULL,
      `journal` varchar(255) DEFAULT NULL,
      `number` varchar(255) DEFAULT NULL,
      `url` varchar(255) DEFAULT NULL,
      `ee` varchar(255) DEFAULT NULL,
      `x2` varchar(255) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=gbk;

Code for parsing dblp.xml and importing it into MySQL

    # -*- coding: utf-8 -*-
    """
    The original code only parsed the data into a text file and did not
    handle repeated fields such as:
        <article>
            <author>Mr.A</author>
            <author>Mr.B</author>
        </article>
    This version fixes that and then imports the parsed fields into a database.
    Input: dblp.xml, 2.01 GB
    Rows imported into MySQL: 1.7 million+
    Target table: visual_dataset.dblp
    Backup file generated: insert.sql
    @author: Administrator
    """
    import logging
    import traceback
    import xml.sax

    import pymysql.cursors

    # Every executed INSERT statement is also written to this log file.
    logging.basicConfig(level=logging.DEBUG,
                        format='%(message)s',
                        filename='I:\\ABC000000000000\\Dblp\\simple\\app.log',
                        filemode='w')

    # Child elements of <article> that map to columns of the dblp table.
    FIELDS = ("author", "title", "pages", "year", "volume",
              "journal", "number", "url", "ee")


    class MovieHandler(xml.sax.ContentHandler):
        """Collects the fields of each <article> and inserts one row per article."""

        def __init__(self):
            super().__init__()
            self.record = None   # fields of the <article> being parsed, or None
            self.field = None    # name of the child element being read, or None
            self.buffer = []     # text chunks of the current child element

        # Start-of-element event, called for every opening tag.
        def startElement(self, tag, attributes):
            if tag == "article":
                # mdate and key are attributes of the <article> tag itself
                self.record = {"article_mdate": attributes.get("mdate"),
                               "article_key": attributes.get("key")}
            elif self.record is not None and self.field is None and tag in FIELDS:
                self.field = tag
                self.buffer = []

        # Content event. The parser may deliver the text of one element in
        # several chunks, so the chunks are accumulated, not overwritten.
        def characters(self, content):
            if self.field is not None:
                self.buffer.append(content)

        # End-of-element event: store the finished field, or insert the
        # finished article (this also covers the last article in the file).
        def endElement(self, tag):
            if self.record is None:
                return
            if tag == self.field:
                value = "".join(self.buffer).strip()
                if tag in self.record:
                    # repeated fields (several <author> or <ee>) are comma-joined
                    self.record[tag] += "," + value
                else:
                    self.record[tag] = value
                self.field = None
            elif tag == "article":
                try:
                    insert_mysql(self.record)
                except Exception:
                    traceback.print_exc()
                self.record = None


    def insert_mysql(record):
        """Standalone function: insert one parsed article into MySQL."""
        global count
        columns = list(record)
        # Column names come from the fixed lists above; the values are bound
        # as parameters, so quotes in names or titles need no manual escaping.
        sql = "INSERT INTO `dblp` ({}) VALUES ({})".format(
            ",".join("`{}`".format(c) for c in columns),
            ",".join(["%s"] * len(columns)))
        values = [record[c] for c in columns]
        try:
            # Stored via PyMySQL: github.com/PyMySQL/PyMySQL
            with connection.cursor() as cursor:
                logging.info(cursor.mogrify(sql, values) + ';')
                cursor.execute(sql, values)
            # The connection does not autocommit, so commit explicitly.
            connection.commit()
            count += 1
            print('parsed and inserted items: ' + str(count))
        except Exception:
            logging.error(traceback.format_exc())


    # When this file is run directly, __name__ is "__main__".
    if __name__ == "__main__":
        count = 0
        XMLFPATH = "I:\\ABC000000000000\\Dblp\\dblp.xml"
        parser = xml.sax.make_parser()
        parser.setFeature(xml.sax.handler.feature_namespaces, 0)
        # Since Python 3.7.1 xml.sax no longer loads external entities by
        # default. Enabling them lets the parser read dblp.dtd (assumed to sit
        # next to dblp.xml), which declares the character entities used in names.
        parser.setFeature(xml.sax.handler.feature_external_ges, True)
        parser.setContentHandler(MovieHandler())
        connection = pymysql.connect(
            host='localhost',
            user='root',
            password='123',
            database='visual_dataset',
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor)
        try:
            parser.parse(XMLFPATH)
        finally:
            connection.close()
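
The delicate part of the import is building an INSERT for a record whose set of fields varies and whose values may contain quote characters. A minimal sketch of that step with bound parameters follows. It uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server; the trimmed table, the function name `insert_record`, and the record itself are made up for illustration. Note that `sqlite3` uses `?` as its placeholder, whereas PyMySQL uses `%s`.

```python
import sqlite3

# Columns a record may supply; anything else is rejected before it reaches SQL.
COLUMNS = ("article_mdate", "article_key", "author", "title", "year", "ee")


def insert_record(conn, record):
    """Insert a dict of column -> value; columns not supplied stay NULL."""
    unknown = set(record) - set(COLUMNS)
    if unknown:
        raise ValueError("unknown columns: %s" % sorted(unknown))
    cols = list(record)
    sql = "INSERT INTO dblp ({}) VALUES ({})".format(
        ", ".join(cols), ", ".join("?" for _ in cols))
    conn.execute(sql, [record[c] for c in cols])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dblp (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             + ", ".join(c + " TEXT" for c in COLUMNS) + ")")

# A made-up record; the apostrophe needs no manual escaping.
insert_record(conn, {
    "article_key": "journals/x/Example00",
    "author": "Mr.A,Mr.O'B",
    "title": "A made-up title",
    "year": "2000",
})
conn.commit()
row = conn.execute("SELECT author, title, ee FROM dblp").fetchone()
print(row)
```

Only the column list is formatted into the statement text, and it is checked against a fixed whitelist first; the values never become part of the SQL string.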

The original code, found online

    # -*- coding: utf-8 -*-
    """
    Parses dblp.xml and stores the result in dblp_result.txt
    @author: Administrator
    """
    import xml.sax

    # Separator written after every field of a record.
    SEP = ';,;'

    # Child elements of <article> that are written to the result file.
    FIELDS = ("author", "title", "pages", "year", "volume",
              "journal", "number", "url", "ee")


    class MovieHandler(xml.sax.ContentHandler):
        """Writes one line per <article>, each field followed by SEP."""

        def __init__(self, out):
            super().__init__()
            self.out = out      # open text file that receives the result
            self.res = None     # field values of the current <article>, or None
            self.field = None   # name of the child element being read, or None
            self.buffer = []    # text chunks of the current child element

        # Start-of-element event
        def startElement(self, tag, attributes):
            if tag == "article":
                # mdate and key are attributes of the <article> tag itself
                self.res = [attributes.get("mdate", ""), attributes.get("key", "")]
            elif self.res is not None and self.field is None and tag in FIELDS:
                self.field = tag
                self.buffer = []

        # Content event; text may arrive in several chunks per element
        def characters(self, content):
            if self.field is not None:
                self.buffer.append(content)

        # End-of-element event
        def endElement(self, tag):
            if self.res is None:
                return
            if tag == self.field:
                # repeated fields are simply appended one after another
                self.res.append("".join(self.buffer))
                self.field = None
            elif tag == "article":
                self.out.write("".join(value + SEP for value in self.res) + '\n')
                self.res = None


    # When this file is run directly, __name__ is "__main__".
    if __name__ == "__main__":
        parser = xml.sax.make_parser()
        parser.setFeature(xml.sax.handler.feature_namespaces, 0)
        # Since Python 3.7.1 external entities are off by default; enabling
        # them lets the parser read dblp.dtd (assumed to sit next to dblp.xml).
        parser.setFeature(xml.sax.handler.feature_external_ges, True)
        with open('I:\\ABC000000000000\\Dblp\\simple\\dblp_result.txt',
                  'w', encoding='utf-8') as ww:
            parser.setContentHandler(MovieHandler(ww))
            parser.parse("I:\\ABC000000000000\\Dblp\\simple\\dblp.xml")
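
Two SAX details matter for both listings: the parser may report the text of a single element through several `characters()` calls, and a record may repeat a tag such as `<author>`. The sketch below isolates just those two points on a made-up in-memory record; the class name `ArticleCollector` and the sample data are assumptions for illustration.

```python
import xml.sax

# A made-up record with two authors and inline markup inside the title.
SAMPLE = b"""<dblp>
<article mdate="2017-01-01" key="journals/x/Example00">
<author>Mr.A</author>
<author>Mr.B</author>
<title>A made-up title with <i>inline</i> markup.</title>
<year>2000</year>
</article>
</dblp>"""


class ArticleCollector(xml.sax.ContentHandler):
    """Collects each <article> as a dict whose values are lists of strings."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self.current = None   # dict for the open <article>, else None
        self.field = None     # direct child of <article> being read
        self.chunks = []      # text pieces of that child

    def startElement(self, name, attrs):
        if name == "article":
            self.current = {"key": [attrs.get("key")]}
        elif self.current is not None and self.field is None:
            self.field, self.chunks = name, []

    def characters(self, content):
        # May be called several times for one element, so accumulate.
        if self.field is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if self.current is None:
            return
        if name == self.field:
            self.current.setdefault(name, []).append("".join(self.chunks))
            self.field = None
        elif name == "article":
            self.articles.append(self.current)
            self.current = None


handler = ArticleCollector()
xml.sax.parseString(SAMPLE, handler)
print(handler.articles[0]["author"])   # ['Mr.A', 'Mr.B']
print(handler.articles[0]["title"])
```

Because nested tags such as `<i>` do not reset the current field, their text is folded into the enclosing title instead of cutting it short.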

Using the DBLP data

(To be continued)
