@tenlee
2015-08-04T08:27:35.000000Z
Python
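Liao Xuefeng's Python tutorial is very good, and the Python 3 edition came out recently (see Liao Xuefeng's Python 3 tutorial). However, my school cuts off the network at midnight, so I needed an offline copy.
Code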
#!/usr/bin/env python3
# coding: utf-8
import re
import fileinput
from urllib import request


def getAddress():
    # Page from which the chapter links are collected
    address = "http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000"
    ftarge = "22.txt"   # extracted relative chapter URLs, one per line
    fsource = "23.txt"  # raw copy of the downloaded page
    i = 0
    with request.urlopen(address) as f:
        sdata = f.read()
        data = sdata.decode("utf-8")
    with open(fsource, 'w', encoding='utf-8') as fs:
        fs.write(data)
    with open(fsource, 'r', encoding='utf-8') as fs:
        for line in fs.readlines():
            i = i + 1
            # Only lines 366 to 849 of the saved page are scanned for chapter links
            if (i < 366 or i > 849):
                continue
            temp = re.search(r'/wiki.+00', line, re.S)
            if temp:
                with open(ftarge, 'a') as ft:
                    ft.write(temp.group(0) + '\n')


getAddress()
fileaddress = input("please input the save path(absolute address) ")
num = 1
addressfile = "22.txt"
for line in fileinput.input(addressfile):
    # Turn each relative path into a full chapter URL
    line = "http://www.liaoxuefeng.com" + line.strip('\n') + '/'
    with request.urlopen(line) as f:
        data = f.read()
    temp = data.decode('utf-8')
    # The article body runs from the first <h4> up to the comments anchor
    mainbody = re.search(r'<h4>.+<div class="x-anchor"><a name="comments"></a></div>', temp, re.S)
    ptitle = re.search(r'<title>(.+)</title>', temp)
    title = ptitle.group(1)
    # Strip characters that are awkward in file names
    title = title.replace(" ", "")
    title = title.replace("/", "")
    filename = fileaddress + str(num) + '_' + title + '.html'
    num = num + 1
    print(filename + " is writing...")
    head = '<meta charset="UTF-8">'
    # Write as UTF-8 so the file matches the charset declared in head
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(head + '\n' + mainbody.group(0))
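I first fetch the relative URL of every tutorial chapter from a single page, match the URLs with a regular expression, and save them to 22.txt. The script then reads the URLs back from that file to download each article, uses a regular expression to extract the article's main body, and saves it as a corresponding HTML file.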
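The link-collection step described above can also be sketched without the intermediate 23.txt file or a hard-coded line range. This is only a minimal sketch: the function name `extract_chapter_paths`, the pattern, and the sample HTML are illustrative assumptions, and it assumes chapter links appear as `href="/wiki/..."` attributes, which may not match the real page.

```python
import re

# Assumption: chapter links appear as href="/wiki/..." attributes in the page.
WIKI_HREF = re.compile(r'href="(/wiki/[^"]+)"')


def extract_chapter_paths(html):
    """Return the relative /wiki/... paths in html, in page order, without duplicates."""
    seen = set()
    paths = []
    for path in WIKI_HREF.findall(html):
        if path not in seen:
            seen.add(path)
            paths.append(path)
    return paths


# Hypothetical sample markup, used only to exercise the function offline.
sample = (
    '<ul>'
    '<li><a href="/wiki/abc000/def000">Intro</a></li>'
    '<li><a href="/wiki/abc000/123000">Install</a></li>'
    '<li><a href="/wiki/abc000/def000">Intro again</a></li>'
    '<li><a href="/about">About</a></li>'
    '</ul>'
)
print(extract_chapter_paths(sample))
```

Matching on the attribute rather than on line numbers means the extraction would probably survive small layout changes in the page, at the cost of also picking up any non-chapter links that start with /wiki/.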
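You need to type in the absolute path of the directory where the pages should be saved. The script appends the file name directly to whatever you enter, so the path should end with a path separator.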
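Because the file name is glued directly onto the entered path, a missing trailing separator changes where the files land. One possible way around that, shown here only as a sketch, is to build the name with `os.path.join`. The helper `build_filename` and its character filter are hypothetical and not part of the original script.

```python
import os
import re


def build_filename(save_dir, num, title):
    """Return save_dir joined with '<num>_<title>.html'.

    Whitespace and characters that are commonly invalid in file names
    are removed from the title first.
    """
    safe_title = re.sub(r'[\s\\/:*?"<>|]+', '', title)
    return os.path.join(save_dir, "{}_{}.html".format(num, safe_title))


# Works the same whether or not save_dir ends with a separator.
print(build_filename("offline", 3, "Python 基础 / 介绍"))
```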
Below is a Python script that scrapes CAPTCHA images from 12306.
#!/usr/bin/env python3
# coding: utf-8
import time
import requests


def download_file(url, local_filename):
    # Stream the response so the image is written chunk by chunk
    r = requests.get(url, stream=True, verify=False)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    return local_filename


savePath = input("please input the save path(absolute address)")
url = 'https://kyfw.12306.cn/otn/passcodeNew/getPassCodeNew?module=login&rand=sjrand&0.7923694306518883'
for i in range(0x7fffffff):
    print(download_file(url, savePath + str(i + 1) + '.jpg'))
    time.sleep(1)  # sleep for one second
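The loop above runs for up to 0x7fffffff iterations, so in practice it only stops when interrupted. A bounded variant might look like the sketch below. The function `collect_captchas` and its parameters are hypothetical. It takes the downloader as an argument, so a function like `download_file` from the script can be passed in, and it assumes that function returns the path it wrote.

```python
import os
import time


def collect_captchas(download, url, save_dir, count, delay=1.0):
    """Call download(url, path) count times and return the saved paths.

    download is any callable taking (url, local_filename) and returning
    the file name. Files are named 1.jpg, 2.jpg, and so on, and the
    function pauses delay seconds between requests.
    """
    saved = []
    for i in range(1, count + 1):
        path = os.path.join(save_dir, "{}.jpg".format(i))
        saved.append(download(url, path))
        if i < count:
            time.sleep(delay)  # be polite to the server between requests
    return saved
```

With the script's own names, a call could be `collect_captchas(download_file, url, savePath, 100)` to fetch one hundred images and then stop.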