@weakiwi
2015-11-04T07:49:00.000000Z
Tag: web scraping
The main trigger was that I blew through my data allowance, and with matches packed so densely lately, I decided to build a small scraper that pushes the day's match scores and reports. My first target was Dongqiudi (懂球帝), but its pages turned out to be dynamically rendered. Eventually I found NetEase Sports, a wonderfully scraper-friendly site. With the site settled, we can start analyzing.
```html
<tr>
  <td>11</td><td>02:45</td><td>完场</td>
  <!-- home team -->
  <td><span class="c1"><a href="/58/team/7431.html">马尔默</a><img src="http://imgsize.ph.126.net/?imgurl=http://goal.sports.163.com/teamlogo/7431.png_20x20x1.jpg" width="20" height="20" onerror="this.src='http://img1.cache.netease.com/sports/2009/goal/logo/default_team_20.gif'" /></span></td>
  <!-- score -->
  <td><span class="c3"><a href="/58/match/stat/2015/1574739.html" target="_blank">1-0</a></span></td>
  <!-- away team -->
  <td><span class="c2"><img src="http://imgsize.ph.126.net/?imgurl=http://goal.sports.163.com/teamlogo/6457.png_20x20x1.jpg" width="20" height="20" onerror="this.src='http://img1.cache.netease.com/sports/2009/goal/logo/default_team_20.gif'" /><a href="/58/team/6457.html">顿涅茨克矿工</a></span></td>
  <td> </td>
  <td class="bg2 bg7"><a href="/58/match/stat/2015/1574739.html" target="_blank">统计</a> | <span class="cur_hand" id="check_1574739_58_2015" style="cursor: pointer;" >查看详细</span> <img src="http://img1.cache.netease.com/sports/2009/goal/slbg33.gif" width="5" height="13" /> |<a href="http://caipiao.163.com/order/preBet_jczqspfp.html&&t=2325#from=sj1" target="_blank">投注</a></td>
</tr>
```
From this snippet it's not hard to see that the span with class='c1' holds the home team as its text, class='c2' holds the away team, and class='c3' holds the score. Now we can turn that into code!
```python
# -*- coding: utf-8 -*-
# Python 2 code; requires BeautifulSoup (bs4)
import urllib2
from bs4 import BeautifulSoup

def get_bifeng(mytime):
    # date = '20151101'
    date = mytime
    goal_url = 'http://goal.sports.163.com/schedule/' + date + '.html'  # build the URL
    response = urllib2.urlopen(goal_url)
    page = response.read()
    soup = BeautifulSoup(page)  # build the BeautifulSoup object
    tag_zhudui = soup.find_all('span', 'c1')  # home teams
    tag_kedui = soup.find_all('span', 'c2')   # away teams
    tag_bifeng = soup.find_all('span', 'c3')  # scores
    temp_bifeng = []
    bifeng = ' '
```
This code easily yields the lists of home teams, away teams, and scores. To merge the three lists and format the output, I use the simple approach of merging them two at a time, as follows:
```python
    for (i, j) in zip(tag_zhudui, tag_bifeng):
        temp_bifeng.append(i.get_text().encode('utf-8') + ' ' + j.get_text().encode('utf-8') + ' ')
    for (i, j) in zip(tag_kedui, temp_bifeng):
        bifeng += (j + ' ' + i.get_text().encode('utf-8') + '\n')
    return bifeng
```
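Incidentally, `zip` accepts any number of iterables, so the two pairwise passes above could collapse into a single three-way zip. A minimal Python 3 sketch, using made-up lists of already-extracted text in place of the bs4 tags:

```python
# Hypothetical, already-extracted text values standing in for the bs4 tag lists.
home = ['马尔默', '巴黎圣日耳曼']
score = ['1-0', '0-0']
away = ['顿涅茨克矿工', '皇家马德里']

# One pass over all three lists at once, instead of two pairwise merges.
lines = ['%s %s %s' % (h, s, a) for h, s, a in zip(home, score, away)]
```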
This function has one flaw: occasionally a match on the given day hasn't finished yet, so there is a home team and an away team but no corresponding score, which throws the pairing off. I'll fix that bug later.
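One way to avoid that pairing bug is to parse row by row, so a match with no score simply gets skipped instead of shifting every later score by one. A stdlib-only sketch of the idea, using `re` on a simplified, made-up row structure rather than the bs4 code above:

```python
# -*- coding: utf-8 -*-
import re

# Simplified sample: second row is an unfinished match (no c3 span).
SAMPLE = """
<tr><td><span class="c1"><a href="#">Malmo</a></span></td>
<td><span class="c3"><a href="#">1-0</a></span></td>
<td><span class="c2"><a href="#">Shakhtar</a></span></td></tr>
<tr><td><span class="c1"><a href="#">PSG</a></span></td>
<td><span class="c2"><a href="#">Real Madrid</a></span></td></tr>
"""

def pair_scores(page):
    """Parse row by row so an unfinished match is skipped, not mispaired."""
    results = []
    for row in re.findall(r'<tr>.*?</tr>', page, re.S):
        def grab(cls):
            # Pull the anchor text that follows the given class attribute.
            m = re.search(r'class="%s">.*?<a[^>]*>([^<]+)</a>' % cls, row, re.S)
            return m.group(1) if m else None
        home, score, away = grab('c1'), grab('c3'), grab('c2')
        if home and score and away:
            results.append('%s %s %s' % (home, score, away))
    return results
```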
Next comes fetching the match reports. As before, let's start by analyzing the page source:
```html
<tr>
  <td>11</td><td>02:45</td><td>完场</td>
  <td><span class="c1"><a href="/58/team/6409.html">巴黎圣日耳曼</a><img src="http://imgsize.ph.126.net/?imgurl=http://goal.sports.163.com/teamlogo/6409.png_20x20x1.jpg" width="20" height="20" onerror="this.src='http://img1.cache.netease.com/sports/2009/goal/logo/default_team_20.gif'" /></span></td>
  <td><span class="c3"><a href="/58/match/stat/2015/1574752.html" target="_blank">0-0</a></span></td>
  <td><span class="c2"><img src="http://imgsize.ph.126.net/?imgurl=http://goal.sports.163.com/teamlogo/6171.png_20x20x1.jpg" width="20" height="20" onerror="this.src='http://img1.cache.netease.com/sports/2009/goal/logo/default_team_20.gif'" /><a href="/58/team/6171.html">皇家马德里</a></span></td>
  <td> </td>
  <!-- the match-report link lives in this cell! -->
  <td class="bg2 bg7"><a href="/58/match/report/2015/1574752.html" target="_blank">战报</a> | <span class="cur_hand" id="check_1574752_58_2015" style="cursor: pointer;" >查看详细</span> <img src="http://img1.cache.netease.com/sports/2009/goal/slbg33.gif" width="5" height="13" /> |
```
It's easy to see that the report links sit in the td elements with class='bg2 bg7', which leads to this code:
```python
def get_zhanbao(mytime):
    # date = '20151101'
    date = mytime
    goal_url = 'http://goal.sports.163.com/schedule/' + date + '.html'  # build the URL
    response = urllib2.urlopen(goal_url)
    page = response.read()
    soup = BeautifulSoup(page)  # build the BeautifulSoup object
    tag = soup.find_all(class_='bg2 bg7')  # find the matching cells
    zhanbao = []
    mail_content = ''
    for i in tag:
        # Decide whether the cell's keyword is '统计' (stats) or '战报' (report);
        # note that in UTF-8 each Chinese character is 3 bytes
        if i.get_text().encode('utf-8')[7:13] == '战报':
            # store every report hyperlink
            zhanbao.append('http://goal.sports.163.com' + i.find('a').get('href'))
    # print zhanbao
    for i in zhanbao:
        response = urllib2.urlopen(i)
        page = response.read()
        soup = BeautifulSoup(page)
        # on a report page, the <b> (bold) tags mark the goals -- exactly what I need
        tag = soup.find_all('b')
        if not tag:  # find_all returns a list, so test for emptiness (the original compared it to '')
            continue
        for i in tag:
            mail_content += i.get_text().encode('utf-8') + '|'  # build the mail body
            print i
    # print mail_content
    return mail_content
```
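A note on that byte-slice keyword check: the code above is Python 2 and compares raw UTF-8 bytes, where each Chinese character occupies 3 bytes, so '战报' spans exactly the 6 bytes that the [7:13] slice counts on. Hard-coded byte offsets are fragile, though; a substring test on decoded text does the same job. A tiny Python 3 sketch:

```python
# -*- coding: utf-8 -*-
# Each CJK character encodes to 3 bytes in UTF-8, so '战报' is 6 bytes long.
report_bytes = '战报'.encode('utf-8')

# A substring test on the decoded string avoids hard-coded byte offsets entirely.
cell_text = '战报 | 查看详细'
has_report = '战报' in cell_text
```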
Then write the driver part, add the script to crontab, and you're done. As for the email, there are plenty of write-ups online on sending mail from Python, so I won't belabor that here.
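For completeness, here is one minimal way to do the mail step with the standard library's smtplib and email modules; the host, account, and addresses are placeholders to fill in yourself, not anything from the post:

```python
import smtplib
from email.mime.text import MIMEText
from email.header import Header

def build_mail(content, from_addr, to_addr):
    # Wrap the scraped text in a UTF-8 plain-text message.
    msg = MIMEText(content, 'plain', 'utf-8')
    msg['Subject'] = Header("Today's match results", 'utf-8')
    msg['From'] = from_addr
    msg['To'] = to_addr
    return msg

def send_mail(msg, smtp_host, user, password):
    # Placeholder SMTP details; swap in your provider's host and credentials.
    server = smtplib.SMTP(smtp_host, 25)
    server.login(user, password)
    server.sendmail(msg['From'], [msg['To']], msg.as_string())
    server.quit()
```

A crontab entry along the lines of `0 8 * * * python /path/to/scraper.py` (path hypothetical) would then send the digest every morning.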
Room for improvement: