@zhengyuhong
2016-03-19T08:55:48.000000Z
python BeautifulSoup
import bs4
html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>"""
soup = bs4.BeautifulSoup(html_doc)
Beautiful Soup picks the most suitable parser available to parse the document. If you name a parser explicitly, Beautiful Soup uses that parser instead.
soup = bs4.BeautifulSoup(html_doc, 'html.parser', from_encoding='utf8')
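For reference, parsing a document as XML:
markup = """<?xml version="1.0" encoding="UTF-8"?><recipe type="dessert"><recipename cuisine="american" servings="1">Ice Cream Sundae</recipename><preptime>5 minutes</preptime></recipe>"""
soup = bs4.BeautifulSoup(markup, "xml")
The parser argument can name either the type of document to parse (currently "html", "xml", and "html5" are supported) or the parser library to use (currently "lxml", "html5lib", and "html.parser").
Every HTML or XML document has its own encoding, such as ASCII or UTF-8, but once Beautiful Soup has parsed it the document is converted to Unicode: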
markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  # UTF-8 encoded bytes
soup = bs4.BeautifulSoup(markup)
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# u'Sacr\xe9 bleu!'
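When the encoding of the input bytes is already known, it can be stated instead of detected. The following is a minimal sketch, assuming the from_encoding argument used earlier in this post and the original_encoding attribute of the soup object; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

raw = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  # UTF-8 encoded bytes

# from_encoding tells Beautiful Soup how to decode the bytes,
# so no guessing is involved.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.h1.string                  # already a Unicode string
print(repr(text))
print(soup.original_encoding)          # the encoding the bytes were decoded with
```

Note that from_encoding only matters for byte input; a Unicode string needs no decoding.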
When Beautiful Soup writes a document out, the output is UTF-8 regardless of the encoding of the input document. In the example below the input document is Latin-1 encoded:
markup = b'''<html><head><meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" /></head><body><p>Sacr\xe9 bleu!</p></body></html>'''
soup = bs4.BeautifulSoup(markup)
print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>
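If UTF-8 output is not what you want, a different encoding can be requested. This is a small sketch, assuming the encode() method of a tag, which takes the target encoding as its first argument; the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Passing a Unicode string sidesteps encoding detection entirely.
soup = BeautifulSoup(u"<p>Sacr\u00e9 bleu!</p>", "html.parser")

utf8_bytes = soup.p.encode("utf-8")      # the default output encoding
latin1_bytes = soup.p.encode("latin-1")  # an explicit output encoding

print(repr(utf8_bytes))
print(repr(latin1_bytes))
```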
soup = bs4.BeautifulSoup(html_doc)
html = soup.html  # get the <html> tag
title = soup.title
print(title)
# <title>The Dormouse's story</title>
a = soup.a
print(a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

tag = soup.title
print(tag.name)  # the tag name is handy when traversing the document tree
# title
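tag = soup.a
attrs = tag.attrs
print(type(attrs))
# <type 'dict'>
print(attrs)
# {u'href': u'http://example.com/elsie', u'class': [u'sister'], u'id': u'link1'}
The attribute names and values of a tag can then be manipulated directly with Python dictionary methods: entries can be added, changed, or removed.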
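The dictionary-style operations mentioned above can be sketched as follows. The markup is a single link taken from the sample document, and the new attribute values ("first-link", "A sister") are invented for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
tag = soup.a

tag["id"] = "first-link"      # change an existing attribute
tag["title"] = "A sister"     # add a new attribute
del tag["class"]              # remove an attribute

print(tag.get("href"))        # dict-style lookup; returns None if missing
print(sorted(tag.attrs))      # names of the remaining attributes
```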
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

soup.find_all('b')
# [<b>The Dormouse's story</b>]
If you pass in a byte string, Beautiful Soup assumes it is UTF-8 encoded. Pass a Unicode string instead to keep Beautiful Soup from decoding it incorrectly.
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

def has_class_but_no_id(tag):
    return 'class' in tag.attrs and 'id' not in tag.attrs
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
A lambda expression or any other callable that takes a tag works as well.
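As a rough illustration, the has_class_but_no_id filter can be written inline as a lambda. The shortened markup below is an assumption made so the snippet stands alone.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a class="sister" id="link1">Elsie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Same test as has_class_but_no_id, written inline.
matches = soup.find_all(lambda tag: tag.has_attr("class") and not tag.has_attr("id"))
print([tag.name for tag in matches])
```

Only the two p tags qualify here: b has no class attribute and a has an id.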
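soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Attribute names such as data-foo are not valid keyword arguments, so pass them through attrs
data_soup = bs4.BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]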
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
import re
soup.find(text=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were'
soup.find_all("p", "title")# [<p class="title"><b>The Dormouse's story</b></p>]
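In the call above, the second positional argument is matched against the CSS class. A brief sketch of the equivalent keyword form follows, assuming the class_ keyword of find_all (spelled with a trailing underscore because class is a reserved word in Python); the markup is a shortened stand-in for the sample document.

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>The Dormouse\'s story</b></p><p class="story">...</p>'
soup = BeautifulSoup(html, "html.parser")

positional = soup.find_all("p", "title")      # second argument matches the class
keyword = soup.find_all("p", class_="title")  # the same search, spelled out
print(positional == keyword)
```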
If you only want the text inside a tag, call the get_text() method. It collects all the text in the tag, including text in descendant tags, and returns the result as a single Unicode string:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = bs4.BeautifulSoup(markup)
soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'

# Get all the text beneath an element, here the whole html_doc document
soup = bs4.BeautifulSoup(html_doc)
print(soup.get_text())
# The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well....
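As the last output shows, get_text() runs the pieces of text together. One possible remedy, sketched here under the assumption that get_text() accepts a separator and a strip flag and that the soup exposes stripped_strings, is to join or list the pieces explicitly:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Join the text pieces with "|" and strip whitespace from each piece.
joined = soup.get_text("|", strip=True)

# Or collect the stripped pieces yourself.
pieces = list(soup.stripped_strings)

print(joined)
print(pieces)
```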
A tag's .contents attribute returns the tag's child nodes as a list, while .children iterates over the same nodes:
print(type(soup.html.children))
print(type(soup.html.contents))
# <type 'listiterator'>
# <type 'list'>
print([e for e in soup.html.children] == soup.html.contents)
# True
for tag in soup.html.children:
    print(tag.name)
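Both .contents and .children stop at direct children. To make the difference from a full traversal concrete, here is a small sketch that also uses .descendants; the tiny document and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><p>One</p></body></html>",
    "html.parser",
)

# Direct children of <html> only.
direct = [child.name for child in soup.html.children]

# Every tag nested anywhere below <html>; text nodes are skipped.
nested = [node.name for node in soup.html.descendants if isinstance(node, Tag)]

print(direct)
print(nested)
```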
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]

soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
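The selectors above cover descendant and child relationships. A few more selector forms are sketched below, assuming class and attribute selectors are supported by select() and that select_one() returns the first match or None; the markup is the link paragraph from the sample document, trimmed so the snippet stands alone.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.sister")          # class selector
by_attr = soup.select('a[href$="tillie"]')  # attribute value ends with "tillie"
first = soup.select_one("p > a")            # first match only
missing = soup.select_one("div")            # None when nothing matches

print([a["id"] for a in by_class])
print([a["id"] for a in by_attr])
print(first["id"])
```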