@zhengyuhong 2016-03-19T08:55:48.000000Z

BeautifulSoup

`python` `BeautifulSoup`

Parsing a document into a BeautifulSoup object

import bs4

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = bs4.BeautifulSoup(html_doc)

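When no parser is named, BeautifulSoup picks the most suitable parser available to parse the document. If you name a parser explicitly, BeautifulSoup uses that parser instead.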

soup = bs4.BeautifulSoup(html_doc, 'html.parser', from_encoding='utf8')
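Which parsers are available depends on what is installed. As a minimal sketch, assuming you prefer lxml but want to fall back to the built-in parser, the hypothetical helper `make_soup` below (not part of bs4) tries each parser name in order and relies on bs4 raising `FeatureNotFound` when a parser is missing:

```python
import bs4


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return a soup built with the first installed parser."""
    for parser in preferred:
        try:
            return bs4.BeautifulSoup(markup, parser)
        except bs4.FeatureNotFound:
            # This parser is not installed; try the next name.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
# Hello
```

Naming the parser explicitly keeps results the same across machines, since different parsers can build different trees from the same broken markup.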

For reference, parsing as XML

# The XML declaration must be the very first thing in the string.
markup = """<?xml version="1.0" encoding="UTF-8"?>
<recipe type="dessert">
    <recipename cuisine="american" servings="1">Ice Cream Sundae</recipename>
    <preptime>5 minutes</preptime>
</recipe>
"""
soup = bs4.BeautifulSoup(markup, "xml")

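The second argument can name the type of document to parse (currently "html", "xml", and "html5") or the parser to use (currently "lxml", "html5lib", and "html.parser").

Every HTML or XML document has its own encoding, such as ASCII or UTF-8, but once Beautiful Soup has parsed it, the document is converted to Unicode: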

markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  # UTF-8 encoded bytes
soup = bs4.BeautifulSoup(markup)
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# 'Sacré bleu!'
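To make the bytes-in, Unicode-out behaviour concrete, here is a small sketch. It assumes the input is known to be UTF-8 and says so through `from_encoding`, so no guessing is involved; `original_encoding` then reports which encoding Beautiful Soup used to decode the input.

```python
from bs4 import BeautifulSoup

raw = "<h1>Sacré bleu!</h1>".encode("utf-8")  # bytes, as read from a file or socket
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.h1.string
print(isinstance(raw, bytes), isinstance(text, str))
# True True
print(soup.original_encoding)  # the encoding used to decode the input
```

Without `from_encoding`, Beautiful Soup has to detect the encoding itself, and on short inputs the guess can be wrong.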

More documentation

Outputting a document

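When Beautiful Soup outputs a document, the output is encoded as UTF-8 regardless of the encoding of the input document. In the example below the input document is Latin-1 encoded: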

markup = b'''
<html>
  <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
  </head>
  <body>
    <p>Sacr\xe9 bleu!</p>
  </body>
</html>
'''
soup = bs4.BeautifulSoup(markup)
print(soup.prettify())
# <html>
#  <head>
#   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
#  </head>
#  <body>
#   <p>
#    Sacré bleu!
#   </p>
#  </body>
# </html>
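UTF-8 is the default rather than the only choice. As a short sketch, assuming you need bytes in a particular encoding, the `encode()` method of a tag accepts an encoding name:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacré bleu!</p>", "html.parser")

utf8_bytes = soup.p.encode("utf-8")      # é becomes the two bytes \xc3\xa9
latin1_bytes = soup.p.encode("latin-1")  # é becomes the single byte \xe9

print(utf8_bytes)
# b'<p>Sacr\xc3\xa9 bleu!</p>'
print(latin1_bytes)
# b'<p>Sacr\xe9 bleu!</p>'
```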

Extracting structured data

soup = bs4.BeautifulSoup(html_doc)

Getting a tag by its name

html = soup.html
# get the <html> tag
title = soup.title
print(title)
# <title>The Dormouse's story</title>
a = soup.a
print(a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Getting a tag's name

tag = soup.title
print(tag.name)
# title
# the name comes in handy when traversing the document tree

Getting a tag's attribute names and values

tag = soup.a
attrs = tag.attrs
print(type(attrs))
# <class 'dict'>
print(attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

You can then use ordinary Python dict operations on the tag's attribute names and values to add, change, or delete them.
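A minimal sketch of those dict-style operations, using a one-tag document so it runs on its own (the `title` attribute and its value are made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
tag = soup.a

tag["id"] = "first-link"       # change an existing attribute
tag["title"] = "Elsie's page"  # add a new attribute (illustrative value)
del tag["class"]               # delete an attribute

print(tag.attrs)
# {'href': 'http://example.com/elsie', 'id': 'first-link', 'title': "Elsie's page"}
print(tag.get("class"))
# None
```

Using `tag.get(name)` instead of `tag[name]` avoids a `KeyError` when the attribute is absent.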

Getting the parent node

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>
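`.parent` goes up one level. To walk all the way up, a tag also offers `.parents`, which iterates over every ancestor. A small sketch on a trimmed-down document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

# Collect the name of every ancestor of <title>, nearest first.
names = [parent.name for parent in soup.title.parents]
print(names)
# ['head', 'html', '[document]']
```

The last entry is the BeautifulSoup object itself, which sits at the root of the tree and reports its name as `[document]`.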

find_all

Searching by tag name with a string

soup.find_all('b')
# [<b>The Dormouse's story</b>]

If you pass a byte string as the argument, Beautiful Soup assumes it is UTF-8 encoded. Pass a Unicode string instead to avoid decoding errors.

Searching by tag name with a list

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Searching by tag name with a regular expression

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

Searching with a function (the most powerful option)

def has_class_but_no_id(tag):
    return 'class' in tag.attrs and 'id' not in tag.attrs
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

Any callable works here, including a lambda expression.
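As a sketch of the lambda form, the filter below keeps only `<a>` tags whose `href` ends in "lacie". The fragment is a cut-down copy of the sample document so the block runs on its own:

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# The callable receives each tag and returns True to keep it.
matches = soup.find_all(
    lambda tag: tag.name == "a" and tag.get("href", "").endswith("lacie")
)
print([tag["id"] for tag in matches])
# ['link2']
```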

Searching by attribute value with a string

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
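# data-* attribute names are not valid keyword arguments, so pass them through attrs
data_soup = bs4.BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(attrs={"data-foo": "value"})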
# [<div data-foo="value">foo!</div>]

Searching by attribute value with a regular expression

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Searching text with a regular expression

import re
soup.find(text=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

Combining tag name, attribute value, and text filters


soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
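The filters can be stacked further. The sketch below combines a tag name, a class, a regular expression on `href`, and the tag's text in one call. It uses the `string` argument, which is the newer spelling of `text` (available from Beautiful Soup 4.4.0), and `class_` because `class` is a reserved word in Python:

```python
import re

from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# All conditions must hold for a tag to match.
matches = soup.find_all(
    "a", class_="sister", href=re.compile("example"), string="Lacie"
)
print([tag["id"] for tag in matches])
# ['link2']
```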

get_text

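If you only want the text inside a tag, call the get_text() method. It collects all the text the tag contains, including text inside descendant tags, and returns the result as a single Unicode string: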

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = bs4.BeautifulSoup(markup)
soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'

# get all the text beneath an element, here the whole sample document
soup = bs4.BeautifulSoup(html_doc)
print(soup.get_text())
# output (blank lines omitted):
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...

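The raw result of get_text() keeps the whitespace of the source. As a sketch of tidier output, get_text() also accepts a separator string and a `strip` flag, and `.stripped_strings` yields the same pieces one at a time:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Join the text pieces with "|" and strip whitespace around each piece.
joined = soup.get_text("|", strip=True)
print(joined)
# I linked to|example.com

# The same pieces as a list, for further processing.
pieces = list(soup.stripped_strings)
print(pieces)
# ['I linked to', 'example.com']
```
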
Traversing the document tree

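A tag's .contents attribute returns its child nodes as a list, and .children iterates over the same nodes:
print(type(soup.html.children))
print(type(soup.html.contents))
# <class 'list_iterator'>  (the exact iterator type may differ between bs4 versions)
# <class 'list'>
print(list(soup.html.children) == soup.html.contents)
# True
for tag in soup.html.children:
    print(tag.name)  # None for text nodes, which have no tag name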
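Both .contents and .children stop at direct children. To reach deeper levels, a tag also provides `.descendants`. The sketch below contrasts the two on a single paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser"
)
p = soup.p

children = list(p.children)        # direct children only: the <b> tag
descendants = list(p.descendants)  # the <b> tag, then the text inside it

print(len(children), len(descendants))
# 1 2
print(descendants[1])
# The Dormouse's story
```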

CSS selectors

Searching level by level through nested tags

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]

Finding the direct children of a tag

soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
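Selectors can also match on class and attribute values, and `select_one()` returns only the first match instead of a list. A short sketch on a cut-down copy of the sample document:

```python
from bs4 import BeautifulSoup

html = """<body><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p></body>"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("a.sister")            # every <a> with class "sister"
first = soup.select_one("#link2")             # a single tag, or None if no match
by_attr = soup.select('a[href$="tillie"]')    # href ending in "tillie"

print(len(by_class), first["href"], by_attr[0]["id"])
# 3 http://example.com/lacie link3
```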