@zhengyuhong 2016-03-19T08:55:48.000000Z

BeautifulSoup

Python BeautifulSoup notes

Parsing a document into a BeautifulSoup object

    import bs4
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = bs4.BeautifulSoup(html_doc)

Beautiful Soup picks the most suitable available parser to parse the document; if you specify a parser explicitly, it uses the one you specified.

    soup = bs4.BeautifulSoup(html_doc, 'html.parser', from_encoding='utf8')
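Pinning a parser matters because different parsers can build different trees from the same invalid markup. A minimal sketch using only the stdlib-backed "html.parser" ("lxml" and "html5lib" would need to be installed separately):

```python
import bs4

# The same broken fragment can come out differently under different parsers.
# "html.parser" simply closes the unclosed <b /> for us.
fragment = "<a><b /></a>"
tree = bs4.BeautifulSoup(fragment, "html.parser")
print(tree)  # <a><b></b></a>
```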

For reference: parsing as XML

    markup = """
    <?xml version="1.0" encoding="UTF-8"?>
    <recipe type="dessert">
    <recipename cuisine="american" servings="1">Ice Cream Sundae</recipename>
    <preptime>5 minutes</preptime>
    </recipe>
    """
    soup = bs4.BeautifulSoup(markup, "xml")

The first choice is what type of document to parse: currently "html", "xml", and "html5" are supported. The second is which parser to use: currently "lxml", "html5lib", and "html.parser" are supported. Any HTML or XML document has its own encoding, such as ASCII or UTF-8, but after Beautiful Soup parses it, the document has been converted to Unicode:

    markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"
    soup = bs4.BeautifulSoup(markup)
    soup.h1
    # <h1>Sacré bleu!</h1>
    soup.h1.string
    # 'Sacré bleu!'

See the official documentation for more details.

Outputting a document

When Beautiful Soup outputs a document, the output is UTF-8 encoded regardless of the input document's encoding. In the example below the input document is Latin-1 encoded:

    markup = b'''
    <html>
    <head>
    <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" />
    </head>
    <body>
    <p>Sacr\xe9 bleu!</p>
    </body>
    </html>
    '''
    soup = bs4.BeautifulSoup(markup)
    print(soup.prettify())
    # <html>
    #  <head>
    #   <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
    #  </head>
    #  <body>
    #   <p>
    #    Sacré bleu!
    #   </p>
    #  </body>
    # </html>
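The UTF-8 default applies to byte output as well. A small sketch: `str()` and `prettify()` return Unicode text, while `Tag.encode()` returns bytes, UTF-8 unless another encoding is requested:

```python
import bs4

soup = bs4.BeautifulSoup("<p>Sacr\u00e9 bleu!</p>", "html.parser")

# encode() returns bytes; the default encoding is UTF-8.
print(soup.p.encode())           # b'<p>Sacr\xc3\xa9 bleu!</p>'
print(soup.p.encode("latin-1"))  # b'<p>Sacr\xe9 bleu!</p>'
```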

Extracting structured data

    soup = bs4.BeautifulSoup(html_doc)

Getting a tag by tag name

    html = soup.html  # the <html> tag
    title = soup.title
    print(title)
    # <title>The Dormouse's story</title>
    a = soup.a
    print(a)
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Getting a tag's name

    tag = soup.title
    print(tag.name)
    # title -- useful when navigating the document tree

Getting a tag's attribute names and values

    tag = soup.a
    attrs = tag.attrs
    print(type(attrs))
    # <class 'dict'>
    print(attrs)
    # {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

You can then manipulate a tag's attribute names and values with ordinary Python dict operations: add, change, or remove them at will.
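For instance, a short sketch of those dict-style operations (the attribute values written into the tag here are made up for illustration):

```python
import bs4

soup = bs4.BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser")
tag = soup.a

tag["rel"] = "nofollow"   # add a new attribute
tag["id"] = "first-link"  # change an existing one
del tag["class"]          # remove one entirely
print(tag.attrs)
```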

Getting the parent node

    title_tag = soup.title
    title_tag
    # <title>The Dormouse's story</title>
    title_tag.parent
    # <head><title>The Dormouse's story</title></head>

find_all

Searching by tag name with a string

    soup.find_all('b')
    # [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup assumes it is UTF-8 encoded; pass a Unicode string instead to avoid decoding errors.
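A small illustration of the safe path: filtering on non-ASCII text with a Unicode string (bytes passed here would first be decoded as UTF-8):

```python
import bs4

soup = bs4.BeautifulSoup("<p>Sacré bleu!</p>", "html.parser")
# A Unicode string matches directly, with no decoding guesswork:
found = soup.find_all(string="Sacré bleu!")
print(found)  # ['Sacré bleu!']
```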

Searching by tag name with a list

    soup.find_all(["a", "b"])
    # [<b>The Dormouse's story</b>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Searching by tag name with a regular expression

    import re
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    # body
    # b

Searching with a function (the most powerful)

    def has_class_but_no_id(tag):
        return 'class' in tag.attrs and 'id' not in tag.attrs
    soup.find_all(has_class_but_no_id)
    # [<p class="title"><b>The Dormouse's story</b></p>,
    #  <p class="story">Once upon a time there were...</p>,
    #  <p class="story">...</p>]

You can also pass a lambda expression, and so on.
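For example, an inline lambda in place of a named filter function (the markup below is a made-up snippet):

```python
import bs4

html = ('<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>')
soup = bs4.BeautifulSoup(html, "html.parser")

# The lambda receives each Tag, just like a named filter function would.
result = soup.find_all(lambda tag: tag.get("id") == "link2")
print(result)  # [<a class="sister" id="link2">Lacie</a>]
```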

Searching attribute values with a string

    soup.find_all(id="link2")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    # data_soup below was parsed from '<div data-foo="value">foo!</div>'
    data_soup.find_all(attrs={"data-foo": "value"})
    # [<div data-foo="value">foo!</div>]

Searching attribute values with a regular expression

    soup.find_all(href=re.compile("elsie"))
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
    soup.find_all(href=re.compile("elsie"), id='link1')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Searching text with a regular expression

    import re
    soup.find(text=re.compile("sisters"))
    # 'Once upon a time there were three little sisters; and their names were\n'
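Related: find() is the single-result counterpart of find_all(). A quick sketch of the difference:

```python
import bs4

soup = bs4.BeautifulSoup("<p>one</p><p>two</p>", "html.parser")
# find() returns the first match, or None when nothing matches;
# find_all() always returns a list (possibly empty).
print(soup.find("p"))           # <p>one</p>
print(soup.find("nosuchtag"))   # None
print(len(soup.find_all("p")))  # 2
```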

Combining tag name, attribute, and text filters

    soup.find_all("p", "title")
    # [<p class="title"><b>The Dormouse's story</b></p>]
    # the second positional argument matches the CSS class here
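A sketch of a fuller mix: a tag name, a keyword attribute filter, and a text filter in one call (the markup is made up for illustration):

```python
import bs4

html = ('<a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>')
soup = bs4.BeautifulSoup(html, "html.parser")

# class_ filters on the CSS class; string filters on the tag's text.
result = soup.find_all("a", class_="sister", string="Lacie")
print(result)  # [<a class="sister" id="link2">Lacie</a>]
```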

get_text

If you only want the text contained in a tag, call the get_text() method. It returns all the text in the tag, including the text of all its descendants, as a single Unicode string:

    markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
    soup = bs4.BeautifulSoup(markup)
    soup.get_text()
    # '\nI linked to example.com\n'
    soup.i.get_text()
    # 'example.com'
    # get all of the text below an element -- here, the whole html_doc:
    print(bs4.BeautifulSoup(html_doc).get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    # ...
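get_text() also accepts a separator and a strip flag, which help when the text fragments run together. Using the same markup as above:

```python
import bs4

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = bs4.BeautifulSoup(markup, "html.parser")

# Join the text fragments with a separator...
print(repr(soup.get_text("|")))              # '\nI linked to |example.com|\n'
# ...and strip whitespace from each fragment first:
print(repr(soup.get_text("|", strip=True)))  # 'I linked to|example.com'
```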

Traversing the document tree

A tag's .contents attribute returns its child nodes as a list, while .children returns an iterator over the same nodes:

    print(type(soup.html.children))
    print(type(soup.html.contents))
    # <class 'list_iterator'>
    # <class 'list'>
    print([e for e in soup.html.children] == soup.html.contents)
    # True
    for tag in soup.html.children:
        print(tag.name)
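Beyond .children and .contents there are other navigation attributes; a small sketch with .parent, .string, and .strings (minimal markup made up for illustration):

```python
import bs4

soup = bs4.BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser")

title = soup.title
print(title.parent.name)   # head
print(title.string)        # The Dormouse's story
print(list(soup.strings))  # every text node in the document
```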

CSS selectors

Searching down through tags, level by level

    soup.select("body a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    soup.select("html head title")
    # [<title>The Dormouse's story</title>]

Finding the direct children of a tag

    soup.select("head > title")
    # [<title>The Dormouse's story</title>]
    soup.select("p > a")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    soup.select("p > a:nth-of-type(2)")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    soup.select("p > #link1")
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
    soup.select("body > a")
    # []
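Class, id, and attribute selectors work with select() as well; a sketch (the ends-with attribute selector assumes a reasonably recent Beautiful Soup, and the markup is made up for illustration):

```python
import bs4

html = ('<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>')
soup = bs4.BeautifulSoup(html, "html.parser")

print(len(soup.select(".sister")))               # 2 -- by CSS class
print(soup.select("#link1")[0].string)           # Elsie -- by id
print(soup.select('a[href$="lacie"]')[0]["id"])  # link2 -- attribute ends-with
```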