@nemos 2017-05-06T02:01:10.000000Z

beautifulsoup

py

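Parsers

Python standard library
- 'html.parser'
lxml HTML parser
- 'lxml'
- Requires the C library to be installed
- Fast
lxml XML parser
- 'xml'
- ['lxml', 'xml']
- Requires the C library to be installed
- Supports XML
html5lib
- 'html5lib'
- Slow
- Very tolerant of malformed markup
- Parses the page into valid HTML5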
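
A minimal sketch of picking a parser at run time, assuming only 'html.parser' is guaranteed to be present. The helper name make_soup and the preference order are illustrative, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("no usable parser found")

# Deliberately unclosed tags: every parser listed above repairs them
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.p.get_text())
```

Different parsers can build different trees from the same broken markup, so it is usually safer to name one parser explicitly once it is known to be installed.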

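Basic usage

from bs4 import BeautifulSoup
soupobj = BeautifulSoup('htmltext', 'html.parser') # turn the text into a BeautifulSoup object, naming the parser explicitly
betterhtml = soupobj.prettify() # serialize the tree as an indented string, one tag or string per line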
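
The same two calls as a runnable sketch. The variable html_text and its contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative input; any HTML string works here
html_text = "<p class='title'><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(html_text, "html.parser")

# prettify() returns a str; it does not modify the tree
pretty = soup.prettify()
print(pretty)
print(type(pretty).__name__)
```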

The four kinds of objects

Sample HTML

'''
<html>
<head><title>The Dormouse's story</title></head>
<body>
  <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
   ;and they lived at the bottom of a well.</p>
   <p class="story">...</p>
</body>
</html>
'''

Tag

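A Tag is an HTML tag together with everything inside it

soup.tag # attribute access returns the tag directly
# only the first matching Tag is returned

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.p
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>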

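Tag attributes

The name attribute

>>> soup.head.name # a tag's name is its own tag name
head

The attrs attribute

>>> soup.p.attrs # dict of the HTML element's attributes
{'class': ['title'], 'name': 'dromouse'}
>>> soup.p['class']
>>> soup.p.attrs['class']
>>> soup.p.get('class') # all three are equivalent when the attribute exists
['title']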
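
A short sketch of where the three forms differ: indexing raises KeyError for a missing attribute, while get() returns None. The markup is a trimmed variant of the sample with a second class added for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="title big" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.name)        # p
print(p["class"])    # ['title', 'big']  (class is multi-valued, so it is a list)
print(p["name"])     # dromouse
print(p.get("id"))   # None: get() does not raise for a missing attribute

try:
    p["id"]
except KeyError:
    print("p['id'] raises KeyError")
```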

NavigableString

>>> soup.p.string # the text inside the tag
The Dormouse's story

BeautifulSoup

>>> soup.name # the BeautifulSoup object behaves like a special Tag
[document]

Comment

A special kind of NavigableString
Holds the content of a comment
Its type is bs4.element.Comment
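
Because .string returns a Comment without its <!-- --> markers, a comment can be mistaken for ordinary text. A small sketch of telling them apart with isinstance, using a trimmed version of the sample links:

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):
    s = a.string
    # Comment is a subclass of NavigableString, so test for Comment first
    kind = "comment" if isinstance(s, Comment) else "text"
    print(a["id"], kind, repr(str(s)))
```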

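Navigating the tree

Child nodes

soup.tag.contents    # the tag's direct children, as a list
soup.tag.children    # an iterator over the same direct children
soup.tag.descendants # a generator over all descendants, at any depth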
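
A sketch contrasting the three on a small made-up fragment. Note that strings count as nodes too, so they show up among the descendants.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

# Direct children only: the two <p> tags
print(len(div.contents))
print([child.name for child in div.children])

# Every node below div, in document order; strings have name None
desc = [d.name if d.name else str(d) for d in div.descendants]
print(desc)
```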

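Node contents

soup.tag.string           # the node's text content
                          # returns the text of a sole child; None if the tag has more than one child
soup.tag.strings          # generator over every string inside the tag
soup.tag.stripped_strings # same, with whitespace stripped and whitespace-only strings skipped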
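
A sketch of the None case and the two generators, on a made-up paragraph with mixed content:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Once <b>upon</b> a time </p>", "html.parser")
p = soup.p

print(p.string)                    # None: <p> has three children
print(p.b.string)                  # upon
print(list(p.strings))             # every string, whitespace kept
print(list(p.stripped_strings))    # stripped, empty strings dropped
```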

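Parent nodes

soup.tag.parent
soup.head.title.string.parent.name # still title: a string's parent is the tag that contains it
soup.tag.parents # generator over all ancestors, up to the BeautifulSoup object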
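
A sketch of walking upward from a tag; the chain ends at the BeautifulSoup object, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")
title = soup.title

print(title.parent.name)                    # head
print(title.string.parent.name)             # title
ancestors = [p.name for p in title.parents]
print(ancestors)                            # ['head', 'html', '[document]']
```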

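Sibling nodes

soup.tag.next_sibling      # the next sibling node
soup.tag.previous_sibling  # the previous sibling node
soup.tag.next_siblings     # generator over all following siblings
soup.tag.previous_siblings # generator over all preceding siblings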
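
A common surprise is that the whitespace between two tags is itself a sibling node. A sketch on a made-up fragment, using find_next_sibling() to skip over it:

```python
from bs4 import BeautifulSoup

html = "<p><a id='a1'>x</a>\n<a id='a2'>y</a></p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="a1")

# The immediate sibling is the newline string, not the second <a>
print(repr(str(first.next_sibling)))

# find_next_sibling() filters by tag name and so skips the string
second = first.find_next_sibling("a")
print(second["id"])
print(first.previous_sibling)   # None: nothing precedes the first <a>
```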

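Next and previous elements

soup.tag.next_element     # the node parsed immediately after this one
soup.tag.previous_element # the node parsed immediately before; both ignore nesting level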
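
A sketch of how next_element differs from next_sibling: the former follows parse order and so steps into the tag, the latter stays on the same level. The fragment is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b>tail</p>", "html.parser")
b = soup.b

print(b.next_sibling)           # tail  (same level, after </b>)
print(b.next_element)           # bold  (the string inside <b> was parsed next)
print(b.previous_element.name)  # p     (the enclosing tag was parsed just before)
```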

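Searching the tree

# Searches the descendants of the current tag and returns those that match the filters
find_all(name,# a string matches tags with that name
              # a compiled regex is matched against tag names with search()
              # a list matches any one of its items
              # True matches every tag
              # a function receives each tag and matches when it returns True
              #
    attrs,    # dict for attributes that cannot be keyword names, e.g. attrs={'data-foo' : 'value'}
    recursive,# set to False to search direct children only
    text,     # search the strings in the document (named string since bs4 4.4.0)
    limit,    # cap the number of results
    **kwargs) # filter on tag attributes, e.g. id='id1', href=re.compile('some')
              # a Python keyword takes a trailing underscore: class_='class1'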
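
A sketch exercising each kind of filter listed above. The markup is a cut-down, slightly altered version of the sample, and the data-foo attribute is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
  <b data-foo="value">bold</b>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_string = soup.find_all("a")                        # tag name
by_regex = soup.find_all(re.compile("^b"))            # names starting with b
by_list = soup.find_all(["a", "b"])                   # any name in the list
by_true = soup.find_all(True)                         # every tag
by_func = soup.find_all(lambda tag: tag.has_attr("id"))
by_attrs = soup.find_all(attrs={"data-foo": "value"}) # not a valid keyword name
by_kwargs = soup.find_all("a", class_="sister", limit=1)
by_href = soup.find_all(href=re.compile("lacie"))
direct = soup.find_all("a", recursive=False)          # <a> is not a direct child of soup

print([t.name for t in by_true])
print([t["id"] for t in by_func])
print(len(direct))
```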

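find()                   # like find_all, but returns only the first match (None if there is none)
find_parents()           # search the ancestors
find_parent()
find_next_siblings()     # search the siblings that follow
find_next_sibling()
find_previous_siblings() # search the siblings that precede
find_previous_sibling()
find_all_next()          # search every node that follows, regardless of nesting
find_next()
find_all_previous()      # same, searching backwards
find_previous()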
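
A sketch of the directional variants, starting from the middle paragraph of a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<div><p id='p1'>one</p><p id='p2'>two</p><p id='p3'>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
p2 = soup.find(id="p2")

print(p2.find_parent("div").name)                    # div
print(p2.find_next_sibling("p")["id"])               # p3
print(p2.find_previous_sibling("p")["id"])           # p1
print([t["id"] for t in p2.find_all_next("p")])      # ['p3']
print(p2.find_previous("p")["id"])                   # p1
print(soup.find("span"))                             # None: no match
```

Since find() and its single-result relatives return None on a miss, check the result before indexing into it.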

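CSS selectors

# select() returns a list; call get_text() on an item to get its text
soup.select('title')                      # by tag name
soup.select('.class1')                    # by class
soup.select('#id1')                       # by id
soup.select('title #id1')                 # combined: #id1 anywhere inside a title
soup.select('head > title')               # direct child
soup.select('a[href="http://python.org"]')# by attribute value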
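
A sketch running each selector form against a cut-down version of the sample. It assumes a bs4 release where select() is available (recent versions rely on the soupsieve package, which is installed alongside bs4).

```python
from bs4 import BeautifulSoup

html = """<html><head><title>Story</title></head>
<body><p class="story"><a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("title")[0].get_text())                  # by tag name
print([a.get_text() for a in soup.select(".sister")])      # by class
print(soup.select("#link2")[0]["href"])                    # by id
print(len(soup.select("p #link1")))                        # descendant combination
print(len(soup.select("head > title")))                    # direct child
print(soup.select('a[href="http://example.com/lacie"]')[0].get_text())

# select_one() returns the first match, or None, instead of a list
print(soup.select_one("a.sister")["id"])
```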
