Getting Started with Python Web Crawlers (Part 1)

@FadeTrack 2015-09-07T12:26:48.000000Z

Python crawlers

I. What is a crawler

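Here is the definition given by Baidu Baike:

A web crawler is a program that automatically fetches web page content, and it is an important component of search engines.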

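Why write a crawler in Python
We use Python because its rich set of modules makes writing a crawler very convenient, not because crawlers are unique to Python. There is no best language, only the most suitable one, and this is a good example: writing a crawler in C++ would be more trouble than it is worth.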

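With that settled, the question becomes:
How do we fetch a web page?
With a module.
Obviously. I'm asking which module.
To keep things simple, these two posts mainly cover one module: urllib.request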

II. Learning an unfamiliar module with a purpose ---- urllib.request

First we need to import urllib.request

import urllib.request

Next, use dir(urllib.request) to see which functions and classes it provides.
Since this is a beginner's crawler, the only one we need to care about is urlopen.
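
As a rough sketch of that exploration step (filtering out the underscore-prefixed names and peeking at the docstring are my own additions here, not something dir() does for you):

```python
import urllib.request

# dir() returns every attribute name, including private ones;
# keep only the public names to make the list easier to scan.
public_names = [name for name in dir(urllib.request) if not name.startswith('_')]

print(len(public_names), 'public names found')
print('urlopen' in public_names)  # the function we care about

# The first docstring line is a quick summary of what urlopen does.
doc = (urllib.request.urlopen.__doc__ or '').strip()
summary = doc.splitlines()[0] if doc else '(no docstring)'
print(summary)
```

The exact number of names depends on the Python version, so treat the count as informational only.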
Open the Python 3.4 documentation and type urlopen into the index.

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Open the URL url, which can be either a string or a Request object.

It opens a URL; the URL can be either a string or a Request object.
We'll leave the object form aside for now and just use the simple form.

ret = urllib.request.urlopen('http://www.baidu.com')

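It returns an object. What kind of object? You can actually get by without knowing, which is one of Python's charms. Of course, we can also use type(ret) to check the type of the returned object: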

<class 'http.client.HTTPResponse'>

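Whether or not we know the type, we can still use the object, because the same trick works again.
Use dir(ret) or help(ret) to list this object's methods.
We can easily spot names such as read() and readall().
From that we can guess that these are roughly the functions we need to read out the data fetched from the page.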
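
A small sketch of that search, run against the class itself so that no network request is needed (looking at http.client.HTTPResponse directly, instead of at the ret instance, is my shortcut):

```python
import http.client

# Collect the public attribute names that mention "read".
read_like = sorted(
    name for name in dir(http.client.HTTPResponse)
    if 'read' in name and not name.startswith('_')
)
print(read_like)
```

The exact list depends on the Python version; readall() in particular may not show up on versions newer than the 3.4 used in this post, while read() is always there.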

Look up readall(), read(), or others such as seek() in the documentation.

We quickly come across this passage:

read(size=-1)
Read up to size bytes from the object and return them.
As a convenience, if size is unspecified or -1, readall() is called.
Otherwise, only one system call is ever made. Fewer than size bytes
may be returned if the operating system call returns fewer than size
bytes.

If 0 bytes are returned, and size was not 0, this indicates end of
file. If the object is in non-blocking mode and no bytes are
available, None is returned.

readall()
Read and return all the bytes from the stream until EOF,
using multiple calls to the stream if necessary.

Put simply, read() reads the data out of this object; the other takeaway is that read() with no size simply delegates to readall().
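
To get a feel for the size argument without touching the network, here is a minimal sketch that uses io.BytesIO as a stand-in for the response object (the stand-in and the sample bytes are my assumptions; a real response is also a binary stream, but it reads from a socket):

```python
import io

# io.BytesIO plays the role of the response: a readable binary stream.
stream = io.BytesIO(b'<html><body>hello</body></html>')

head = stream.read(6)    # at most 6 bytes
rest = stream.read()     # no size given: everything up to EOF
at_eof = stream.read()   # nothing left, so an empty bytes object

print(head)    # b'<html>'
print(rest)    # b'<body>hello</body></html>'
print(at_eof)  # b''
```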

At this point we can simply write:

html = urllib.request.urlopen('http://www.baidu.com').read()

Now html holds the raw bytes of the page.

>>> type(html)
<class 'bytes'>

The rest is the usual trick: convert the bytes to a str

html = str(html)

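If you have really read this far, it's time to discuss a new problem.
When I use print(html) to try to display the page source, all the Chinese text turns into "gibberish".

Did we forget something?
Right, let's go back and take a look at the bytes class.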

Bytes
A bytes object is an immutable array. The items are 8-bit bytes, represented by integers in the range 0 <= x < 256. Bytes literals (like b'abc') and the built-in function bytes() can be used to construct bytes objects. Also, bytes objects can be decoded to strings via the decode() method.

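As explained above, a bytes object is an immutable array whose items are 8 bits wide, so each value lies in the range 0 to 255 inclusive.
A Chinese character does not fit in one such item; it is stored as several bytes. str() does not decode those bytes at all: it just returns the printable form b'...', in which only ASCII bytes appear as characters and every other byte is shown as a \x escape.
As also mentioned above, we need the decode() method to decode the bytes and get our Chinese content back.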
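
The difference is easy to see on a tiny sample. This sketch assumes the bytes are UTF-8 and uses a two-character string in place of a downloaded page:

```python
# Stand-in for the bytes returned by read(); assumes the page is UTF-8.
raw = '百度'.encode('utf-8')

as_repr = str(raw)             # no decoding: just the printable form of the bytes
as_text = raw.decode('utf-8')  # real decoding back to characters

print(as_repr)                 # b'\xe7\x99\xbe\xe5\xba\xa6'
print(len(raw), len(as_text))  # 6 2
print(as_text == '百度')        # True
```

Six bytes go in and two characters come out, which is why the str() version looked like gibberish.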

OK, let's dig into decode() and see what we can do with it.

bytearray.decode(encoding="utf-8", errors="strict")
Return a string decoded from the given bytes.
Default encoding is 'utf-8'. errors may be given to set a different error handling scheme.
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError.
Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

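From this we can see that the first parameter is the encoding and the second is the error-handling scheme ('strict' by default).
Both have default values.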
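
Here is a minimal sketch of what the errors parameter changes. The sample bytes are my own: text encoded as GBK and then deliberately decoded with the wrong codec.

```python
# Bytes that are valid GBK but not valid UTF-8 (illustrative sample data).
raw = '中文'.encode('gbk')

try:
    raw.decode('utf-8')  # errors='strict' is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode('utf-8', errors='ignore')    # undecodable bytes are dropped
replaced = raw.decode('utf-8', errors='replace')  # they become U+FFFD instead

print(strict_failed)               # True
print('\ufffd' in replaced)        # True
print(raw.decode('gbk') == '中文')  # True: the right codec recovers the text
```

Neither 'ignore' nor 'replace' brings the text back; they only keep the program from crashing.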
So we can simply write

html = html.decode()

After that (html is already a str at this point, so this call changes nothing):

html = str(html)

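Now you'll ask: why didn't you pass any arguments?

Because the default is 'utf-8', and 'utf-8' is exactly what we want to pass.

Then why do we want 'utf-8'?

Simply right-click on a web page ---- View Source.
Near the top of the source you'll find something like this: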

 <meta http-equiv=Content-Type content="text/html;charset=utf-8">

Someone will say: what I see when I open it is:

<meta http-equiv="Content-Type" content="text/html; charset=gbk" />

In that case we simply write

html = html.decode('gbk')

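At this point some of you may start to grin: so that's it, we just fill in whatever comes after charset.
Not quite. What really determines the encoding is how the bytes were actually encoded; a charset declaration that matches is just the usual case.
For details on the available encodings, search for decode() in the official Python 3.4.3 documentation and find the bytearray entry.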

Then click Standard Encodings in the detailed description of decode.
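
For the usual case, where the declaration can be trusted, reading the charset out of the page can be sketched like this. sniff_charset and decode_page are hypothetical helpers of mine, and the regular expression is a rough approximation, not a real HTML parser:

```python
import codecs
import re

# Matches charset=... inside a <meta> tag, with or without quotes.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_-]+)', re.IGNORECASE)

def sniff_charset(raw, default='utf-8'):
    """Guess the encoding from the first bytes of a page (hypothetical helper)."""
    match = CHARSET_RE.search(raw[:2048])
    if not match:
        return default
    name = match.group(1).decode('ascii')
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name

def decode_page(raw):
    """Decode with the declared charset; bad bytes become U+FFFD instead of raising."""
    return raw.decode(sniff_charset(raw), errors='replace')

page = ('<meta http-equiv="Content-Type" content="text/html; charset=gbk" />'
        '<p>中文</p>').encode('gbk')
print(sniff_charset(page))         # gbk
print('中文' in decode_page(page))  # True
```

If the page declares one charset but was saved in another, this still produces wrong text, which is exactly the caveat above.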

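With that we've successfully taken the first step, and we roughly understand how the module is used.
Whenever you run into a new module, you can approach it the same way.
Of course, this only works for standard modules and modules that come with documentation. O(@.@)O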
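
Pulling the steps of this section together, one possible sketch looks like the following. fetch_text is a hypothetical helper, the User-Agent value is an arbitrary choice, and taking the charset from the response headers is my assumption about a reasonable default, not something covered above. The demo uses a data: URL so that it runs without a network connection; swap in 'http://www.baidu.com' to try a real site.

```python
import base64
import urllib.request

def fetch_text(url, timeout=10, fallback='utf-8'):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    # A Request object is the other thing urlopen() accepts besides a string.
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Charset from the Content-Type header, if one was sent.
        charset = response.headers.get_content_charset() or fallback
    return raw.decode(charset, errors='replace')

# Offline demo: a data: URL carrying a GBK-encoded snippet.
body = '<p>中文</p>'.encode('gbk')
demo_url = ('data:text/html;charset=gbk;base64,'
            + base64.b64encode(body).decode('ascii'))
text = fetch_text(demo_url)
print(text == '<p>中文</p>')  # True
```

A header that names a codec Python does not know would raise LookupError here; handling that case is left out to keep the sketch short.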

III. Afterword to Part 1

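If you've read this far, you really do have perseverance, even though I tried to make the writing as fun as I could.
You can start getting familiar with
re ---- regular expressions
BeautifulSoup ---- HTML parsing
os ---- file and directory operations
pickle ---- pickles (object serialization)
and other modules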

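Finally, for the things I didn't cover, head over to "How do I get started with Python crawlers?"

It's a thoughtful answer from the experts on Zhihu, and I'm sure that much solid material will whet your appetite. See you in Part 2~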