@twein89
        
        2017-06-12T09:51:58.000000Z
Tags: Crawler
Repository:
git@gitlab.qjdchina.com:yantengwei/business-python.git (branch: master, tag: d0.0.3-1)
See the pyspider API docs,
and the pyquery docs for the page parser.
The framework author's blog has a series of Chinese pyspider tutorials; it is worth reading those first.
To debug a project, e.g. the court_zhixing.py script,
you can debug it locally with the pyspider one command, e.g. pyspider one court/court_zhixing.py,
or debug it in the webUI.
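court_zhixing.py itself is not reproduced in this post; as a rough stand-in, the minimal kind of handler that pyspider one can run is pyspider's default project template (the URL below is just a placeholder):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    def index_page(self, response):
        # response.doc is a pyquery object over the fetched page
        return {"title": response.doc('title').text()}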
1. Project data migration:
Projects saved from the pyspider webUI are stored in the database's projectdb table, so sometimes you need to migrate them between the file database and another database type.
Taking sqlite and mongodb as an example:
cd spider_tender 
Migrate the projects from sqlite to mongodb:
python tools/migrate.py sqlite+projectdb:////Users/yantengwei/PycharmProjects/mywork/spider_tender/data/project.db  mongodb+projectdb://127.0.0.1:27017/projectdb 
Migrate the projects from mongodb back to local sqlite:
python tools/migrate.py mongodb+projectdb://127.0.0.1:27017/projectdb sqlite+projectdb:////Users/yantengwei/PycharmProjects/mywork/spider_tender/data/project.db
2. Project submission: start the project locally from local_config:
pyspider -c config/local_config.json 
Paste the code into the webUI and save it; the project is automatically saved to the local sqlite file at spider_tender/data/project.db. A git commit then updates the project.
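The repo's local_config.json is not reproduced here. pyspider config files are JSON in which each key mirrors a command-line option, so a local sqlite setup would plausibly look like the sketch below (the paths and port are assumptions):

{
    "taskdb": "sqlite+taskdb:///data/task.db",
    "projectdb": "sqlite+projectdb:///data/project.db",
    "resultdb": "sqlite+resultdb:///data/result.db",
    "webui": {
        "port": 5000
    }
}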
3. Project deployment: see the deployment doc for environment setup and deployment.
1. Config file notes:
The pre and online config files are in the spider_tender/config folder.
Taking online_config.json as an example:
{
    "result_worker": {
        "result_cls": "mongo_result_worker.OnlineResultWorker"
    },
    "court_mongo_param": {
        "ip": "10.171.250.51",
        "port": "27017",
        "database": "proxy",
        "username": "proxy",
        "password": "proxy_1",
        "replica": "online"
    }
}
The result_worker part is a built-in framework setting:

"result_worker": {
    "result_cls": "mongo_result_worker.OnlineResultWorker"
},
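result_cls points pyspider at a custom result worker class. The actual mongo_result_worker.OnlineResultWorker is not shown in this post; a minimal sketch of such a subclass would be:

from pyspider.result import ResultWorker

class OnlineResultWorker(ResultWorker):
    def on_result(self, task, result):
        # called for every finished task; write the result to
        # MongoDB here instead of pyspider's default resultdb
        ...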
This part is a custom configuration block:

"court_mongo_param": {
    "ip": "10.171.250.51",
    "port": "27017",
    "database": "proxy",
    "username": "proxy",
    "password": "proxy_1",
    "replica": "online"
}
Reading the custom configuration relies on the click module's API; the code that reads the court MongoDB configuration lives in spider_tender/court/court_db.py:
import click

# pyspider parses the -c config file with click and keeps it on the
# click context object, so code running inside the process can read it
config = click.get_current_context().obj['config']
db_param = config.get('court_mongo_param')
ip = db_param['ip']
port = int(db_param['port'])  # the port is a string in the JSON config
database = db_param['database']
replica = db_param['replica']
username = db_param['username']
password = db_param['password']
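court_db.py's actual connection code is not shown here; a minimal sketch of turning these parameters into a pymongo connection (assuming the standard mongodb:// URI scheme, and assuming 'replica' names the replica set) could be:

from pymongo import MongoClient

# build a standard mongodb:// URI from the config values above
uri = "mongodb://%s:%s@%s:%d/%s" % (username, password, ip, port, database)
client = MongoClient(uri, replicaSet=replica)  # replicaSet use is an assumption
db = client[database]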
2. Saving crawl results
There are two ways: write a result_worker, or implement on_result in the crawl script itself; a minimal on_result sketch follows.
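As a sketch of the second approach (hypothetical handler; the actual court scripts are not reproduced here), overriding BaseHandler's on_result hook looks like:

def on_result(self, result):
    # called once for every result a callback returns; write to a custom
    # store here, and call super() to keep the default resultdb behavior
    if result:
        self.save_to_court_db(result)  # hypothetical helper
    super(Handler, self).on_result(result)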
If you want to return multiple results from a single response, use the self.send_message and on_message methods; see http://docs.pyspider.org/en/latest/apis/self.send_message/ in the docs:
def detail_page(self, response):
    for i, each in enumerate(response.json['products']):
        self.send_message(self.project_name, {
            "name": each['name'],
            'price': each['prices'],
        }, url="%s#%s" % (response.url, i))  # unique url per item so results don't overwrite each other

def on_message(self, project, msg):
    return msg  # each message is then saved as a normal result
There is also an example of this in court_wenshu.py.
3. Proxy configuration:
from pyspider.libs.base_handler import *
from settings import USER_AGENT

class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v10',
        # proxy in username:password@host:port form, applied to every request
        'proxy': 'http://duoipyewfynqt:wAgJx3wLYjF6A@222.184.35.196:33610',
        'headers': {
            'User-Agent': USER_AGENT
        }
    }

    @every(minutes=3 * 60)
    def on_start(self):
        # fetch an echo-your-ip page several times to check the proxy works
        for i in range(20):
            self.crawl("http://ip.chinaz.com/getip.aspx#{}".format(i), callback=self.page_parse)

    def page_parse(self, response):
        print(response.text)
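The proxy does not have to be global: self.crawl also accepts a proxy parameter (username:password@hostname:port form; pyspider only supports HTTP proxies), so individual requests can be routed differently. A sketch with placeholder credentials:

self.crawl("http://ip.chinaz.com/getip.aspx",
           proxy="username:password@hostname:port",  # placeholder, not a real proxy
           callback=self.page_parse)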