virtualenv spiderenv
source spiderenv/bin/activate
pip install Scrapy
The installation fails with an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 49:
Workaround
Locate the lib/python2.7/site.py file inside the virtualenv:
/home/zhanghe/code/wealink/wealink-web-spider/spiderenv/lib/python2.7/site.py
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0: # change this first "if 0:" (and only this one) to "if 1:"
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
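After saving the change, a quick sanity check is to start a fresh interpreter from the virtualenv and print the default encoding; assuming a UTF-8 system locale it should now report UTF-8 instead of ascii:
$ python -c "import sys; print(sys.getdefaultencoding())"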
Reinstall Scrapy
pip uninstall Scrapy
pip install Scrapy
If you skip the reinstall, you will hit another error:
ImportError: Twisted requires zope.interface 3.6.0 or later: no module named zope.interface.
Install the system dependencies
$ sudo apt-get install libffi-dev
List the installed packages
$ pip list
Scrapy==1.0.3
Twisted==15.4.0
argparse==1.2.1
cffi==1.2.1
characteristic==14.3.0
cryptography==1.0.1
cssselect==0.9.1
enum34==1.0.4
idna==2.0
ipaddress==1.0.14
lxml==3.4.4
pyOpenSSL==0.15.1
pyasn1==0.1.8
pyasn1-modules==0.0.7
pycparser==2.14
queuelib==1.4.2
scrapyd==1.1.0
scrapyd-client==1.0.1
service-identity==14.0.0
six==1.9.0
w3lib==1.12.0
wsgiref==0.1.2
zope.interface==4.1.2
Export the package list (local environment only)
$ pip freeze > requirements.txt
Install the dependencies from the list
$ pip install -r requirements.txt
Create a project
$ scrapy startproject csdn
Enter the project directory
$ cd csdn
Create a new spider
$ scrapy genspider csdnblog blog.csdn.net
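The command creates csdn/spiders/csdnblog.py from the default "basic" template. Roughly (reconstructed from Scrapy 1.0's template, so the exact start_urls value may differ), the generated skeleton looks like this:
# -*- coding: utf-8 -*-
import scrapy


class CsdnblogSpider(scrapy.Spider):
    name = "csdnblog"
    allowed_domains = ["blog.csdn.net"]
    start_urls = (
        'http://www.blog.csdn.net/',
    )

    def parse(self, response):
        # parse() is filled in later with the selectors tested below.
        pass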
Crawl with the newly created spider
$ scrapy crawl csdnblog
Debug selectors in the terminal
$ scrapy shell "http://blog.csdn.net/QH_JAVA/article/category/1710027"
Save the scraped data
$ scrapy crawl csdnblog -o items.json
There is one warning; ignore it for now:
2015-09-21 16:44:06 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named pyasn1_modules.rfc2459'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
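If you would rather silence the warning than ignore it, the message itself points at the fix: install (or upgrade) service_identity and its pyasn1 dependencies inside the virtualenv, e.g.:
$ pip install --upgrade service_identity pyasn1 pyasn1-modules
This is only a suggestion based on the warning text; it is not required for the crawl to run.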
HTML structure of the article list page:
<div class="list_item article_item">
<span class="link_title">
<div class="article_description">
Test the selectors in the shell
>>> response.xpath('//div/h1/span').extract()
Title
>>> response.xpath('//div/h1/span/a/text()').extract()
Title link
>>> response.xpath('//div/h1/span/a/@href').extract()
Description
>>> response.xpath('//div[@class="article_description"]/text()').extract()
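These XPaths slot straight into the parse() method of the CsdnblogSpider skeleton shown earlier. A minimal sketch that yields one dict per article (the field names title/link/description are my own choice; Scrapy 1.0+ accepts plain dicts as items):
    def parse(self, response):
        # Iterate over each article block on the list page.
        for article in response.xpath('//div[@class="list_item article_item"]'):
            yield {
                'title': article.xpath('.//h1/span/a/text()').extract(),
                'link': article.xpath('.//h1/span/a/@href').extract(),
                'description': article.xpath(
                    './/div[@class="article_description"]/text()').extract(),
            }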
HTML structure of the article detail page:
<div id="article_details" class="details">
<div class="article_title">
<h1>
<span class="link_title">
<a>
<div id="article_content" class="article_content">
Test the selectors in the shell
Title
>>> response.xpath('//div[@class="details"]/div[@class="article_title"]/h1/span/a/text()').extract()
Link
>>> response.xpath('//div[@class="details"]/div[@class="article_title"]/h1/span/a/@href').extract()
Content
>>> response.xpath('//div[@class="details"]/div[@class="article_content"]').extract()
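If the spider should follow each list entry through to its detail page, the detail-page selectors go into a second callback. A sketch of the two callbacks working together inside the same spider class (parse_article is a hypothetical name; extract_first() and response.urljoin() are available from Scrapy 1.0 on):
    def parse(self, response):
        # Follow every article link found on the list page.
        for href in response.xpath('//div/h1/span/a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_article)

    def parse_article(self, response):
        title_path = '//div[@class="details"]/div[@class="article_title"]/h1/span/a'
        yield {
            'title': response.xpath(title_path + '/text()').extract_first(),
            'link': response.url,
            'content': response.xpath(
                '//div[@class="details"]/div[@class="article_content"]').extract_first(),
        }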
Scrapyd is an application for deploying and running Scrapy spiders.
Start Scrapyd
$ scrapyd
2015-09-22 16:28:08+0800 [-] Log opened.
2015-09-22 16:28:08+0800 [-] twistd 15.4.0 (/home/zhanghe/code/wealink/wealink-web-spider/spiderenv/bin/python 2.7.6) starting up.
2015-09-22 16:28:08+0800 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2015-09-22 16:28:08+0800 [-] Site starting on 6800
2015-09-22 16:28:08+0800 [-] Starting factory <twisted.web.server.Site instance at 0xb614b36c>
2015-09-22 16:28:08+0800 [Launcher] Scrapyd 1.1.0 started: max_proc=4, runner='scrapyd.runner'
Deploy the project
$ scrapyd-deploy csdn_deploy -p csdn
Packing version 1442910194
Deploying to project "csdn" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "csdn", "version": "1442910194", "spiders": 1, "node_name": "ubuntu"}
List the configured deploy targets
$ scrapyd-deploy -l
csdn_deploy http://localhost:6800/
List the projects available on a deploy target
$ scrapyd-deploy -L csdn_deploy
default
csdn
List the projects on the Scrapyd server
$ curl http://localhost:6800/listprojects.json
{"status": "ok", "projects": ["default", "csdn"], "node_name": "ubuntu"}
List the versions of a project
$ curl http://localhost:6800/listversions.json?project=csdn
{"status": "ok", "versions": ["1442910194"], "node_name": "ubuntu"}
List the spiders in a project
$ curl http://localhost:6800/listspiders.json?project=csdn
{"status": "ok", "spiders": ["csdnblog"], "node_name": "ubuntu"}
Schedule a new job
$ curl http://localhost:6800/schedule.json -d project=csdn -d spider=csdnblog
{"status": "ok", "jobid": "6cf34b2e611011e59ff6000c29e23801", "node_name": "ubuntu"}
List jobs
$ curl http://localhost:6800/listjobs.json?project=csdn
{"status": "ok",
"running": [],
"finished": [
{"start_time": "2015-09-22 17:49:33.795207", "end_time": "2015-09-22 17:49:35.650616", "id": "38e5b66a610f11e59ff6000c29e23801", "spider": "csdnblog"},
{"start_time": "2015-09-22 17:50:08.796791", "end_time": "2015-09-22 17:50:10.791481", "id": "4ecfb4bc610f11e59ff6000c29e23801", "spider": "csdnblog"},
{"start_time": "2015-09-22 17:58:08.797639", "end_time": "2015-09-22 17:58:10.793856", "id": "6cf34b2e611011e59ff6000c29e23801", "spider": "csdnblog"}
],
"pending": [],
"node_name": "ubuntu"}
Cancel a job
$ curl http://localhost:6800/cancel.json -d project=csdn -d job=6cf34b2e611011e59ff6000c29e23801
Delete a project
$ curl http://localhost:6800/delproject.json -d project=csdn
The Scrapy architecture:
-----------------------------------------------------------------------------------
                       [Scheduler]
                            |
                            |
 [Item Pipeline] --- [Scrapy Engine] --- [Downloader middlewares] --- [Downloader]
                            |
                   [Spider middlewares]
                            |
                        [Spiders]
-----------------------------------------------------------------------------------
[Scrapy Engine]
The engine controls the flow of data between all components of the system and triggers events when certain actions occur.
For details, see the Data Flow section of the Scrapy architecture documentation.
[Scheduler]
The scheduler accepts requests from the engine and enqueues them, so that it can hand them back to the engine when the engine asks for them later.
[Downloader]
The downloader fetches web pages and feeds them to the engine, which in turn passes them on to the spiders.
[Spiders]
Spiders are classes written by Scrapy users to parse responses and extract items (i.e. the scraped data) or additional URLs to follow.
Each spider handles one specific site (or a few sites).
[Item Pipeline]
The Item Pipeline processes items once they have been extracted by the spiders.
Typical tasks are cleansing, validation and persistence (for example, storing the item in a database).
After an item has been collected by a spider it is passed to the Item Pipeline, where several components process it in a defined order.
Typical uses of item pipelines include (a minimal sketch follows this list):
Cleansing HTML data
Validating scraped data (checking that items contain certain fields)
Checking for (and dropping) duplicates
Storing the scraped results in a database
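A minimal sketch of such a pipeline, combining validation with a placeholder for persistence (the class and field names are illustrative, not taken from the csdn project):
from scrapy.exceptions import DropItem


class ValidateAndStorePipeline(object):
    def process_item(self, item, spider):
        # Validation: drop items that arrive without a title.
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        # Persistence would go here (e.g. an INSERT into a database).
        return item
It is switched on through ITEM_PIPELINES in settings.py, e.g. ITEM_PIPELINES = {'csdn.pipelines.ValidateAndStorePipeline': 300}.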
[Downloader middlewares]
Downloader middlewares are specific hooks between the engine and the downloader; they process requests on their way to the Downloader and responses the Downloader passes back to the engine.
They provide an easy mechanism for extending Scrapy functionality by plugging in custom code.
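As an illustration, a downloader middleware that stamps a custom User-Agent onto every outgoing request could look like this (a sketch, not part of the project above); it is enabled through DOWNLOADER_MIDDLEWARES in settings.py:
class CustomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # Called for every request before it reaches the Downloader.
        request.headers.setdefault('User-Agent', 'csdn-spider/0.1')
        return None  # returning None lets processing continue normally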
[Spider middlewares]
Spider middlewares are specific hooks between the engine and the spiders; they process spider input (responses) and output (items and requests).
They provide an easy mechanism for extending Scrapy functionality by plugging in custom code. See the Spider Middleware documentation for more information.
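For comparison, a spider middleware hooks the spider's output; the sketch below silently drops items without a title before they reach the item pipelines (illustrative only, enabled through SPIDER_MIDDLEWARES in settings.py):
class DropUntitledItemsMiddleware(object):
    def process_spider_output(self, response, result, spider):
        # result is the iterable of items/requests the spider returned.
        for element in result:
            if isinstance(element, dict) and not element.get('title'):
                continue  # drop untitled items
            yield element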
For more, see the Scrapy example walkthrough.