Redis-based components for Scrapy.
- Usage: https://github.com/rmax/scrapy-redis/wiki/Usage
- Documentation: https://github.com/rmax/scrapy-redis/wiki
- Release: https://github.com/rmax/scrapy-redis/wiki/History
- Contribution: https://github.com/rmax/scrapy-redis/wiki/Getting-Started
- LICENSE: MIT license
Distributed crawling/scraping
You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls.
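As a rough illustration of the shared queue, the sketch below seeds start URLs into a Redis list that several spider instances could then consume. It assumes the upstream default key pattern `<spider name>:start_urls` and a redis-py style client that exposes `lpush`; the helper name `seed_start_urls` and the spider name are hypothetical.

```python
def seed_start_urls(client, spider_name, urls):
    """Push start URLs onto the Redis list shared by all instances of a spider.

    `client` is assumed to be a redis-py style client (anything with `lpush`).
    The key pattern "<spider_name>:start_urls" is the upstream default and may
    be configured differently in your project.
    """
    urls = list(urls)
    if not urls:
        # LPUSH with no values is an error in Redis, so skip the call.
        return 0
    key = f"{spider_name}:start_urls"
    # lpush returns the length of the list after the push.
    return client.lpush(key, *urls)


# Example usage (requires a running Redis server and redis-py):
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379")
#   seed_start_urls(client, "myspider", ["https://example.com"])
```

Every spider instance started with the same name reads from the same list, so adding workers needs no change on the seeding side.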
Distributed post-processing
Scraped items get pushed into a Redis queue, meaning that you can start as many post-processing processes as needed, all sharing the items queue.
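A post-processing worker could look roughly like the sketch below. It assumes items are stored as JSON strings in a Redis list (upstream's default key pattern is `<spider name>:items`) and that the client is redis-py style with a `blpop` method; the function name `drain_items` is hypothetical.

```python
import json


def drain_items(client, key, timeout=1):
    """Yield decoded items from a Redis list until it stays empty.

    `client.blpop` blocks for up to `timeout` seconds and returns either
    a (key, value) pair or None when nothing arrived in time.
    """
    while True:
        popped = client.blpop([key], timeout=timeout)
        if popped is None:
            return
        _key, raw = popped
        yield json.loads(raw)


# Example usage (requires a running Redis server and redis-py):
#   import redis
#   client = redis.Redis.from_url("redis://localhost:6379")
#   for item in drain_items(client, "myspider:items"):
#       print(item)
```

Because each pop removes the item from the list, several workers can run this loop against the same key without processing an item twice.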
Scrapy plug-and-play components
Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
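To give a sense of how these components plug into a project, here is a minimal `settings.py` fragment using the class paths from upstream scrapy-redis. The Redis URL and the pipeline priority are assumptions for illustration; adjust them to your setup.

```python
# settings.py (fragment)

# Store the request queue in Redis so all spider instances share it.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all instances through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dupefilter between runs so crawls can be paused and resumed.
SCHEDULER_PERSIST = True

# Push scraped items into a Redis list for post-processing.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,  # priority value is an assumption
}

# Assumed local Redis instance; replace with your own server.
REDIS_URL = "redis://localhost:6379"
```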
In this forked version: added support for `json` data in Redis. The data contains `url`, `meta` and other optional parameters. `meta` is a nested json which contains sub-data. This function extracts this data and sends another FormRequest with `url`, `meta` and additional `formdata`. For example:
{ "url": "https://exaple.com", "meta": {"job-id":"123xsd", "start-date":"dd/mm/yy"}, "url_cookie_key":"fertxsas" }
This data can be accessed in a Scrapy spider through the response, for example: `response.request.url`, `response.request.meta`, `response.request.cookies`.
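The extraction step described above can be sketched as follows. This is not the fork's actual code, only an illustration under the assumption that `url` is required, `meta` is optional, and every remaining top-level key is treated as form data; the function name `split_payload` is hypothetical.

```python
import json


def split_payload(raw):
    """Split a JSON payload from Redis into (url, meta, formdata).

    Assumption: every top-level key other than "url" and "meta" becomes
    form data, with values converted to strings.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    data = json.loads(raw)
    url = data.pop("url")
    meta = data.pop("meta", {})
    formdata = {key: str(value) for key, value in data.items()}
    return url, meta, formdata


# The three parts could then be passed to Scrapy, for example:
#   scrapy.FormRequest(url=url, meta=meta, formdata=formdata)
```

Applied to the example payload above, this would separate the `meta` dictionary from `url_cookie_key`, which would end up in the form data.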
Note
These features cover the basic case of distributing the workload across multiple workers. If you need more features like URL expiration, advanced URL prioritization, etc., we suggest you take a look at the Frontera project.
- Python 3.7+
- Redis >= 5.0
- Scrapy >= 2.0
- redis-py >= 4.0
From pip
pip install scrapy-redis
From GitHub
git clone https://github.com/darkrho/scrapy-redis.git
cd scrapy-redis
python setup.py install
Note
To use the JSON data feature, make sure you have not installed scrapy-redis through pip. If you already have, uninstall it first:
pip uninstall scrapy-redis
Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large-scale online web crawler.