Introduction
Redis-based components for Scrapy.
- Free software: MIT license
- Python support: 3.8+
- Scrapy support: 2.6+
- Distributed crawling/scraping

  You can start multiple spider instances that share a single Redis queue. Best suited for broad multi-domain crawls (see the configuration sketch after this list).
- Distributed post-processing

  Scraped items get pushed into a Redis queue, so you can start as many post-processing processes as needed, all sharing the items queue (see the worker sketch after this list).
- Scrapy plug-and-play components

  Scheduler + Duplication Filter, Item Pipeline, Base Spiders.
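
As a concrete illustration, the plug-and-play components are enabled through a project's Scrapy settings. The sketch below uses the component paths documented by scrapy-redis; the Redis address and spider name `myspider` used throughout are assumptions:

```python
# settings.py -- wire the scrapy-redis components into a Scrapy project.

# Use the Redis-backed scheduler so all spider processes share one queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep the request queue between runs instead of clearing it on close.
SCHEDULER_PERSIST = True

# Deduplicate requests across all spider instances via Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Push scraped items into Redis for the post-processing workers.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}

# Where the Redis server lives (assumed to be local here).
REDIS_URL = "redis://localhost:6379"
```

With this in place you can run several `scrapy crawl myspider` processes on different machines and seed them all with, e.g., `redis-cli lpush myspider:start_urls https://example.com` (assuming the default `<spider>:start_urls` key).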
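For the post-processing side, `RedisPipeline` by default serializes each item to JSON and pushes it onto a `<spider>:items` list, so a worker only needs a Redis client. A minimal consumer sketch, assuming that default key naming (`process` is a placeholder for your own logic):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)

def process(item):
    # Placeholder for whatever post-processing you need.
    print(item)

while True:
    # Block until an item is available, then pop it from the shared queue.
    _key, raw = r.blpop("myspider:items")
    process(json.loads(raw))
```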
In this forked version, JSON-formatted data in Redis is supported. The data contains `url`, `meta`, and other optional parameters. `meta` is a nested JSON object that contains sub-data; this feature extracts the data and sends another `FormRequest` with the `url`, the `meta`, and the additional parameters as `formdata`. For example:

```json
{"url": "https://example.com", "meta": {"job-id": "123xsd", "start-date": "dd/mm/yy"}, "url_cookie_key": "fertxsas"}
```

This data can be accessed in the Scrapy spider through the response, e.g. `response.url`, `response.meta`, `response.url_cookie_key`.
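
Under the hood, scrapy-redis spiders turn each queued entry into a request in the `make_request_from_data` hook of the spider base classes. The snippet below is a minimal sketch of that idea, not the fork's exact implementation; the spider name and `redis_key` are made up for illustration:

```python
import json

from scrapy import FormRequest
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = "myspider"
    redis_key = "myspider:start_urls"  # queue this spider reads from

    def make_request_from_data(self, data):
        """Build a request from one raw entry popped off the Redis queue."""
        text = data.decode("utf-8")
        try:
            params = json.loads(text)
        except ValueError:
            # Not JSON: fall back to the plain-URL behaviour.
            return self.make_request_from_url(text)

        url = params["url"]
        meta = params.get("meta", {})
        # Everything that is neither url nor meta travels as extra form data.
        formdata = {k: str(v) for k, v in params.items()
                    if k not in ("url", "meta")}
        return FormRequest(url, formdata=formdata, meta=meta, dont_filter=True)

    def parse(self, response):
        # meta values pushed with the queue entry are available here.
        self.logger.info("job-id: %s", response.meta.get("job-id"))
```

Entries can then be queued with, e.g., `redis-cli lpush myspider:start_urls '<json>'`, while plain URLs keep working through the fallback branch.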
These features cover the basic case of distributing the workload across multiple workers. If you need more features like URL expiration or advanced URL prioritization, we suggest you take a look at the Frontera project.
Requirements

- Python 3.8, 3.9, 3.10, 3.11
- Redis >=5.0
- Scrapy >=2.6.0
- redis-py >=4.2