Welcome to the Distributed Scraping Architecture project! This project leverages Scrapy, Celery, Redis, and scrapy-redis to create a scalable and robust web scraping framework.
In today's data-driven world, efficiently gathering and processing large datasets is crucial. This project aims to provide a distributed web scraping architecture that can handle large-scale data extraction tasks reliably. It combines the following components and features:
- Scrapy: Powerful web crawling and scraping framework.
- Celery: Asynchronous task queue/job queue for distributing scraping tasks.
- Redis: In-memory data structure store used as a message broker.
- scrapy-redis: Integration that distributes Scrapy requests across multiple nodes (a minimal spider sketch follows this list).
- Subprocess execution: An additional way of running scrapers via `subprocess`.
- Beginner-friendly workflow: A structured way to start distributed scraping.
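As a rough illustration of how scrapy-redis ties these pieces together: a spider inherits from `RedisSpider` and pulls its start URLs from a Redis list instead of a hard-coded `start_urls`. The class name, Redis key, and parsing logic below are illustrative assumptions, not the project's actual `amazon_spider.py`:

```python
# Hypothetical scrapy-redis spider sketch; the real spider lives in
# amazon_distribution/spiders/amazon_spider.py and will differ.
from scrapy_redis.spiders import RedisSpider


class ExampleSpider(RedisSpider):
    name = "example"
    # URLs are popped from this Redis list rather than defined statically.
    redis_key = "example:start_urls"

    def parse(self, response):
        # Placeholder parsing: yield the page title for each crawled URL.
        yield {"url": response.url, "title": response.css("title::text").get()}
```

For requests and the duplicate filter to be shared across nodes, the Scrapy project also needs the standard scrapy-redis settings (the values shown are common defaults and may differ from the project's `settings.py`):

```python
# settings.py (excerpt): standard scrapy-redis configuration
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True  # keep the request queue between runs
REDIS_URL = "redis://localhost:6379/0"
```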
To get started, clone the repository and install the necessary dependencies:
```bash
git clone https://github.com/milan1310/distributed-scrapy-scraping.git
cd distributed-scrapy-scraping
pip install -r requirements.txt
```
- Start Redis: Make sure you have Redis installed and running.

  ```bash
  redis-server
  ```

- Start Celery: Run a Celery worker to process tasks.

  ```bash
  celery -A tasks worker --loglevel=info
  ```

- Add URLs to the queue: Use the `add_urls.py` script to add URLs to the Redis queue (a sketch of this step follows the list).

  ```bash
  python add_urls.py
  ```

- Run the spider: Execute the spider to start scraping.

  ```bash
  python run_spider.py
  ```
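Under the hood, adding URLs to the queue amounts to pushing them onto the Redis list the spider reads from. The snippet below is only a sketch of what `add_urls.py` might do; the key name and seed URLs are assumptions:

```python
# Hypothetical sketch of add_urls.py: push seed URLs onto the spider's Redis list.
import redis

# Assumes Redis is running locally on the default port.
r = redis.Redis(host="localhost", port=6379, db=0)

seed_urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

for url in seed_urls:
    # The key must match the spider's redis_key (e.g. "example:start_urls").
    r.lpush("example:start_urls", url)

print(f"Queued {len(seed_urls)} URLs")
```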
Project structure:

```
distributed-scrapy-scraping/
├── amazon_distribution/
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   ├── spiders/
│   │   ├── __init__.py
│   │   └── amazon_spider.py
├── scrapy-redis/
│   ├── __init__.py
│   ├── connection.py
│   ├── defaults.py
│   ├── dupefilter.py
│   ├── picklecompat.py
│   ├── pipeline.py
│   ├── queue.py
│   ├── scheduler.py
├── .gitignore
├── add_urls.py
├── celery_app.py
├── db.py
├── models.py
├── requirements.txt
├── run_spider.py
├── scrapy.cfg
└── tasks.py
```
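To give a sense of how the Celery and subprocess pieces above might fit together, here is a hedged sketch of a Redis-backed Celery app (roughly what `celery_app.py` could contain) and a task that launches a spider in its own process (roughly what `tasks.py` could contain). The names, arguments, and wiring are assumptions, not the repository's actual code:

```python
# celery_app.py-style sketch: a Celery app using Redis as broker and result backend.
from celery import Celery

app = Celery(
    "distributed_scraping",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)
```

```python
# tasks.py-style sketch: run a Scrapy spider from a Celery task via subprocess.
import subprocess

from celery_app import app


@app.task
def run_spider(spider_name="example"):
    # Shell out to the Scrapy CLI so each task crawls in an isolated process.
    result = subprocess.run(
        ["scrapy", "crawl", spider_name],
        capture_output=True,
        text=True,
    )
    return result.returncode
```

A worker started with `celery -A tasks worker --loglevel=info` would then pick up `run_spider` calls queued from any node in the cluster.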
Contributions are welcome! Please fork the repository and submit pull requests.