Cluster project to crawl the BitTorrent DHT network and download torrent file metadata from remote BitTorrent clients.
Run a number of p2pspider instance hosts (crawlers) to gather DHT infohashes and download their metadata, then send the results to a Redis-backed Celery instance (collector), which verifies the torrent files and stores them to disk.
- Python 3.5 (and pip)
- Node.js >=0.12.0 (and npm)
- Redis Server
- aws cli (for docker image pushing)
- aria2 (for torrent indexing script)
- DEVELOPMENT
First, install the Redis daemon and start it. Then start the Celery worker which will connect to the Redis instance on localhost:
$ cd collector
$ pyvenv venv
$ source venv/bin/activate
(venv) $ pip3 install -r requirements.txt
(venv) $ ./tasks.py
Collected torrent files will be saved as torrents/[INFOHASH].torrent
Then, in a second terminal, install the crawler's dependencies and start it:
$ cd crawler
$ npm install
$ npm start
This will start the crawler with the default broker redis://localhost/0. To use another one, e.g. a remote host, set the BROKER environment variable:
$ BROKER=redis://[HOSTNAME]/0 npm start
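If the crawler cannot reach the broker, a quick connectivity check (assuming redis-cli is installed on the crawler host) is:
$ redis-cli -h [HOSTNAME] -p 6379 ping
PONG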
- DEPLOYMENT WITH DOCKER
Because the programs within the containers run with different uids, you may need to make the torrents target directory writable beforehand:
$ chmod a+w torrents
If you intend to crawl for a while, you may run out of inodes because of the many small torrent files. It's a good idea to (loop) mount a filesystem with a higher inode count than the default. The news usage type creates one inode for every 4k block, which should be plenty.
$ dd if=/dev/zero of=torrents.img bs=1M count=1024
$ mkfs.ext4 -T news torrents.img
# mount torrents.img torrents
# chmod -R a+w torrents
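Since inodes are the scarce resource here, it is worth checking the remaining inode count of the mounted filesystem from time to time with df -i:
$ df -i torrents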
Building the images and starting the cluster locally (one crawler, one collector instance):
$ docker-compose build
$ docker-compose up
To start the optional Celery Flower web monitor, issue docker ps
to find the ID of the running collector container and then issue:
$ docker exec [CONTAINER-ID] supervisorctl start celery-flower
Afterwards it should be reachable via localhost:5555.
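To quickly check that Flower is actually serving (assuming curl is available on the host), print the HTTP status code of the dashboard:
$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:5555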
For the full ECS developer guide, see the AWS documentation. These are the basic steps for setting it up:
In the IAM console:
- Create an ECS instance role with the AmazonEC2ContainerServiceforEC2Role policy attached
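The same role can also be created with the aws cli. The following is only a sketch: the role/profile name ecsInstanceRole and the trust policy file ec2-trust-policy.json (which must allow ec2.amazonaws.com to assume the role) are placeholders you have to supply yourself.
$ aws iam create-role --role-name ecsInstanceRole --assume-role-policy-document file://ec2-trust-policy.json
$ aws iam attach-role-policy --role-name ecsInstanceRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
$ aws iam create-instance-profile --instance-profile-name ecsInstanceRole
$ aws iam add-role-to-instance-profile --instance-profile-name ecsInstanceRole --role-name ecsInstanceRole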
In the ECS console:
- Create two container registry repositories: collector and crawler
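The repositories can also be created with the aws cli (the repository names below are the ones this project uses):
$ aws ecr create-repository --repository-name collector
$ aws ecr create-repository --repository-name crawler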
- Build, tag and push the two Docker images
Configure the aws cli by running aws configure. Then run either the aws/push-containers.sh script or follow the push instructions manually (given on the EC2 Container Registry site).
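For orientation, a manual push roughly amounts to the following sketch (this is not the exact contents of aws/push-containers.sh): [LOCAL-IMAGE] stands for whatever tag docker-compose build produced for the collector image, and depending on your aws cli version the login step may need the --no-include-email flag (or aws ecr get-login-password with cli v2). Repeat the tag/push steps for the crawler image.
$ $(aws ecr get-login)
$ docker tag [LOCAL-IMAGE] [REGISTRY-URI]/collector:latest
$ docker push [REGISTRY-URI]/collector:latest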
- Define task definitions
Create a collector and a crawler task using the supplied JSON task definitions. Be sure to replace the [REGISTRY-URI] field with your own ECS repository URI, and replace the [COLLECTOR] field with the private IP/hostname of the EC2 instance where the collector container runs.
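The task definitions can be registered from the console or with the aws cli; the JSON file names below are assumptions, adjust them to the actual files in the aws/ directory:
$ aws ecs register-task-definition --cli-input-json file://aws/collector-task.json
$ aws ecs register-task-definition --cli-input-json file://aws/crawler-task.json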
- Define cluster
Create a service for each task definition. Set the Number of tasks to 0 for now.
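A cli sketch of the same step; the cluster name is a placeholder and the service names simply mirror the task definitions:
$ aws ecs create-cluster --cluster-name [CLUSTER]
$ aws ecs create-service --cluster [CLUSTER] --service-name collector --task-definition collector --desired-count 0
$ aws ecs create-service --cluster [CLUSTER] --service-name crawler --task-definition crawler --desired-count 0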
- Select an Amazon ECS-optimized AMI from the Community AMIs
- Set the created ECS instance role as IAM role for the instance
- Insert the setup script aws/setup-ec2-instance.sh as user data
- Use a security group that allows port 6881 UDP/TCP inbound from the internet (and the Redis port 6379 TCP between the instances)
- You will need to create and run [CrawlerCount]+1 EC2 instances
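Launching the instances from the cli might look roughly like the following sketch; the AMI ID, instance type, key name, security group ID and instance profile name are all placeholders for values from your own account:
$ aws ec2 run-instances --image-id [ECS-AMI-ID] --count [CrawlerCount+1] --instance-type t2.medium \
    --iam-instance-profile Name=[ECS-INSTANCE-PROFILE] --key-name [KEY-NAME] \
    --security-group-ids [SG-ID] --user-data file://aws/setup-ec2-instance.sh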
- Collector service: Set Number of tasks to 1.
- ECS should find a suitable EC2 instance and start a collector container. Check that a collector container has started, note the private IP of the collector EC2 instance and update the crawler task definition so that the BROKER env variable points to the correct collector hostname/IP (this manual step could be eliminated by using a proper ECS service discovery solution).
- Crawler service: Set Number of tasks to some number in [1,*].
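Later scaling of the services can also be done from the cli, e.g. to run four crawler tasks (cluster name as above):
$ aws ecs update-service --cluster [CLUSTER] --service crawler --desired-count 4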
- LOGS
To see the collector status, connect to the collector host, issue docker ps
to find the ID of the running collector container and then issue:
$ docker logs -f [CONTAINER-ID]
To see only the stats, use the following. They are printed every minute and show the cumulative values of the last 10 minutes.
$ docker logs -f [CONTAINER-ID] 2>&1 | grep print_stats
For a typical session, the stats may look like this:
Received task: tasks.print_stats[17e03d9a-23b9-4be5-86c4-ebc020f672f4]
tasks.print_stats: last 10.0 mins: 1800 torrents / 51.2 MB saved
tasks.print_stats: Rate: 180.0/m, 10799.5/h, 259.2k/d
tasks.print_stats: Storage consumption rate: 5.1 MB/m, 0.3 GB/h, 7.2 GB/d
Task tasks.print_stats[17e03d9a-23b9-4be5-86c4-ebc020f672f4] succeeded in 0.04759194100006425s: {'saved': 1800, 'size': 53672251, 'save_rate': 2.999862196330241, 'size_rate': 89449.64264824889, 'running': 600.027562, 'start': 1482410075.418662}