Crawling@Home GPU controlled Hetzner Cloud swarm of scrapers

Help us build a billion-scale image-caption dataset by filtering Common Crawl with OpenAI CLIP. At the time of this writing we are up to 5 billion high-quality pairs ready for training various models, but we still need your help to reach the potential 6 billion quality pairs estimated to exist in the Common Crawl data. This dataset is intended for public use and aims at truly open access to AI for everyone!

Concept

This image-text scraping task has specific characteristics: the link lists might be old, images might no longer be online, and even entire domains might be missing. Also, multiple links rarely point to the same domain, so DNS queries are numerous and frequent. Finally, after the actual scraping there is a computationally intensive task of calculating similarities between the images and their captions.

On a normal CPU machine, scraping and filtering take almost the same time. On a GPU, however, filtering is much faster, on the order of 60x faster than on a single CPU.

Hence this concept for Crawling@Home, where we created a data pipeline with 3 levels:

  1. Common Crawl preprocessing, where we use a swarm of about 500 CPUs to download and parse the data and send the results to a database node as candidates for our dataset, meaning image URLs with alt text plus the language detected using gcld3. Based on the language detection we split the candidates into English, Multilanguage (non-English) and Nolang (language not detected with confidence) categories.
  2. image downloading and inspection, prefiltering by image type and resolution, and producing further candidates for CLIP or mCLIP inference
  3. CLIP-style inference, where we calculate the similarity between image embeddings and their text embeddings and retain only the pairs whose similarity exceeds a manually set threshold

Common Crawl jobs are coordinated by a tracker with a dashboard at http://cah.io.community/

Cloud workers

We used AWS workers for the first level of the above pipeline, Hetzner and Alibaba workers for the second level, and home GPUs plus AWS GPU nodes for the third level.

The code therefore evolved into:

  1. Hetzner swarm control: use infrastructure.py to control the swarm at Hetzner Cloud via commands like python3 infrastructure.py up 20 fsn1, where up means bring up the swarm, 20 is the desired number of nodes, and fsn1 is the desired datacenter location.
  2. Alibaba swarm control: due to cost restrictions we used Simple Application Servers at Alibaba and developed a limited-scope control script.
  3. CPU clients: a) ccpp.py is used to preprocess Common Crawl WAT files; b) dbdl.py is used to download images. In both cases nodes require a minimum of one CPU core and 1GB of RAM per core.
  4. GPU clients consume at most 3.5GB of GPU VRAM, so any Nvidia GPU card with 4GB of VRAM or more is deemed compatible: run python3 gpu_inference.py from any Linux-based PC with an Nvidia GPU and the correct drivers installed. Example invocations of these scripts are gathered below.
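
For reference, a minimal sketch of how the scripts above are typically invoked; the exact arguments accepted by ccpp.py and dbdl.py are not documented here, so check each script's help or source before running them.

```bash
python3 infrastructure.py up 20 fsn1   # bring up a 20-node swarm in the fsn1 datacenter
python3 infrastructure.py down         # tear the swarm down again
python3 ccpp.py                        # CPU client: preprocess Common Crawl WAT files
python3 dbdl.py                        # CPU client: download and prefilter images
python3 gpu_inference.py               # GPU client: CLIP inference and similarity filtering
```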

If you want to install on your own box, follow the steps below.

Prerequisites

  1. Ubuntu box with a 4GB+ Nvidia GPU
  2. Nvidia driver installed
  3. CUDA toolkit 11+ (the corresponding cuDNN is also recommended for future use)
  4. check the driver installation with the nvidia-smi command
  5. your user must be able to run sudo commands
  6. install the python3-pip and git packages (see the example commands after this list)
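
A minimal verification-and-install sequence on Ubuntu might look like this (package names assume the standard Ubuntu repositories):

```bash
# verify that the Nvidia driver is installed and working
nvidia-smi

# install the packages required by the setup scripts
sudo apt update
sudo apt install -y python3-pip git
```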

Distributed infrastructure setup and run

  1. Make an account at Hetzner Cloud (https://www.hetzner.com/) and issue an API token
  2. Create the .env file and paste your HCLOUD API key into it. Optionally, if you have more than one account, paste all API keys, each on a separate line
  3. Bring up the infrastructure at any time with python3 infrastructure.py up N in order to raise N nodes. The script will scan all API keys and create the maximum number of available servers on each account until the limit of N is met (see the example session after this list)
  4. Tear down the infrastructure at any time with python3 infrastructure.py down in order to shut everything down (and save cash). This will shut down all cloud servers that belong to all API tokens saved in the .env file. Be aware: this command will delete ALL servers in the accounts, even those NOT related to this project!
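
An example session might look like the following; the token value is a placeholder, and the datacenter location follows the up 20 fsn1 pattern shown earlier:

```bash
# store the Hetzner Cloud API token(s), one per line
echo "YOUR_HCLOUD_API_TOKEN" > .env

# raise a 20-node swarm in the fsn1 datacenter
python3 infrastructure.py up 20 fsn1

# tear everything down when done (affects ALL servers on the listed accounts!)
python3 infrastructure.py down
```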

If you wish to SSH into any droplet you can use this command: ssh -oStrictHostKeyChecking=no -oIdentitiesOnly=yes -i~/.ssh/id_cah crawl@<<droplet_ip>>. The crawling script runs as a service; check its logs with tail -f crawl.log and control it with sudo systemctl start|stop|restart crawl. These commands are gathered below for convenience.
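
The droplet debugging commands in one place (replace <<droplet_ip>> with the node's IP address):

```bash
# log into a swarm node with the project-generated key
ssh -oStrictHostKeyChecking=no -oIdentitiesOnly=yes -i ~/.ssh/id_cah crawl@<<droplet_ip>>

# follow the crawling service log
tail -f crawl.log

# control the crawling service
sudo systemctl restart crawl
```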

If you are asked for any droplet root password at any time, it means you need to rerun git pull and source conda-setup.sh to refresh the files and regenerate the SSH key pair.

How to run GPU node from home computer

  1. Run git clone https://github.com/rvencu/crawlingathome-gpu-hcloud to download the Crawling@Home GPU node scripts
  2. Run cd crawlingathome-gpu-hcloud to enter the newly created directory
  3. Run source conda-setup.sh to set up the environment if you use Anaconda; otherwise use source pip-setup.sh. The script will ask for a nickname to be used on the leaderboard as well as for the sudo password
  4. Run python3 gpu_inference.py. The script runs in a loop that can be interrupted at any time with Ctrl-C. (The full command sequence is shown after this list.)
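
The complete sequence, assuming the Anaconda variant of the setup script:

```bash
# download the GPU node scripts and enter the repository
git clone https://github.com/rvencu/crawlingathome-gpu-hcloud
cd crawlingathome-gpu-hcloud

# set up the environment (asks for a leaderboard nickname and the sudo password);
# use pip-setup.sh instead if you do not use Anaconda
source conda-setup.sh

# start the inference loop; stop it at any time with Ctrl-C
python3 gpu_inference.py
```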

This work is based on code written by:

This is a subproject run by the community around https://github.com/lucidrains/DALLE-pytorch