This is a rewrite of my public proxy farm. It uses Redis to record reliability statistics for publicly available proxy servers.
Once enough data has been collected, it builds up a database of proxy servers and can choose the most reliable proxy to use for crawling a given website.
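To make the idea concrete, here is a minimal sketch of how per-proxy reliability stats could be tracked in Redis and used to pick the best candidate. The key names, fields, and scoring below are illustrative assumptions, not the actual schema this project uses.

```python
# Minimal sketch (not this repo's actual schema): track per-proxy
# success/failure counts in Redis hashes and rank proxies by success rate.
import redis

# Assumes a Redis instance reachable on localhost:6379.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_result(address, success):
    """Increment the success or failure counter for a proxy."""
    field = "successes" if success else "failures"
    r.hincrby(f"proxy:{address}", field, 1)
    r.sadd("proxies:known", address)

def reliability(address):
    """Fraction of recorded attempts that succeeded (0.0 if untested)."""
    stats = r.hgetall(f"proxy:{address}")
    ok = int(stats.get("successes", 0))
    bad = int(stats.get("failures", 0))
    return ok / (ok + bad) if ok + bad else 0.0

def most_reliable_proxy():
    """Pick the known proxy with the highest observed success rate."""
    proxies = r.smembers("proxies:known")
    return max(proxies, key=reliability) if proxies else None
```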
Requirements:

- docker
- docker-compose
When web crawling, proxies are essential for maintaining anonymity and circumventing bot detection. There are a number of free public proxy servers spread across the Internet, but their performance is inconsistent. This project uses a Redis store as a short-lived cache of proxy server information consumed by a Scrapy middleware. The cache is periodically synced to a Postgres database, which serves as a more permanent and practical home for the proxy statistics.
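For a rough picture of how a Redis-backed cache can plug into Scrapy, here is a simplified downloader middleware sketch. The class name, Redis keys, and host name are hypothetical; the real middleware in this repo is more involved.

```python
# Simplified sketch of a Redis-backed proxy middleware for Scrapy.
# The class name and Redis keys are hypothetical, not taken from this repo.
import random
import redis

class RedisProxyMiddleware:
    def __init__(self):
        # Assumes a docker-compose service named "redis".
        self.r = redis.Redis(host="redis", port=6379, decode_responses=True)

    def process_request(self, request, spider):
        # Attach a cached proxy to the outgoing request. Selection by
        # reliability score would slot in here; random choice keeps it short.
        proxies = list(self.r.smembers("proxies:known"))
        if proxies:
            request.meta["proxy"] = f"http://{random.choice(proxies)}"

    def process_response(self, request, response, spider):
        # Feed the outcome back into the per-proxy stats in Redis.
        proxy = request.meta.get("proxy", "")
        if proxy:
            field = "successes" if response.status == 200 else "failures"
            address = proxy.replace("http://", "", 1)
            self.r.hincrby(f"proxy:{address}", field, 1)
        return response
```

A middleware like this would be enabled through Scrapy's DOWNLOADER_MIDDLEWARES setting.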
I'm still working on this, but here's how to run it:
```bash
git clone https://github.com/dchrostowski/autoproxy.git
cd autoproxy
docker-compose build scrapyd
docker-compose build spider_scheduler
docker-compose up scrapyd spider_scheduler
```
There are several spiders (see autoproxy/autoproxy/spiders) scheduled to crawl proxy listing sites, continually pulling in new proxies and then testing those proxies against the sites they've scraped.
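Purely for illustration, a proxy-harvesting spider looks roughly like the sketch below; the URL, CSS selectors, and item fields are invented rather than copied from the spiders in this repo.

```python
# Illustrative only: a tiny Scrapy spider that harvests proxy addresses
# from a listing page. The URL and selectors here are made up.
import scrapy

class ExampleProxyListSpider(scrapy.Spider):
    name = "example_proxy_list"
    start_urls = ["https://example.com/free-proxy-list"]

    def parse(self, response):
        # Assume each table row holds an IP column and a port column.
        for row in response.css("table tbody tr"):
            ip = row.css("td:nth-child(1)::text").get()
            port = row.css("td:nth-child(2)::text").get()
            if ip and port:
                yield {"address": f"{ip}:{port}"}
```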
To access the Postgres database, you can run the following:
```bash
docker exec -it autoproxy_db psql -U postgres proxies
```
The default password is `somepassword`.
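If you'd rather poke at the synced statistics from Python, something along these lines should work, assuming the compose file maps Postgres to localhost:5432 (adjust the host and port if your setup differs). The snippet only lists the public tables, since the schema isn't documented here; explore from there.

```python
# Hedged example: inspect the synced proxy statistics with psycopg2.
# Assumes docker-compose exposes Postgres on localhost:5432.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="proxies",
    user="postgres",
    password="somepassword",
)

with conn, conn.cursor() as cur:
    # List the public tables first rather than guessing table names.
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public';"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)

conn.close()
```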
Eventually I'm planning to publish the contents of autoproxy_package/ as a standalone package.