proxyfarm
is a Node.js script that scrapes proxy lists from websites without depending on their underlying HTML structure. This allows proxy lists to be harvested from a large number of sources without implementing custom scraping logic for each one. It does this by using a PhantomJS driver together with the JavaScript Selection API to select the page content and read it back as plain text. This strips away all HTML tags and makes regex matching trivial. The resulting proxy lists can be used with tools like scrapy-proxies to bypass IP restrictions and improve web crawling speed.
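To illustrate why plain text makes matching easy, here is a minimal sketch of the regex step. It is not taken from the proxyfarm source: the function name `extractProxies` and the exact pattern are assumptions, and it only handles IPv4 addresses written as `host:port`.

```javascript
// Hypothetical helper (not part of proxyfarm): extract host:port pairs
// from text that has already had its HTML tags stripped.
function extractProxies(text) {
  // Four dot-separated groups of 1-3 digits, a colon, then a 1-5 digit port.
  const pattern = /\b((?:\d{1,3}\.){3}\d{1,3}):(\d{1,5})\b/g;
  const seen = new Set();
  let match;
  while ((match = pattern.exec(text)) !== null) {
    const host = match[1];
    const port = Number(match[2]);
    // Reject matches that look like addresses but are out of range.
    const octetsOk = host.split(".").every((octet) => Number(octet) <= 255);
    const portOk = port >= 1 && port <= 65535;
    if (octetsOk && portOk) {
      seen.add(host + ":" + port); // the Set drops duplicates
    }
  }
  return Array.from(seen);
}
```

Pages that list the address and port in separate table columns would need a looser separator than the single colon assumed here.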
Simply clone the repository, run `npm install`, and then run `node --harmony proxyfarm --in sources.txt --out proxies.txt`.
NPM module coming soon!
Parameter | Description |
---|---|
in | A text file of newline-delimited URLs to scrape proxies from. See defaults/sources.txt for an example. |
out | The path to save the scraped proxy list to, in the format <host>:<port> |
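As a rough sketch of what the two file formats imply, the helpers below parse a newline-delimited sources file and serialize proxies as `<host>:<port>`. The names `parseSources` and `formatProxies` are hypothetical, and writing one entry per line in the output is an assumption rather than documented behavior.

```javascript
// Hypothetical helper: turn the contents of the --in file into a list of URLs.
function parseSources(contents) {
  return contents
    .split(/\r?\n/)                 // accept both Unix and Windows line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0); // skip blank lines
}

// Hypothetical helper: serialize proxies for the --out file,
// assuming one <host>:<port> entry per line.
function formatProxies(proxies) {
  return proxies.map((p) => p.host + ":" + p.port).join("\n") + "\n";
}
```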
- Node.js v6.x or later
Coming soon!
There are many ways that you can contribute:
- Improving documentation - Submit a pull request with the fixes.
- Requesting a feature - Create a new issue describing the feature.
- Suggesting a proxy list source - Create a new issue mentioning the new source.
- Reporting a bug - Found a problem? Create an issue with your environment, a screenshot of the error, and reproduction steps.
- Fixing a bug - All help is appreciated!
- Validating the scraped proxy list
- Detecting anonymity, speed, and country of the proxy list
- Automatic crawling of websites rather than manually specifying every proxy list
- Handling of AJAX pages
This project is licensed under the MIT License. See the LICENSE.md file for details.