Skip to content

Command line utility that intelligently scrapes proxy lists from various sources.

License

Notifications You must be signed in to change notification settings

clemmy/proxyfarm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Proxy Farm

proxyfarm is a node script that scrapes proxy lists from websites without caring for its underlying HTML structure. This allows proxy lists to be easily harvested from a large amount of sources, without implementing custom scraping logic for each source. It does this via using a PhantomJS driver along with the Javascript Selection API. This strips away all HTML tags and makes regex matching trivial. Proxy lists can be used with things like scrapy-proxies in order to bypass IP restrictions and improve web crawling speed.

demo

Getting Started

Simply clone the repository, run npm install, and node --harmony proxyfarm --in sources.txt --out proxies.txt

NPM module coming soon!

Arguments

Parameter Description
in A text file with line delimited urls to scrape proxies from. See defaults/sources.txt for an example.
out The path to save the scraped proxy list to, in the format <host>:<port>

Prerequisites

  • Node.js v6.x and later

Running the tests

Coming soon!

Contributing

There are many ways that you can contribute:

  • Improving documentation - Submit a pull request with the fixes.
  • Requesting a feature - Simply create a new issue with the said feature.
  • Suggesting a proxy list source - Create a new issue mentioning the new source.
  • Report a bug - Find a problem? Create an issue with your environment, screenshot of the error, and reproduction steps.
  • Fix a bug - All help appreciated!

Future Roadmap

  • Validating the scraped proxy list
  • Detecting anonymity, speed, and country of the proxy list
  • Automatic crawling of websites rather than manually specifying all proxy lists
  • Handling of ajax pages

License

This project is licensed under the MIT License - see the LICENSE.md file for details

About

Command line utility that intelligently scrapes proxy lists from various sources.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published