The Ion Cannon for scraping, with proxies, robust logic control, parallelization, and a Sequelize data model.
Use with caution. You can spawn 10 or 100,000 scraping instances by changing a single parameter, but please be considerate and don't DDoS people.
- fork and clone this repo
- install dependencies with `npm run setup` (this runs `npm i` too)
- set up `config/` for your MySQL db, and run `npm i -g sequelize-cli`
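As a rough sketch of the database setup step, a Sequelize connection config for MySQL could look like the following. The file name `config/config.js` and every value are placeholders assumed for illustration; check the files actually shipped under `config/` for the layout this repo expects.

```javascript
// Hypothetical config/config.js: the file name and all values are placeholders.
// username, password, database, host and dialect are the per-environment
// keys that sequelize-cli reads from its config.
const config = {
  development: {
    username: 'root',
    password: 'your-password',
    database: 'scraper_dev',
    host: '127.0.0.1',
    dialect: 'mysql',
  },
};

module.exports = config;
```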
Scraping jobs are organized as projects, all built on the same `projects/base-project.js` framework and its logic. This lets you specify only the minimal project-specific details under `projects/`, and run.
An example project, `projects/proxy.js`, is already included. Written in 70 lines, it shows how quickly a project can be set up with this tool. The `proxy` project is also used internally for automatic proxy scraping, which gets the list of proxies for your other projects.
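To illustrate how another project might consume the scraped proxy list, here is a minimal sketch. The `pickProxies` helper and the sample rows are hypothetical and not part of this repo; the rows only borrow the `ip`, `country`, `speed`, and `usable` field names from the ProxyData attributes in this README, and `speed` is assumed to be a latency where lower is better.

```javascript
// Hypothetical helper, not part of this repo's API.
// Returns up to `limit` IPs of usable proxies, fastest first.
function pickProxies(rows, limit) {
  return rows
    .filter((row) => row.usable)        // keep only proxies marked usable
    .sort((a, b) => a.speed - b.speed)  // assumption: lower speed value = faster
    .slice(0, limit)
    .map((row) => row.ip);
}

// Made-up rows shaped like ProxyData records.
const sampleRows = [
  { ip: '10.0.0.1', country: 'US', speed: 300, usable: true },
  { ip: '10.0.0.2', country: 'DE', speed: 120, usable: true },
  { ip: '10.0.0.3', country: 'FR', speed: 80, usable: false },
];

console.log(pickProxies(sampleRows, 2)); // [ '10.0.0.2', '10.0.0.1' ]
```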
Specify the projects to run in `index.js`, then run `npm start`.
First, set up your project. We will use the core `Proxy` project as an example.
- create a new project in `projects/proxy.js`.
- create new db models for the `ProxyTarget` and `ProxyData` from the terminal. Edit the `db/migrations` or `db/models` as needed:

      # for scraping target
      sequelize model:create --name ProxyTarget --attributes "url:string success:boolean freq:integer"
      # for scraping data storage
      sequelize model:create --name ProxyData --attributes "url:string ip:string country:string speed:integer anonimity:string usable:boolean"
- run db migration: `npm run setup`
- set up the project `projects/proxy.js`:
  - import `Project, ProxyTarget, ProxyData`
  - implement the project `spec` with the example keys
  - implement the `scrape` method for scraping, data parsing, and insertion logic
  - construct a new project instance: `const project = new Project(spec, ProxyTarget, ProxyData, scrape)`
  - export the project
- import the project in `index.js` and specify the steps to run; the options are `resetTarget, resetData, reset, run, stop`. Refer to `projects/base-projects.js` for details on these methods.
- run `npm start`
Sample log on running:
[Thu Dec 29 2016 13:36:34 GMT-0500 (EST)] WARN Clearing only Target DB for Project: Proxy
[Thu Dec 29 2016 13:36:34 GMT-0500 (EST)] INFO Project Target rows: 0
[Thu Dec 29 2016 13:36:34 GMT-0500 (EST)] INFO Initialize project: Proxy
[Thu Dec 29 2016 13:36:34 GMT-0500 (EST)] INFO Start project
[Thu Dec 29 2016 13:36:34 GMT-0500 (EST)] INFO Spawn a scraper instance for target: https://incloak.com/proxy-list/
[Thu Dec 29 2016 13:36:34 GMT-0500 (EST)] INFO Spawn a scraper instance for target: https://incloak.com/proxy-list/
[Thu Dec 29 2016 13:36:40 GMT-0500 (EST)] INFO Data scraping successful with 74 rows for target: https://incloak.com/proxy-list/
[Thu Dec 29 2016 13:36:40 GMT-0500 (EST)] INFO Data scraping successful with 74 rows for target: https://incloak.com/proxy-list/
[Thu Dec 29 2016 13:36:40 GMT-0500 (EST)] INFO Project report:
[Thu Dec 29 2016 13:36:40 GMT-0500 (EST)] INFO Total Project Data rows: 186
[Thu Dec 29 2016 13:36:40 GMT-0500 (EST)] INFO Total Project Target hit: 1
[Thu Dec 29 2016 13:36:40 GMT-0500 (EST)] INFO Total Project Target remain: 0
[Thu Dec 29 2016 13:36:40 GMT-0500 (EST)] INFO Stop project
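The setup steps above can be sketched end to end as follows. This is only an illustration of the wiring: the `Project` class here is a stand-in stub that records which steps were called, the spec key `name` and the `scrape` signature are assumptions, and the two model objects are placeholders for the real Sequelize models. Only the constructor argument order and the step names come from this README; the real methods are likely asynchronous.

```javascript
// Stand-in for the class provided by the base project framework.
// It only records calls so the wiring can be run in isolation.
class Project {
  constructor(spec, Target, Data, scrape) {
    Object.assign(this, { spec, Target, Data, scrape, steps: [] });
  }
  resetTarget() { this.steps.push('resetTarget'); return this; }
  run() { this.steps.push('run'); return this; }
  stop() { this.steps.push('stop'); return this; }
}

// Placeholders for the Sequelize models created with sequelize-cli.
const ProxyTarget = { name: 'ProxyTarget' };
const ProxyData = { name: 'ProxyData' };

// Hypothetical spec key; use the keys from the bundled example project.
const spec = { name: 'Proxy' };

// Hypothetical scrape signature: takes a target row, returns parsed data rows.
function scrape(target) {
  return [{ url: target.url, ip: '10.0.0.1', usable: true }];
}

const project = new Project(spec, ProxyTarget, ProxyData, scrape);

// Roughly what index.js might do with the exported project.
project.resetTarget().run().stop();
console.log(project.steps); // [ 'resetTarget', 'run', 'stop' ]
```

The step order mirrors the sample log: the target table is cleared, the project runs, then it stops.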
npm run lint
- scrape Flyertalk forum as first target