rbspider is a web crawler built on spidr. Here are some of its advantages:
- easy to customize your own parser
- easy to configure your crawl jobs
- easy to save your data, since rbspider takes advantage of ActiveRecord
rbspider is not just a spider that crawls links; it is a mini framework for extracting content from specific websites. It consists of a crawl engine, a parse engine, and a database engine.
Crawl engine
Our crawling is built entirely on Spidr, a versatile Ruby web spidering library that can spider a single site, multiple domains, or a given set of links, and can even crawl indefinitely. Spidr is designed to be fast and easy to use.
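For orientation, this is what bare Spidr usage looks like on its own (rbspider drives Spidr for you, so this is illustration, not rbspider's API):

    require 'spidr'

    # Crawl every page of a single site and print each visited URL.
    Spidr.site('http://example.com/') do |agent|
      agent.every_page do |page|
        puts page.url
      end
    end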
Parse engine
We use Nokogiri as the content parsing engine, so most of the work when starting with rbspider is implementing a parser for your specific website.
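To give a feel for the parsing work involved, here is plain Nokogiri pulling a title and some links out of an HTML string; the selectors and variable names are made up for illustration:

    require 'nokogiri'

    html  = '<html><head><title>Demo</title></head>' \
            '<body><a class="post" href="/a">A</a></body></html>'
    doc   = Nokogiri::HTML(html)
    title = doc.at_css('title')&.text                # first matching node
    links = doc.css('a.post').map { |a| a['href'] }  # all matching nodes
    puts title, links.inspect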
Database engine
We use ActiveRecord to persist and manipulate the crawled data.
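As a rough sketch of the kind of setup involved (the table and model names here are illustrative, not rbspider's actual schema):

    require 'active_record'

    # Connect to a local SQLite file, create a table, and save a row.
    ActiveRecord::Base.establish_connection(adapter: 'sqlite3',
                                            database: 'rbspider.db')

    ActiveRecord::Base.connection.create_table(:pages, force: true) do |t|
      t.string :title
      t.string :url
    end

    class Page < ActiveRecord::Base; end

    Page.create(title: 'Hello', url: 'http://example.com/hello')
    puts Page.count  # => 1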
We have already implemented parsers for some websites, so you can just type:
$ ruby run.rb -c config.json
If you want to crawl a new website, implement a parser class (in Ruby) that inherits from ParserEngine and a result class that inherits from ParsersResult (see the sketch at the end of this section), then update config.json. That is it; just type:
$ ruby run.rb -c config.json
All crawled data will be saved to a SQLite database file.
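Here is a sketch of the two classes described above. The `parse` hook and the result attributes are assumptions for illustration only; check the actual ParserEngine and ParsersResult base classes for the real method names before copying this:

    require 'nokogiri'

    class MySiteResult < ParsersResult
    end

    class MySiteParser < ParserEngine
      # Hypothetical hook: turn one fetched page into a result record.
      def parse(page)
        doc = Nokogiri::HTML(page.body)
        MySiteResult.new(title: doc.at_css('h1')&.text,
                         url:   page.url.to_s)
      end
    end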