GitHub - bkeepers/spiderman: your friendly neighborhood web crawler

your friendly neighborhood web crawler

Spiderman is a Ruby gem for crawling and processing web pages.

Installation

Add this line to your application's Gemfile:

gem 'spiderman'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install spiderman

Usage

class HackerNewsCrawler
  include Spiderman

  # Fetch the front page and queue every story link for processing.
  crawl "https://news.ycombinator.com/" do |response|
    response.css('a.storylink').each do |a|
      process! a["href"], :story
    end
  end

  # Runs once for each URL queued with `process! url, :story`.
  process :story do |response|
    logger.info "#{response.uri} #{response.css('title').text}"
    save_page(response)
  end

  def save_page(page)
    # logic here for saving the page
  end
end

Run the crawler:

HackerNewsCrawler.crawl!
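
The `save_page` method above is left as a placeholder. As a minimal sketch of one way to fill it in, the helper below writes HTML to disk, one file per URL. `PageStore`, its method names, and the `pages` directory are hypothetical and not part of Spiderman; the helper takes the URL and HTML as plain arguments because how the HTML is read from the response object is not described here.

```ruby
require "digest"
require "fileutils"

# Hypothetical helper, not part of Spiderman: stores raw HTML on disk,
# one file per URL, named by the SHA-256 digest of the URL.
module PageStore
  # Builds a stable file path for a URL inside the given directory.
  def self.path_for(uri, dir)
    File.join(dir, "#{Digest::SHA256.hexdigest(uri.to_s)}.html")
  end

  # Writes the HTML to disk and returns the path that was written.
  def self.save(uri, html, dir: "pages")
    FileUtils.mkdir_p(dir)
    path = path_for(uri, dir)
    File.write(path, html)
    path
  end
end
```

Hashing the URL keeps file names free of slashes and query strings, and saving the same URL twice overwrites the earlier copy instead of creating a duplicate.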

ActiveJob

Spiderman works with ActiveJob out of the box. If your crawler class inherits from ActiveJob::Base, requests are made in your background worker, and each request runs as a separate job.

class MyCrawler < ActiveJob::Base
  include Spiderman
  queue_as :crawler

  crawl "https://example.com" do |response|
    response.css('a').each {|a| process! a["href"], :link }
  end

  process :link do |response|
    logger.info "Processing #{response.uri}"
  end
end
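
The example above queues every `href` on the page, which on real sites can include relative paths, `mailto:` links, and in-page anchors. Whether Spiderman resolves or filters these itself is not stated here, so the following is only a sketch of a filter you could apply first, using Ruby's standard `uri` library. The `crawlable_url` name is hypothetical and not part of Spiderman.

```ruby
require "uri"

# Hypothetical helper, not part of Spiderman: resolves an href against the
# page it was found on and returns an absolute http(s) URL, or nil if the
# link is empty, malformed, or uses another scheme (mailto:, javascript:, ...).
def crawlable_url(base, href)
  return nil if href.nil? || href.strip.empty?

  url = URI.join(base, href.strip)
  return nil unless url.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP

  url.fragment = nil # drop "#section" so the same page is not queued twice
  url.to_s
rescue URI::Error
  nil
end
```

Inside a `crawl` block this could be used as `url = crawlable_url(response.uri, a["href"])` followed by `process! url, :link if url`, so that only absolute web URLs are queued.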

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/bkeepers/spiderman.

License

The gem is available as open source under the terms of the MIT License.
