arunKumarNOP/imdb_crawler

Overview

A Scrapy project written in Python that crawls the IMDb website and builds a database of movie recommendations, which can then be used for analysis and other purposes.
The project also illustrates a few basic Scrapy techniques:

  • Implementing a User-Agent switcher for a Scrapy crawler (using a middleware)
  • Saving crawled data in a database, here SQLite (using a pipeline)
  • Using different Item classes for different types of data
  • Implementing a pause and resume feature through serialization and deserialization
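
The first two points can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the class names, the User-Agent strings, and the `movies` table with its `movie_id` and `title` columns are all assumptions made for the example. Only the hook names (`process_request`, `open_spider`, `process_item`, `close_spider`) are the ones Scrapy calls on middlewares and pipelines.

```python
import random
import sqlite3


class RandomUserAgentMiddleware(object):
    """Downloader middleware that sets a random User-Agent on each request."""

    # Hypothetical pool; a real crawl would likely use a longer list.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/60.0",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # returning None lets Scrapy continue handling the request


class SQLitePipeline(object):
    """Item pipeline that stores each item as one row in an SQLite table."""

    def __init__(self, db_path=":memory:"):
        # Assumed default; a real project would point this at a file on disk.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        # Hypothetical schema, chosen only for this sketch.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS movies "
            "(movie_id TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT OR REPLACE INTO movies (movie_id, title) VALUES (?, ?)",
            (item["movie_id"], item["title"]),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        self.conn.close()
```

Classes like these take effect only once they are registered in the project's `DOWNLOADER_MIDDLEWARES` and `ITEM_PIPELINES` settings.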

Installation

You need Scrapy and Python 2.7 to run this project.

Install Python for your platform, then install Scrapy with:

sudo pip install scrapy

Usage

Navigate to the top of the project directory and start the crawl with:

scrapy crawl imdb -s SAVE_FOLDER="crawled_data" -s FILE_NAME="sample_movie_ids"

SAVE_FOLDER is the name of the folder where the crawled information is saved, and
FILE_NAME is the file that contains the list of starting movie ids.

A sample file named "sample_movie_ids", which contains movie ids, is included.
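
As a rough sketch of how such a file could be read, the snippet below assumes one movie id per line. That layout is a guess made for illustration, and `parse_movie_ids` and `load_movie_ids` are hypothetical helper names, not functions from this project.

```python
def parse_movie_ids(lines):
    """Return the non-empty, whitespace-stripped entries from an iterable of lines."""
    ids = []
    for line in lines:
        movie_id = line.strip()
        if movie_id:  # skip blank lines
            ids.append(movie_id)
    return ids


def load_movie_ids(path):
    """Read starting movie ids from a text file (assumed: one id per line)."""
    with open(path) as handle:
        return parse_movie_ids(handle)
```

For example, `load_movie_ids("sample_movie_ids")` would return the ids as a list of strings, provided the file follows the assumed layout.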

A movie id is taken from the IMDb URL of the movie.

Example:

IMDb URL: http://www.imdb.com/title/tt0120338
movie_id: tt0120338
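
The mapping between a URL and its movie id can be sketched with a small regular expression. The helper names below are made up for this example, and the pattern assumes ids of the form `tt` followed by digits, as in the URL above.

```python
import re

# Assumed id shape: "tt" followed by digits, inside a /title/ path.
MOVIE_ID_RE = re.compile(r"/title/(tt\d+)")


def extract_movie_id(url):
    """Return the movie id found in an IMDb title URL, or None if absent."""
    match = MOVIE_ID_RE.search(url)
    return match.group(1) if match else None


def build_title_url(movie_id):
    """Rebuild the title URL for a movie id, in the form shown above."""
    return "http://www.imdb.com/title/" + movie_id


print(extract_movie_id("http://www.imdb.com/title/tt0120338"))  # prints tt0120338
```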
