This repository holds the code for scrapers built under the project "Scrape the Planet"
- Method used for scraping: Scrapy
- Language used for scraping: Python 3.x
Minutes of the meeting: http://bit.ly/scrapeThePlanet
Clone the repository (or download it). Then, follow the installation steps to run the spiders.
python3 -m venv VENV_NAME
Windows: VENV_NAME\Scripts\activate
Linux: source VENV_NAME/bin/activate
Navigate to the repository, then run: pip3 install -r requirements.txt
-
Requirements (for scraping):
- scrapy
- requests
- python-dateutil
-
Requirements (for database):
- psycopg2
-
Requirements (for Flask application):
- flask
Note: You can comment out the following code in settings.py to avoid using pipelines.
ITEM_PIPELINES = {
'scrapeNews.pipelines.ScrapenewsPipeline': 300
}
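As a rough sketch of what the note above describes, the relevant part of settings.py could look like the fragment below once the pipeline is disabled. The explicit empty dictionary is optional and is an addition for illustration only; it matches Scrapy's default of having no item pipelines enabled.

```python
# settings.py (fragment): pipeline disabled, so scraped items are not
# passed to ScrapenewsPipeline.
# ITEM_PIPELINES = {
#     'scrapeNews.pipelines.ScrapenewsPipeline': 300,
# }

# Optional and illustrative: an explicit empty mapping, equivalent to Scrapy's default.
ITEM_PIPELINES = {}
```

Removing the comment markers (and the empty dictionary, if added) should re-enable the pipeline.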
-
PostgreSQL installation on Debian:
sudo apt-get install postgresql postgresql-contrib
-
Configurations:
- config:
/etc/postgresql/9.5/main
- data:
/var/lib/postgresql/9.5/main
- socket:
/var/run/postgresql
- port:
5432
-
Create a user. Note: your USERNAME and PASSWORD must contain only lowercase characters.
sudo -i -u postgres
createuser YOUR_ROLE_NAME/YOUR_USERNAME --interactive --pwprompt
-
Setup Database:
- Create a file scrapeNews/envConfig.py and write the following inside it:
USERNAME = 'YOUR_ROLE_NAME/YOUR_USERNAME'
PASSWORD = 'YOUR_PASSWORD'
NEWS_TABLE = 'NEWS_TABLE_NAME'
SITE_TABLE = 'SITE_TABLE_NAME'
LOG_TABLE = 'LOG_TABLE_NAME'
DATABASE_NAME = 'DATABASE_NAME'
HOST_NAME = 'HOST_NAME'
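To illustrate how these settings might be consumed, here is a minimal sketch that maps envConfig-style values to keyword arguments for psycopg2.connect. The helper names (build_connection_kwargs, connect) and the example values are hypothetical, and this is not necessarily how the project's pipeline opens its connection. The port is the 5432 listed under Configurations.

```python
from types import SimpleNamespace


def build_connection_kwargs(config):
    """Map envConfig-style settings to psycopg2.connect keyword arguments."""
    return {
        "dbname": config.DATABASE_NAME,
        "user": config.USERNAME,
        "password": config.PASSWORD,
        "host": config.HOST_NAME,
        "port": 5432,  # port listed in the Configurations section
    }


def connect(config):
    """Open a connection; psycopg2 is imported lazily so the helper above works without it."""
    import psycopg2

    return psycopg2.connect(**build_connection_kwargs(config))


# Hypothetical placeholder values standing in for scrapeNews/envConfig.py.
example_config = SimpleNamespace(
    USERNAME="scraper",
    PASSWORD="secret",
    DATABASE_NAME="newsdb",
    HOST_NAME="localhost",
)
print(build_connection_kwargs(example_config))
```

In the project itself, the config object would presumably be the imported envConfig module rather than a SimpleNamespace.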
To run a spider, navigate to the folder containing scrapy.cfg, then run:
scrapy crawl SPIDER_NAME
-
SPIDER_NAME List:
- indianExpressTech
- indiaTv
- timeTech
- ndtv
- inshorts
- zee
- News18Spider
- moneyControl
- oneindia
- oneindiaHindi
- firstpostHindi
- firstpostSports
- newsx
- hindustan
- asianage
- timeNews
- newsNation [In development]
-
Options:
-
To set the number of pages to be scraped, use
-a pages=X
(X = number of pages to scrape; do not put spaces around "="). Applicable for:
- indianExpressTech
- indiaTv
- timeTech
- moneyControl
- oneindia
- oneindiaHindi
- firstpostHindi
- firstpostSports
- newsx
- asianage
- ndtv
- timeNews
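Scrapy hands every -a value to the spider as a string, so a spider that accepts pages has to convert it before use. The sketch below shows one way that conversion could be done; the helper name parse_page_count and the default of 1 are assumptions for illustration, not taken from the spiders in this repository.

```python
def parse_page_count(raw_value, default=1):
    """Convert a '-a pages=X' value (a string, or None if omitted) to a positive int."""
    if raw_value is None:
        return default  # assumption: fall back to a single page when the option is omitted
    count = int(raw_value)  # raises ValueError for non-numeric input
    if count < 1:
        raise ValueError(f"pages must be at least 1, got {count}")
    return count


print(parse_page_count("3"))   # value as Scrapy would pass it for -a pages=3
print(parse_page_count(None))  # option omitted, falls back to the default
```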
-
To set the number of pages to skip, use
-a offset=X
(X = number of pages to skip; do not put spaces around "="). Applicable for:
- indianExpressTech
- indiaTv
- timeTech
- moneyControl
- oneindia
- oneindiaHindi
- firstpostHindi
- firstpostSports
- newsx
- asianage
- timeNews
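The options above can be combined in a single command. As a small sketch, the hypothetical helpers below (build_crawl_command and run_spider are illustrative names, not part of the repository) assemble the scrapy crawl command line and optionally run it from the folder containing scrapy.cfg.

```python
import subprocess


def build_crawl_command(spider_name, pages=None, offset=None):
    """Build the argument list for 'scrapy crawl', adding -a options only when given."""
    command = ["scrapy", "crawl", spider_name]
    if pages is not None:
        command += ["-a", f"pages={pages}"]
    if offset is not None:
        command += ["-a", f"offset={offset}"]
    return command


def run_spider(spider_name, pages=None, offset=None, project_dir="."):
    """Run the spider; project_dir should be the folder containing scrapy.cfg."""
    return subprocess.run(
        build_crawl_command(spider_name, pages, offset),
        cwd=project_dir,
        check=True,  # raise CalledProcessError if scrapy exits with an error
    )


print(" ".join(build_crawl_command("indiaTv", pages=5, offset=2)))
```

The printed line is the same command one would type by hand: scrapy crawl indiaTv -a pages=5 -a offset=2.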
-
Happy collaborating!