Web scraper used to scrape Amazon product pages while avoiding captchas.
Author: Julius Remigio
Install required libraries:
pip install -r requirements.txt
See the Scrapy documentation: https://docs.scrapy.org/
The proxies in proxy.txt were compiled from free public proxies that were available at the time of scraping. The list should be refreshed regularly with working proxies to improve the success rate.
List of public proxies: http://proxylist.hidemyass.com/
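One common way to use such a proxy list is a downloader middleware that picks a random proxy for each request, so blocked proxies don't stall the whole crawl. The sketch below is an assumption about how this repo might wire it up (the class name, the PROXY_LIST setting, and the from_crawler hook are illustrative, not this project's actual code); Scrapy's built-in HttpProxyMiddleware then honors request.meta["proxy"]:

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware: assigns a random proxy from
    proxy.txt to each outgoing request. Not part of this repo's code."""

    def __init__(self, proxies):
        # proxies: list of proxy URLs, e.g. "http://1.2.3.4:8080"
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when building the middleware; the
        # PROXY_LIST setting name here is an assumption.
        path = crawler.settings.get("PROXY_LIST", "proxy.txt")
        with open(path) as fh:
            proxies = [line.strip() for line in fh if line.strip()]
        return cls(proxies)

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware routes the request through
        # whatever proxy is set in request.meta["proxy"].
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
```

The middleware would then be enabled in DOWNLOADER_MIDDLEWARES with a priority below 750 so it runs before HttpProxyMiddleware.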
Scraping sessions are started with the scrapy CLI. Custom arguments are passed with the -a flag.
Custom Parameters:
- file - CSV file with a column header 'asin' (the list of Amazon products to scrape)
- html - folder in which to store the HTML of scraped product pages
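The example command below passes a gzip-compressed CSV, so the spider presumably decompresses it before reading the 'asin' column. A minimal sketch of how that argument might be consumed (the function name is illustrative, not the spider's actual code):

```python
import csv
import gzip


def read_asins(path):
    """Yield values from the 'asin' column of a CSV file,
    transparently handling a .gz-compressed file as in the
    example command. Illustrative helper, not repo code."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as fh:
        for row in csv.DictReader(fh):
            yield row["asin"]
```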
Example:
scrapy crawl product -a html=./../../html -a file=./../reviews_Women.csv.gz -o ./../reviews_Women.jl --logfile ./../reviews_Women.csv.log
Used to change scraping behavior, such as retry and middleware configuration.
The spiders directory contains all spider classes. Currently there is only a product spider for scraping Amazon product pages.
Notebooks are used for transforming the data and preparing it for model consumption.
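The -o flag in the example command writes the scraped items as JSON Lines (one JSON object per line, the .jl extension). A minimal sketch of loading that output in a notebook (function name is illustrative; with pandas installed, pandas.read_json(path, lines=True) does the same in one call):

```python
import json


def load_items(path):
    """Load a Scrapy JSON Lines export (e.g. reviews_Women.jl)
    into a list of dicts. Illustrative notebook helper."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]
```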