a collection of personal web crawler projects
- settings.py and middlewares.py have been extended to enable rotating proxies and random user agents (see the sketch below)
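The repo's actual middleware isn't reproduced here, but a minimal sketch of the idea, assuming a custom downloader middleware, looks like the following; the class name, the "myproject" module path, the proxy URLs, and the user agent strings are placeholders, and the settings wiring is shown as comments.

```python
# middlewares.py -- minimal sketch of a downloader middleware that picks a
# random proxy and User-Agent for each request; proxies, UA strings, and the
# "myproject" module path below are placeholders, not the repo's real values
import random


class RandomProxyUserAgentMiddleware:
    def __init__(self, proxies, user_agents):
        self.proxies = proxies
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST / USER_AGENT_LIST are assumed custom settings
        return cls(
            crawler.settings.getlist("PROXY_LIST"),
            crawler.settings.getlist("USER_AGENT_LIST"),
        )

    def process_request(self, request, spider):
        # choose a fresh proxy and User-Agent for every outgoing request
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        if self.user_agents:
            request.headers["User-Agent"] = random.choice(self.user_agents)


# settings.py -- wiring it up (priority 543 is an arbitrary middle value)
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomProxyUserAgentMiddleware": 543,
# }
# PROXY_LIST = ["http://1.2.3.4:8080"]
# USER_AGENT_LIST = ["Mozilla/5.0 (X11; Linux x86_64) ..."]
```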
scrapes the HomeFinder, Realtor, and Homes sites for real estate listings by city and state search terms
- aggregates the data into one master list, joined on full address while preserving each listing's source (see the merge sketch after the file list)
- accesses each site's structured JSON response instead of parsing HTML (see the spider sketch after the file list)
- see homefinder_spider.py for spider code
- see homefinder_data.json for its sample output
- see realtor_spider.py for spider code
- see realtor_data.json for its sample output
- see homes_spider.py for spider code
- see homes_data.json for its sample output
- see merge_data.py for data aggregator code
- see master_list.csv for its sample output
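The JSON-first approach used by the listing spiders can be sketched roughly as below, assuming a hypothetical search endpoint; the URL pattern and the field names ("results", "full_address", "price") are illustrative guesses, not the actual HomeFinder/Realtor/Homes schemas.

```python
# rough sketch of a listing spider that reads a site's JSON search response
# directly instead of scraping HTML; endpoint and field names are assumed
import json

import scrapy


class ListingSpiderSketch(scrapy.Spider):
    name = "listing_sketch"

    def __init__(self, city="Chicago", state="IL", **kwargs):
        super().__init__(**kwargs)
        # hypothetical search endpoint that returns structured JSON
        self.start_urls = [
            f"https://example-listings.com/api/search?city={city}&state={state}"
        ]

    def parse(self, response):
        data = json.loads(response.text)
        for listing in data.get("results", []):  # assumed response key
            yield {
                "address": listing.get("full_address"),
                "price": listing.get("price"),
                "source": "example-listings",
            }
```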
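And a rough sketch of the aggregation idea behind merge_data.py, assuming pandas and the column names used above; the repo's actual join logic and field names may differ.

```python
# rough sketch of merging the three spiders' JSON outputs into one master
# list keyed on address, keeping every contributing source
import pandas as pd

SOURCES = {
    "homefinder": "homefinder_data.json",
    "realtor": "realtor_data.json",
    "homes": "homes_data.json",
}

frames = []
for source, path in SOURCES.items():
    df = pd.read_json(path)
    df["source"] = source
    # assumed column name; normalize so the join key matches across sites
    df["address"] = df["address"].str.strip().str.lower()
    frames.append(df)

# stack all sources, then collapse to one row per address, preserving the
# contributing sources as a comma-separated list
master = (
    pd.concat(frames, ignore_index=True)
      .groupby("address", as_index=False)
      .agg({"price": "first", "source": ", ".join})
)
master.to_csv("master_list.csv", index=False)
```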
scrapes the Steam Top Sellers list and emails the user a curated set of deals under $10 (see the sketch after the file list)
- see prices_spider.py for spider code
- see scrape_send.py for emailer code with AWS SES
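A rough sketch of the filter-and-email step, assuming the spider's output lands in a JSON file and that SES is called through boto3; the file name, field names, region, and email addresses are all placeholders.

```python
# sketch of curating sub-$10 deals from the scraped prices and emailing
# them via AWS SES; "prices_data.json" and its fields are assumptions
import json

import boto3


def build_deals_email(path="prices_data.json", limit=10.00):
    with open(path) as f:
        games = json.load(f)
    deals = []
    for game in games:
        # assumed fields; strip a leading "$" if the price was scraped as text
        price = float(str(game.get("price", "inf")).lstrip("$"))
        if price < limit:
            deals.append(f"{game.get('title')}: ${price:.2f}")
    return "\n".join(deals) or "No qualifying deals today."


def send_email(body, sender="me@example.com", recipient="me@example.com"):
    ses = boto3.client("ses", region_name="us-east-1")
    ses.send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": "Steam deals under $10"},
            "Body": {"Text": {"Data": body}},
        },
    )


if __name__ == "__main__":
    send_email(build_deals_email())
```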
scrapes HackerNews article titles, source links, and upvote points
- uses pagination to access subsequent article pages (see the pagination sketch after the file list)
- see hackernews_spider.py for spider code
- see news_data.json for its sample output
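The pagination idea can be sketched as follows; the CSS/XPath selectors reflect news.ycombinator.com's current markup but should be treated as assumptions that may drift over time.

```python
# sketch of parsing one page of stories, then following the "More" link
import scrapy


class HackerNewsSketchSpider(scrapy.Spider):
    name = "hackernews_sketch"
    start_urls = ["https://news.ycombinator.com/news"]

    def parse(self, response):
        for row in response.css("tr.athing"):
            yield {
                "title": row.css("span.titleline > a::text").get(),
                "link": row.css("span.titleline > a::attr(href)").get(),
                # the points live in the row that follows each title row
                "points": row.xpath(
                    "following-sibling::tr[1]//span[@class='score']/text()"
                ).get(),
            }
        # pagination: keep following "More" until the site stops providing it
        next_page = response.css("a.morelink::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```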
scrapes Amazon search results by search term
- the user can pass category=<some-search-term> on the command line to scrape that term's results (see the sketch after the file list)
- see amazon_spider.py for spider code
- see amazon_data.json for its sample output
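A rough sketch of how the search term can drive the spider, assuming Scrapy's standard -a spider-argument mechanism; the results-page URL and selectors are assumptions, and Amazon blocks scrapers aggressively, which is where the proxy/user-agent middleware above comes in.

```python
# sketch of accepting the search term as a spider argument, e.g.
#   scrapy crawl amazon_sketch -a category="wireless mouse"
# selectors and the results-page URL pattern are assumptions
import scrapy


class AmazonSketchSpider(scrapy.Spider):
    name = "amazon_sketch"

    def __init__(self, category="laptop", **kwargs):
        super().__init__(**kwargs)
        # build the search-results URL from the command-line term
        self.start_urls = [
            "https://www.amazon.com/s?k=" + category.replace(" ", "+")
        ]

    def parse(self, response):
        for item in response.css("div[data-component-type='s-search-result']"):
            yield {
                "title": item.css("h2 a span::text").get(),
                "price": item.css("span.a-offscreen::text").get(),
                "link": response.urljoin(item.css("h2 a::attr(href)").get() or ""),
            }
```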