I am working on this web scraper for zoocasa.com and need some help getting it to run over multiple pages. There are two scripts here: one to scrape the links for the homes/condos (Scrape Toronto Housing Data Links) and one to gather the data from those links and clean it (Scrape Data From Links). Currently I am generating and rotating random proxies as well as changing user agents on each page; however, I am only able to get the first page of data back. I am using random crawl delays between 3 and 5 seconds, as suggested by the robots.txt. I am not being blocked by the site, but I always end up with duplicates in my dataframe. Any suggestions are highly welcome!
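Below is a minimal sketch of the kind of paginated link collection described above, with rotating user agents, rotating proxies, a 3-5 second crawl delay, and a dedup step before building the dataframe. The `?page=N` query parameter, the `BASE_URL`, the CSS selector, and the proxy list are all placeholders (not taken from the actual scripts) and would need to be matched to the site's real URL pattern and markup:

```python
# Sketch: paginated scrape with rotating user agents/proxies and a crawl delay.
# BASE_URL, the "page" parameter, the CSS selector, and PROXIES are placeholders.
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.zoocasa.com/toronto-on-real-estate"  # hypothetical listing URL
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]  # placeholder proxy pool


def scrape_listing_links(num_pages):
    links = []
    for page in range(1, num_pages + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        proxy = random.choice(PROXIES)
        # The page number has to change on every request; requesting BASE_URL
        # alone returns page 1 each time, which is one way duplicates appear.
        resp = requests.get(
            BASE_URL,
            params={"page": page},  # hypothetical pagination parameter
            headers=headers,
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Placeholder selector -- replace with the element that holds listing links.
        for a in soup.select("a.listing-link"):
            links.append(a.get("href"))
        time.sleep(random.uniform(3, 5))  # crawl delay suggested by robots.txt
    # Drop repeated links before building the dataframe.
    return pd.DataFrame({"url": links}).drop_duplicates().reset_index(drop=True)
```

One thing worth checking: if the listings are rendered by JavaScript (e.g. fetched from an API after the page loads), plain `requests` may receive the same first-page HTML no matter what query string is sent, which would also explain getting only page one and duplicate rows. In that case the underlying API endpoint or a headless browser would be needed instead.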