I am working on this web scraper for zoocasa.com and need some help getting it to run over multiple pages. There are two scripts here: one to scrape the links for the homes/condos (Scrape Toronto Housing Data Links) and one to gather the data from those links and clean it (Scrape Data From Links). Currently I am generating and rotating random proxies as well as changing user agents on each page; however, I am only able to get the first page of data back. I am using random crawl delays between 3 and 5 seconds, as suggested by the robots.txt. I am not being blocked by the site, but I always end up with duplicates in my dataframe. Any suggestions are highly welcome!
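Below is a minimal sketch of the kind of paginated link collection described above, with rotating user agents, rotating proxies, a 3-5 second crawl delay, and a dedup step before building the dataframe. The `?page=N` query parameter, the `BASE_URL`, the CSS selector, and the proxy list are all placeholders (not taken from the actual scripts) and would need to be matched to the site's real URL pattern and markup:

```python
# Sketch: paginated scrape with rotating user agents/proxies and a crawl delay.
# BASE_URL, the "page" parameter, the CSS selector, and PROXIES are placeholders.
import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.zoocasa.com/toronto-on-real-estate"  # hypothetical listing URL
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]
PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]  # placeholder proxy pool


def scrape_listing_links(num_pages):
    links = []
    for page in range(1, num_pages + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        proxy = random.choice(PROXIES)
        # The page number has to change on every request; requesting BASE_URL
        # alone returns page 1 each time, which is one way duplicates appear.
        resp = requests.get(
            BASE_URL,
            params={"page": page},  # hypothetical pagination parameter
            headers=headers,
            proxies={"http": proxy, "https": proxy},
            timeout=30,
        )
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Placeholder selector -- replace with the element that holds listing links.
        for a in soup.select("a.listing-link"):
            links.append(a.get("href"))
        time.sleep(random.uniform(3, 5))  # crawl delay suggested by robots.txt
    # Drop repeated links before building the dataframe.
    return pd.DataFrame({"url": links}).drop_duplicates().reset_index(drop=True)
```

One thing worth checking: if the listings are rendered by JavaScript (e.g. fetched from an API after the page loads), plain `requests` may receive the same first-page HTML no matter what query string is sent, which would also explain getting only page one and duplicate rows. In that case the underlying API endpoint or a headless browser would be needed instead.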