This project aims to collect a list of organizations in Madagascar, along with their activity, address and contact details, from the Yellow Page Africa.
I made this script to help a friend of mine who recently launched his own startup run a business call campaign.
WARNING: Web scraping can put a strain on a website and disrupt its services. I highly recommend first reading and agreeing to the website's Terms of Service, and experimenting with this script at your own risk.
This is a Python Scrapy web crawler that uses Splash, running in a Docker container, to load and render the JavaScript embedded in pages and serve back the fully rendered pages (including the JavaScript events attached to the different components).
# For conda user
conda install -c conda-forge scrapy
# Or using pip
pip install Scrapy
# Install the scrapy-splash plugin using pip
pip install scrapy-splash
# Pull the image
sudo docker pull scrapinghub/splash
# Run the container
sudo docker run -p 8050:8050 scrapinghub/splash
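Once the container is up, you can quickly confirm that Splash is reachable before wiring it into Scrapy. The following is a minimal sketch using the requests package and Splash's render.html endpoint; the target URL is a placeholder, not the project's actual start URL.
import requests  # assumption: requests is installed (pip install requests)

# Ask Splash to render a JavaScript-heavy page and return the resulting HTML.
splash_url = 'http://localhost:8050/render.html'  # use the IP address instead of localhost on Windows & Mac
response = requests.get(splash_url, params={
    'url': 'https://example.com/',  # placeholder page to render
    'wait': 2,                      # seconds to let the embedded JavaScript run
})
print(response.status_code)   # 200 means Splash rendered the page
print(response.text[:500])    # first characters of the rendered HTML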
In order to use Splash in the pga.py spider script, the following settings have to be added to settings.py:
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# For Windows & Mac, use the IP address instead of localhost
SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
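With these settings in place, the spider issues its requests through Splash. The actual pga.py is not reproduced here; the snippet below is only a minimal sketch of how such a spider might look, with a hypothetical start URL, CSS selectors and field names (the real ones in pga.py may differ).
import scrapy
from scrapy_splash import SplashRequest

class PgaSpider(scrapy.Spider):
    name = 'pga'

    def start_requests(self):
        # Hypothetical listing URL; the real spider targets Yellow Page Africa listing pages.
        url = 'https://www.example.com/madagascar/organizations'
        # Render the page through Splash and wait for the embedded JavaScript to run.
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # Hypothetical selectors; adjust them to the actual page structure.
        for org in response.css('div.listing'):
            yield {
                'name': org.css('h2::text').get(),
                'activity': org.css('.activity::text').get(),
                'address': org.css('.address::text').get(),
                'contact': org.css('.phone::text').get(),
            }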
(Optional) Depending on the scenario, basic page restrictions can be bypassed using the following packages along with their settings. They can be used together or separately, as they already have distinct priority values, but make sure to add the corresponding settings to the DOWNLOADER_MIDDLEWARES section (a combined settings sketch is shown after these options).
- The simplest way to make requests toward the website look legitimate is to change the User-Agent. This can be done statically in the settings.py file by setting the following parameter.
USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
- scrapy-user-agents can be used to automatically pick a random User-Agent from a pool of pre-defined user agents. Install it via pip and add the following settings to the DOWNLOADER_MIDDLEWARES section.
pip install scrapy-user-agents
#....
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
#....
- scrapy-proxy-pool can also be used to skirt firewall rules via a pool of pre-defined random proxies. Install it via pip and add the following settings to the DOWNLOADER_MIDDLEWARES section.
pip install scrapy-proxy-pool
#....
'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
#....
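For reference, here is a sketch of what the DOWNLOADER_MIDDLEWARES section of settings.py could look like with Splash and both optional packages enabled together; treat it as an illustration rather than the project's exact configuration.
# settings.py (illustrative combination; keep only the entries you actually use)
DOWNLOADER_MIDDLEWARES = {
    # Splash middlewares
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # Random User-Agent rotation (scrapy-user-agents)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    # Proxy rotation and ban detection (scrapy-proxy-pool)
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
}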
Before running the crawl, make sure the Docker container has started successfully. Then go to the project directory and run the following command. The output will be generated as a CSV file named yellow-page-data.csv.
Take a look at the scraped data sample, containing 4K+ organizations, available here.
scrapy crawl pga -o yellow-page-data.csv
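Once the crawl has finished, the CSV can be inspected with a few lines of Python. This is only a sketch; the column names are assumptions based on the fields described above and may differ from the actual export.
import csv

# Read the exported data and print a quick summary.
with open('yellow-page-data.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(f'{len(rows)} organizations scraped')
for row in rows[:5]:
    # Column names (name, activity, contact) are assumptions, adjust to the real header.
    print(row.get('name'), '|', row.get('activity'), '|', row.get('contact'))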
Do your own experiments and generate your own spider with the following command. For more details, see the documentation.
scrapy genspider yourspider yourdomain.com
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.