This repository contains a web-scraping assignment written in Python for the course "Type and life-cycle of data" in the Master's Degree in Data Science at the Open University of Catalonia (UOC).
The project uses Scrapy, an open-source, collaborative framework for extracting data from websites. Here, the spider crawls all active animal shelters that use Bambu CMS to promote the adoption of rescued animals, and saves the data in a comma-separated values (CSV) file.
Fork or download the repository, make sure a recent version of Python is installed, and install Scrapy:
> pip install scrapy
Then, from the Scranimal folder, run the script:
> ./run.sh
The crawler first visits the Bambu page that lists the active shelters using this CMS. It then visits every page that lists animals up for adoption and, finally, every pet profile.
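The three-stage traversal described above can be sketched without any network access. This is a minimal illustration, not the repository's spider: the in-memory `SITE` dictionary, its URLs, and the `LinkCollector`, `extract_links`, and `crawl` helpers are all hypothetical stand-ins for the real pages and for Scrapy's request scheduling.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL -> HTML body (stands in for HTTP responses).
SITE = {
    "/shelters": '<a href="/shelter-a/animals">A</a><a href="/shelter-b/animals">B</a>',
    "/shelter-a/animals": '<a href="/shelter-a/pet/1">Luna</a><a href="/shelter-a/pet/2">Rex</a>',
    "/shelter-b/animals": '<a href="/shelter-b/pet/7">Mia</a>',
    "/shelter-a/pet/1": "<h1>Luna</h1>",
    "/shelter-a/pet/2": "<h1>Rex</h1>",
    "/shelter-b/pet/7": "<h1>Mia</h1>",
}


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def crawl(start_url, fetch=SITE.get):
    """Visit the shelter index, then each animal listing, then each pet profile."""
    profiles = []
    for listing_url in extract_links(fetch(start_url)):          # stage 1: shelter index
        for profile_url in extract_links(fetch(listing_url)):    # stage 2: animal listing
            profiles.append((profile_url, fetch(profile_url)))   # stage 3: pet profile
    return profiles


visited = [url for url, _ in crawl("/shelters")]
print(visited)
```

In a Scrapy spider, each stage would typically be its own callback that yields further requests instead of a nested loop, but the order of visits follows the same idea.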
If you want the data file in a format other than CSV, open a terminal and run the following command in the Scranimal folder:
> scrapy crawl adopting -o filename.ext -t FEED_FORMAT
Where FEED_FORMAT can be:
- json
- jsonlines
- csv
- xml
- pickle
- marshal
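To give a feel for how some of these formats differ, the sketch below serializes the same records as json, jsonlines, and csv using only the standard library. It does not call Scrapy's own exporters; the `serialize` helper and the sample `pets` fields are hypothetical and are not the spider's actual output schema.

```python
import csv
import io
import json

# Hypothetical records; the real spider's field names may differ.
pets = [
    {"name": "Luna", "species": "dog"},
    {"name": "Mia", "species": "cat"},
]


def serialize(items, feed_format):
    """Return the items as text in one of three illustrative formats."""
    if feed_format == "json":
        # One JSON array holding every item.
        return json.dumps(items)
    if feed_format == "jsonlines":
        # One JSON object per line, convenient for streaming large feeds.
        return "\n".join(json.dumps(item) for item in items)
    if feed_format == "csv":
        # Header row taken from the keys of the first item.
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=list(items[0]), lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(items)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {feed_format}")


for fmt in ("json", "jsonlines", "csv"):
    print(f"--- {fmt} ---")
    print(serialize(pets, fmt))
```

The binary formats (pickle, marshal) are left out here because their output is not human-readable.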
The source code can be found under the Scranimal folder. The most important files are:
- scranimal/settings.py: crawler settings such as the bot name, user agent, throttle time, number of concurrent requests, and log file.
- scranimal/spiders/adoptingSpider.py: the crawler implementation. The parse function is the default callback that Scrapy uses to process the spider's first responses.
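As a rough idea of what such a settings file can contain, here is a hedged fragment. The setting names are standard Scrapy settings, but every value below (bot name, user agent string, delay, concurrency, log file name) is an illustrative assumption and not a copy of this repository's `scranimal/settings.py`.

```python
# Illustrative Scrapy settings; all values are assumptions, not the project's own.
BOT_NAME = "scranimal"
USER_AGENT = "scranimal-example (+https://example.org/contact)"  # hypothetical

ROBOTSTXT_OBEY = True          # respect the target site's robots.txt
DOWNLOAD_DELAY = 1.0           # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 8        # upper bound on requests in flight at once
AUTOTHROTTLE_ENABLED = True    # adapt the delay to observed server latency
LOG_FILE = "scranimal.log"     # write log output to this file
```

A conservative delay and a low concurrency limit keep the load on the shelters' sites small, which matters when crawling many small organisations.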
This project is licensed under the CC BY-NC-SA 4.0 license (Attribution-NonCommercial-ShareAlike).
You are free to:
- Share: copy and redistribute the material in any medium or format.
- Adapt: remix, transform, and build upon the material.

Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.