Learning how to use scrapy to download text data and images from the web.
We will use scrapy to download text data and images from books.toscrape.com, a demo website built for web scraping purposes, which contains data about 1000 books.
This repository is divided into six independent directories: one self-contained scrapy spider and five scrapy projects.
- `01_simple_spider` contains a self-contained spider. When launched, it downloads book data (title and price) and crawls multiple web pages automatically.
- `02_book_data_dict` is the simplest possible scrapy project. It downloads book data (item number, title, price and star rating), yielding a plain Python dictionary (a minimal spider along these lines is sketched after this list).
- `03_book_data_item` downloads some book data (item number, title, price, star rating, thumbnail path and detailed book url), but it yields a scrapy item instead of a plain dictionary. Moreover, the `start_requests` initialization method is used instead of the simpler `start_urls` attribute.
- `04_book_data_item_loader` uses a scrapy item loader object to load the book data item. This syntax is more complex but allows more flexibility: for example, we can create an item from data taken from multiple web pages. Here we download the item number, title, price, star rating, thumbnail path, detailed book url, image url and product description of each book. In order to scrape the detailed book page, a secondary parse method has to be defined.
- `05_book_images_thumbnails` downloads both text data and images. Here the images are downloaded following the thumbnail paths on the main page of the website. This project is slightly more complex, as it has to use a scrapy item with predefined attributes together with the default scrapy image pipeline, which needs to be activated in `settings.py`. Thanks to the pipeline, images are automatically downloaded into the `images/full` directory and named with a SHA1 hash of their urls (by default). In addition, the pipeline automatically creates thumbnails of the downloaded images in `images/thumb`, as specified in `settings.py`. In this example we are creating thumbnails of already small pictures (they were thumbnails to begin with).
- `06_book_images_renamed` is similar to the previous project, but this time we want to rename the downloaded images. To do this, we need to create a custom scrapy pipeline that inherits from the default image pipeline (see the pipeline sketch after this list). Here the images are downloaded from the detailed book pages, so they have a higher resolution. At every run, they are automatically downloaded into `images/full`. Moreover, thanks to the custom pipeline, the image names are changed from the default hash to the actual book title.
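To give a flavour of the simplest approach, here is a minimal sketch of a dict-yielding spider in the spirit of `02_book_data_dict`. The selector expressions and field names are illustrative, not copied from the project:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    """Minimal sketch: yield one plain dictionary per book and follow pagination."""

    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Each book on the page sits inside an <article class="product_pod"> tag.
        for book in response.css("article.product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
            }
        # Follow the "next" pagination link, if present, to crawl all catalogue pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

The renaming idea behind `06_book_images_renamed` boils down to overriding the `file_path` method of the default image pipeline. The sketch below assumes a recent scrapy version (2.4+, where the item is passed to `file_path`) and an item with a `title` field; the class name is made up, so check the project for the actual code:

```python
from scrapy.pipelines.images import ImagesPipeline


class RenamedImagesPipeline(ImagesPipeline):
    """Sketch: store images under the book title instead of the default SHA1 hash."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # Sanitize the title so it is a valid file name, then keep the
        # default "full/" subdirectory used by the image pipeline.
        safe_title = item["title"].replace("/", "_")
        return f"full/{safe_title}.jpg"
```

Like the default pipeline, a custom pipeline only runs if it is listed under `ITEM_PIPELINES` in `settings.py`.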
First of all, fork this repo and then clone your fork to have a local copy.
To use scrapy you first need to install it. You can create this project's environment with
$ conda env create -f books-to-scrapy.yml
and then activate it with
$ conda activate books-to-scrapy
Alternatively, you can install scrapy directly by running
$ conda install -c conda-forge scrapy
Moreover, to download images you also need the Pillow library, which can be installed with
$ conda install pillow
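As an optional sanity check, you can verify that both packages are importable:
$ python -c "import scrapy, PIL; print(scrapy.__version__)"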
Scrapy comes with a few very useful command-line commands. To have a look at the list of these commands, simply run the command
$ scrapy
once scrapy is correctly installed. Every scrapy command starts with the `scrapy` keyword. Most of the time, you have to be in the project root directory (the one containing the `scrapy.cfg` configuration file) to run these commands correctly.
$ scrapy startproject <project_name>
creates a new scrapy project skeleton in the current directory.
$ scrapy genspider <spider_name> <url_domain>
generates a new spider template for the given domain inside an existing project.
$ scrapy list
lists all the spiders defined in the current project.
$ scrapy runspider <spider_script.py>
runs a self-contained spider script, without needing a full project. The spider will run with the implemented logic. By default, the crawled data is printed to screen, but you can also pass an optional `--output filename.format` argument (for example `-o out.csv`) to save it to a file in the requested format. Scrapy automatically implements several output formats, such as `csv`, `json`, `jsonlines`, `xml` and `pickle`.
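For example, assuming the spider lives in a file called `books_spider.py` (a hypothetical name), this saves the scraped data as JSON:
$ scrapy runspider books_spider.py -o books.json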
$ scrapy crawl <spider_name>
runs the named spider from inside a project. Again, you can save the data with the optional `--output filename.format` argument.
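For example, with a spider named `books` (a placeholder, use a name returned by `scrapy list`):
$ scrapy crawl books -o books.csv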
This project is licensed under the MIT License. Use it as you wish.