When it comes to data collection, web-crawling (also known as web-scraping or screen-scraping) is a common approach in our increasingly digital era, and a common stumbling block. With such a wide range of tools and languages available (Selenium, Requests, and HTML, to name just a few), developing and implementing a web-crawling pipeline is often a frustrating experience for researchers, especially those without a computer science background.
Whatever your background, this workshop will give you the foundation to use web-crawling in your research. We will tackle common problems including collecting web addresses/URLs (via automated Google search), downloading website copies (with wget), non-scalable website scraping (with Requests, as sketched below), and scalable crawling of text (with Scrapy). No web-crawling experience is required, but some Python know-how is expected.
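As a taste of the non-scalable approach, here is a minimal sketch (not the workshop's exact code, and using a placeholder URL) that fetches one page with Requests, pulls out its paragraph text with BeautifulSoup, and pauses politely before any follow-up request:

```python
# A minimal single-page scrape: fetch with Requests, parse with BeautifulSoup.
# The URL below is a placeholder; swap in a page you have permission to scrape.
import time

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                 # hypothetical target page
response = requests.get(url, timeout=10)    # fetch the raw HTML
response.raise_for_status()                 # stop early if the request failed

soup = BeautifulSoup(response.text, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print("\n".join(paragraphs))

time.sleep(2)  # polite pause before hitting the server again
```

This works fine for a handful of pages; the Scrapy portion of the workshop covers what to do when you need thousands.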
- Understanding the building blocks for digital data collection via web-crawling and -scraping
- Intuitions around the uses and limits of:
- APIs (Application Programming Interfaces)
- Exploiting website structure (HTML/CSS)
- Web-crawling for research at scale
- Knowledge of common problems in web-crawling and their fixes, like:
- Nested websites --> vertical crawling (link extraction)
- Getting blocked --> polite pauses between server requests
- Hands-on skill with:
- Collecting domains to scrape
- Non-scalable website scraping with Requests
- Parsing website text with BeautifulSoup
- Crawling at scale with Scrapy (see the spider sketch just after this list)
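To make the Scrapy piece concrete, below is a bare-bones spider sketch (again, not the workshop's exact code; the domain is a placeholder) that yields page text, follows in-page links (vertical crawling), and pauses politely between server requests:

```python
# A bare-bones Scrapy spider: scrape text, follow links, and be polite.
import scrapy


class TextSpider(scrapy.Spider):
    name = "text_spider"
    start_urls = ["https://example.com"]  # placeholder start page
    custom_settings = {
        "DOWNLOAD_DELAY": 2,     # polite pause (seconds) between requests
        "ROBOTSTXT_OBEY": True,  # respect the site's robots.txt
    }

    def parse(self, response):
        # Yield the visible paragraph text of the current page
        yield {
            "url": response.url,
            "text": " ".join(response.css("p::text").getall()),
        }
        # Vertical crawling: extract links and queue them for the same parser
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Save the snippet to a file and run it with `scrapy runspider yourfile.py -o output.json`; Scrapy handles request scheduling, duplicate filtering, and throttling for you.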
We will get our hands dirty implementing an assortment of simple web-crawling tools. To follow along with the code (which is the point), you will need some familiarity with Python and Jupyter Notebooks. If you haven't programmed in Python or used Jupyter Notebooks, please do some self-teaching before this workshop using resources like those listed below.
For simplicity, just click the "Launch Binder" button (at the top of this README) to create a virtual environment ready for this workshop. It may take a few minutes to load; if it takes longer than 10 minutes, try again.
If you want to run the code on your own computer, you have two options. One is Anaconda, which makes installation easy: simply download and install Anaconda. Alternatively, if you already have Python 3.x with the libraries listed in requirements.txt, or you don't mind installing everything in a virtual environment (best practice when working locally), you're welcome to clone this repository and follow along on your own machine. You can install all the necessary packages like so:
pip3 install -r requirements.txt
- Slides (also in folder above)
- Introduction to Jupyter Notebooks (Real Python)
- Quick Python intro (a Jupyter Notebook)
- Great book on Python (with exercises): Python for Everybody (Charles Severance)
- Official Python Tutorial
- Python tutorials for social scientists (Neal Caren)
- Official Scrapy tutorial
- Blog on using item pipelines in Scrapy
- Storing Scrapy output to MongoDB (humongous database!)
- Examples of using wget to download from websites
- Scale Scrapy with a pre-cooked Docker assemblage: Scrapy Cluster
- Video tutorial on APIs, RSS, and Scraping from PyCon
- Parse HTML, by author of Requests library: requests-HTML
- Extract different parts of URL: furl
- Parse CSS files: cssutils
- Parse common formats in newspaper data: newspaper
- Scrape from the past: Internet Archive Python Library
- Browser automation for handling interactive, JavaScript-heavy pages: Selenium
These books are available for free through some universities, like Georgetown or UC Berkeley (log in here, then search for the titles)
- Popular intro book with Ch. 12 on web-scraping: Automate the Boring Stuff with Python
- Complete scraping intro: Web-scraping with Python
- Detailed look at mechanics and approaches in Scrapy
If you spot a problem with these materials, please open an issue describing the problem or contact Jaren at jhaber@berkeley.edu. If you want to suggest additional resources or materials, please create a branch and submit a pull request!