Winter 2023 MDST Webscraping Project
Have you ever felt like stalking someone's 📱insta485gram📱, but were too embarrassed to do it while logged into your account? Maybe you wanted to pull quotes from someone's favorite 🎥movie🎥 so you could serenade them with a random quote every day! Now imagine these kinds of tasks, but on a much larger scale. Some more MDST examples...
- Scraping a quarter's worth of SEC filings to analyze insider trading trends for the SEC Insider Trading Project
- Scraping metadata for ~9000 movies from IMDb to build the dataset behind a Movie Recommender System
- Countless lightning talks
Since there is no one way to scrape websites, we won't have just one project that we work on the entire semester. Instead, we'll do a few mini projects (one completely self-guided) to get experience scraping many different kinds of websites. This should give us some appreciation for the work Google does to keep its crawlers running.
The culminating project is a unified app that scrapes information about all UofM professors from their websites (and cross-references it with relevant reviews from Atlas). One use case is surfacing the open research positions professors have while checking their teaching experience.
- Webscrape structured and unstructured data and learn good ways to display/visualize it
- Dive into a self-guided mini project that interests YOU (⚡ talk?)
- Create a "one-stop shop" that UofM students could use to search for research positions in areas they are interested in
- Have fun and learn something! 😃
We scrape our data!
Week of 1/29: Intro to Webscraping
- Kickoff!
- Introductions
- Familiarize ourselves with BeautifulSoup
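If you want a head start, here is a minimal BeautifulSoup sketch (the URL is just a placeholder, and it assumes the requests package is installed):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (example.com is a stand-in URL)
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Print the address of every link on the page
for link in soup.find_all("a"):
    print(link.get("href"))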
Week of 2/5: Scrape well-tabulated websites
- MLB website
- Tennis rankings
- Pretty much any competitive sport
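Since these sites serve their data as HTML tables, one quick approach is pandas' read_html, which parses every table on a page into a DataFrame. A rough sketch (the URL is a placeholder, and lxml or html5lib must be installed for the parsing):

import pandas as pd

# read_html returns one DataFrame per <table> found on the page
tables = pd.read_html("https://example.com/standings")
print(tables[0].head())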
Weeks of 2/12-3/12: Begin individual projects
- Sub-teams!
- Find something to scrape
- (At some point) Intro to Selenium (interactive webscraping)
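As a preview, a minimal Selenium sketch might look like this (it assumes Chrome and a matching driver are available; the URL and tag are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser that we can drive programmatically
driver = webdriver.Chrome()
driver.get("https://example.com")

# Interact with the page, e.g. click the first button on it
button = driver.find_element(By.TAG_NAME, "button")
button.click()

driver.quit()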
Week of 3/19: Wrap up individual projects
- Make visualizations of our data
Weeks of 3/26-4/16: Develop Michigan Web Crawler
- Plan out application design
- Flesh out basic API to interact with webpage
- Test it!
Week of 4/16: Finishing Touches
- Complete the write-up
- Prepare for final presentations!
Week of 4/23: Final Expo
- Show what we've been working on!
First, clone this repo (via SSH):
git clone git@github.com:MichiganDataScienceTeam/webscraping.git
You can choose whether or not to use a virtual environment for this project (though it is recommended). The setup guide below shows how to create a venv with Python's built-in venv module, but you can also use Conda if you prefer. The important thing is that you can run the commands found in the Good to go section.
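If you go the Conda route, something like the following should work (the environment name is arbitrary):

conda create --name webscraping python=3.8
conda activate webscraping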
We are going to initialize a Python virtual environment with all the required packages. We use a virtual environment here to isolate our development environment from the rest of your computer, which keeps the project setup contained and avoids leaving messes behind.
First, create a Python 3.8 virtual environment. The command for Linux/macOS is below:
python3 -m venv venv
Now that you have created a virtual environment, you need to activate it. The exact command depends on your system, but on Linux/macOS it is
source ./venv/bin/activate
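If you are on Windows instead, the activation script lives in a different folder (this is the cmd version; PowerShell uses venv\Scripts\Activate.ps1):

venv\Scripts\activate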
Now your computer will know to use the Python installation in the virtual environment rather than your default installation.
After the virtual environment has been activated, we can install the required dependencies into this environment using
pip install -r requirements.txt
If everything is set up correctly, you should be able to start a dev server and see the app for some intro webscraping by moving into the "flaskr" directory and then running the app:
cd flaskr
flask run
Open up the server to see if it works! (ctrl + click on http://127.0.0.1:5000)
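If flask run complains that it cannot locate the application, you can name the app explicitly from the repo root instead (this requires a recent Flask, and "flaskr" as the app name is an assumption based on the directory layout):

flask --app flaskr run  # 'flaskr' is an assumed app name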
Prerequisites: Intermediate Python, Pandas (enough that it won't impede progress)
Skills we'll pick up: HTML, CSS, BeautifulSoup, Selenium, RegEx