Winter 2023 MDST Webscraping Project
Have you ever felt like stalking someone's 📱insta485gram📱, but were too embarrassed to do it while logged into your account? Maybe you wanted to pull quotes from someone's favorite 🎥movie🎥 so you could serenade them with a random quote every day! Now imagine these kinds of tasks, but on a much larger scale. Some more MDST examples...
- Scraping a quarter's worth of SEC filings to analyze insider trading trends for the SEC Insider Trading Project
- Scraping metadata for ~9000 movies from IMDb to build the dataset behind a Movie Recommender System
- Countless lightning talks
Since there is no one way to scrape websites, we won't have just one project that we work on the entire semester. Instead, we'll do a few mini projects (one completely self-guided) to get experience scraping many different kinds of websites. This should give us some appreciation for the work Google does to keep its crawlers running.
The culminating project is a unified app that scrapes information about all UofM professors from their websites (and cross-references it with relevant reviews from Atlas). One use case is surfacing the open research positions professors have while checking their teaching experience.
- Webscrape structured and unstructured data and learn good ways to display/visualize it
- Dive into a self-guided mini project that interests YOU (⚡ talk?)
- Create a "one-stop shop" that UofM students could use to search for research positions in areas they are interested in
- Have fun and learn something! 😃
We scrape our data!
Week of 1/29: Intro to Webscraping
- Kickoff!
- Introductions
- Familiarize ourselves with BeautifulSoup
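If you want a head start, here is a minimal BeautifulSoup sketch (the URL is just a placeholder, and it assumes the requests package is installed):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (example.com is a stand-in URL)
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Print the address of every link on the page
for link in soup.find_all("a"):
    print(link.get("href"))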
Week of 2/5: Scrape well-tabulated websites
- MLB website
- Tennis rankings
- Pretty much any competitive sport
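Since these sites serve their data as HTML tables, one quick approach is pandas' read_html, which parses every table on a page into a DataFrame. A rough sketch (the URL is a placeholder, and lxml or html5lib must be installed for the parsing):

import pandas as pd

# read_html returns one DataFrame per <table> found on the page
tables = pd.read_html("https://example.com/standings")
print(tables[0].head())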
Weeks of 2/12-3/12: Begin individual projects
- Sub-teams!
- Find something to scrape
- (At some point) Intro to Selenium (interactive webscraping)
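As a preview, a minimal Selenium sketch might look like this (it assumes Chrome and a matching driver are available; the URL and tag are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser that we can drive programmatically
driver = webdriver.Chrome()
driver.get("https://example.com")

# Interact with the page, e.g. click the first button on it
button = driver.find_element(By.TAG_NAME, "button")
button.click()

driver.quit()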
Week of 3/19: Wrap up individual projects
- Make visualizations of our data
Weeks of 3/26-4/16: Develop Michigan Web Crawler
- Plan out application design
- Flesh out basic API to interact with webpage
- Test it!
Week of 4/16: Finishing Touches
- Complete the write-up
- Prepare for final presentations!
Week of 4/23: Final Expo
- Show what we've been working on!
First, clone this repo (via SSH):
git clone git@github.com:MichiganDataScienceTeam/webscraping.git
You can choose whether or not to use a virtual environment for this project (though it is recommended). The setup guide below shows how to create a venv with Python's built-in venv module, but you can also use Conda if you prefer. The important thing is that you can run the commands found in the Good to go section.
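If you go the Conda route, something like the following should work (the environment name is arbitrary):

conda create --name webscraping python=3.8
conda activate webscraping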
We are going to initialize a Python virtual environment with all the required packages. We use a virtual environment here to isolate our development environment from the rest of your computer, which keeps the project setup contained and avoids leaving messes behind.
First, create a Python 3.8 virtual environment. The command for Linux/macOS is below:
python3 -m venv venv
Now that you have created a virtual environment, you need to activate it. The exact command depends on your system, but on Linux/macOS it is
source ./venv/bin/activate
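If you are on Windows instead, the activation script lives in a different folder (this is the cmd version; PowerShell uses venv\Scripts\Activate.ps1):

venv\Scripts\activate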
Now your computer will know to use the Python installation in the virtual environment rather than your default installation.
After the virtual environment has been activated, we can install the required dependencies into this environment using
pip install -r requirements.txt
If everything is set up correctly, you should be able to start a dev server and see the app for some intro webscraping by moving into the "flaskr" directory and then running the app:
cd flaskr
flask run
Open up the server to see if it works! (ctrl + click on http://127.0.0.1:5000)
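If flask run complains that it cannot locate the application, you can name the app explicitly from the repo root instead (this requires a recent Flask, and "flaskr" as the app name is an assumption based on the directory layout):

flask --app flaskr run  # 'flaskr' is an assumed app name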
Prerequisites: Intermediate Python, Pandas (enough that it won't impede progress)
Skills we'll pick up: HTML, CSS, BeautifulSoup, Selenium, RegEx