W2022 MDST Project: Building a Movie Recommender System
- Introduction
- Description
- Goals
- Stretch Goals
- A Look at the Data
- Project Roadmap
- Setup
- Relevant Links
Introduction
What do you do when you want to watch a movie, but don't know what to watch? Like... really don't know what to watch? What drives your decision-making for watching a particular movie?
- Did a friend suggest it to you?
- Do you google watchlists?
- Maybe you take BuzzFeed quizzes for recommendations?
If you have ever felt unsatisfied with movie recommendation engines, or just want to learn more about how they work, this is the project for you!
Description
The goal of this project is to build a functional recommender system and learn how and why it recommends the movies it does. The two main kinds of recommender systems we plan to explore are content-based and collaborative filtering (more information can be found here). These systems will serve as the engines behind an online quiz (similar to BuzzFeed quizzes) that gives about 10 movie recommendations.
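As a rough illustration of how the two approaches differ, the sketch below applies the same cosine-similarity function to two different inputs: movie attributes (content-based) and user ratings (collaborative). The genre flags, ratings, and function names are made up for this example and are not part of the project code. Treating unrated entries as 0 is a simplification.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return pairwise cosine similarities between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = vectors / norms
    return unit @ unit.T


def most_similar(similarity: np.ndarray, item: int) -> int:
    """Index of the item most similar to `item`, excluding the item itself."""
    scores = similarity[item].copy()
    scores[item] = -np.inf
    return int(np.argmax(scores))


# Hypothetical toy data: 4 movies described by 3 genre flags
# (columns: action, comedy, romance).
genres = np.array([
    [1, 0, 0],  # movie 0: action
    [1, 1, 0],  # movie 1: action comedy
    [0, 1, 1],  # movie 2: romantic comedy
    [0, 0, 1],  # movie 3: romance
], dtype=float)

# Hypothetical toy ratings: rows are users, columns are the same 4 movies,
# and 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Content-based: compare movies by their own attributes.
content_sim = cosine_similarity_matrix(genres)

# Collaborative: compare movies by how users rated them (columns of `ratings`).
collab_sim = cosine_similarity_matrix(ratings.T)

print("content-based neighbour of movie 0:", most_similar(content_sim, 0))
print("collaborative neighbour of movie 0:", most_similar(collab_sim, 0))
```

The content-based similarity needs no ratings at all, while the collaborative one never looks at genres; this is the main practical difference between the two families.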
Here are some of the relevant data science buzzwords and jargon for this project!
- Regression
- (Un)supervised Learning
- K-Nearest Neighbors
- Matrix Factorization (Asymmetric SVD)
- Naive Bayes
- Recommender System
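Of the terms above, matrix factorization is perhaps the easiest to demystify with a few lines of code. The sketch below is a plain regularized matrix factorization trained with stochastic gradient descent, not the Asymmetric SVD variant named in the list. The toy ratings, hyperparameters, and function names are assumptions chosen for illustration.

```python
import numpy as np


def factorize(ratings, n_factors=2, n_epochs=500, lr=0.01, reg=0.02, seed=0):
    """Fit user and item factor matrices to the observed (non-zero) ratings."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
    item_factors = rng.normal(scale=0.1, size=(n_items, n_factors))
    observed = np.argwhere(ratings > 0)  # (user, item) index pairs

    for _ in range(n_epochs):
        rng.shuffle(observed)  # visit the ratings in a new order each epoch
        for u, i in observed:
            p_u = user_factors[u].copy()
            q_i = item_factors[i].copy()
            error = ratings[u, i] - p_u @ q_i
            # Gradient step on the squared error with L2 regularization.
            user_factors[u] += lr * (error * q_i - reg * p_u)
            item_factors[i] += lr * (error * p_u - reg * q_i)
    return user_factors, item_factors


def rmse(ratings, user_factors, item_factors):
    """Root-mean-squared error over the observed (non-zero) ratings only."""
    mask = ratings > 0
    predictions = user_factors @ item_factors.T
    return float(np.sqrt(np.mean((ratings[mask] - predictions[mask]) ** 2)))


# Hypothetical toy ratings: rows are users, columns are movies, 0 = not rated.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

user_factors, item_factors = factorize(ratings)
predicted = user_factors @ item_factors.T  # includes scores for unrated cells
print("training RMSE:", rmse(ratings, user_factors, item_factors))
```

The entries of `predicted` that correspond to zeros in `ratings` are the model's guesses for movies a user has not rated, which is what a recommender would rank.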
Goals
- Design a functional recommender system from scratch and gain insight into its mechanics
- Give MDST members the opportunity to work with recommender systems, which are widely used in industry
- Have a user interface (in the form of a website)
- Have fun and learn something! 😃
Stretch Goals
- Augment movie preferences with Bayesian conditional probability scores
- Test recommender systems on larger datasets
- Incorporate non-rating data alongside ratings to boost prediction performance
- Predict genre from movie preferences by analyzing latent factors
A Look at the Data
The data comes from the MovieLens dataset published by GroupLens. The dataset can also be obtained through TensorFlow. The main focus will be on the 100k dataset.
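For a first look at the ratings, a minimal sketch with pandas is shown below. It relies on the layout of the 100k ratings file (`u.data`): tab-separated columns holding user id, item id, rating, and timestamp. The column names, the `load_ratings` helper, and the in-memory sample rows are assumptions for illustration; in practice you would pass the path to `u.data` inside wherever the dataset was extracted.

```python
import io

import pandas as pd

# Column names are our own choice; the file itself has no header row.
RATING_COLUMNS = ["user_id", "item_id", "rating", "timestamp"]


def load_ratings(source) -> pd.DataFrame:
    """Read a tab-separated ratings file from a path or file-like object."""
    return pd.read_csv(source, sep="\t", names=RATING_COLUMNS, header=None)


# Illustrative rows in the same layout as u.data, so the example runs
# without the dataset being downloaded.
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "186\t302\t3\t891717742\n"
    "22\t377\t1\t878887116\n"
)

ratings = load_ratings(sample)
print(ratings.head())
print(ratings["rating"].value_counts().sort_index())
```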
Project Roadmap
Week of 1/30: Learn Our Data
- Kickoff!
- Introductions
- Exploratory Data Analysis
Week of 2/6: Methodology
- Data cleaning
- More EDA
- Basic modeling
- Introduction to algorithms (kNN, Matrix Factorization)
Weeks of 2/13-3/13: Build Models
- Sub-teams!
- In-depth analysis of algorithms
- Development of algorithm specification
- Building, training, and testing models
- Create visualizations
Weeks of 3/20-3/27: Refine Models
- Evaluate and run models
- Preliminary results
- Create visualizations
Weeks of 3/27-4/3: Develop Quiz Application
- Plan out application design
- Flesh out basic API to interact with webpage
- Test it!
Week of 4/10: Finishing Touches
- Complete the write-up
- Final Presentations!
Setup
Getting set up to contribute to this project takes only a few commands.
We will create a Python virtual environment containing all the required packages. A virtual environment isolates the project's dependencies from the rest of your computer, which keeps the setup self-contained and avoids conflicts with other projects.
First, create a Python 3.8 virtual environment. The command for Linux/macOS is below; it uses whichever interpreter python3 resolves to, so check that this is Python 3.8:
python3 -m venv venv
Now that you have created a virtual environment, you need to activate it. The exact command depends on your system; on Linux/macOS, use
source ./venv/bin/activate
Your shell will now use the Python installation in the virtual environment rather than your default installation.
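If you are unsure whether activation worked, one way to check from inside Python is to compare `sys.prefix` with `sys.base_prefix`; they differ when the interpreter belongs to a virtual environment created with `venv`. The helper name below is our own, not part of the project.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_environment())
print("interpreter:", sys.executable)
print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
```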
After the virtual environment has been activated, we can install the required dependencies into this environment using
pip install -r requirements.txt
If you also want the development dependencies, such as code formatters and Jupyter Notebook, run
pip install -r requirements-dev.txt
Getting the MovieLens dataset used in this project is also straightforward. With your virtual environment activated, run
python setup.py
That's all! You'll find the extracted dataset in the data folder. If you'd like more control over where the dataset is downloaded and extracted, use the --download and --extract options:
python setup.py --download <custom_filepath> --extract <custom_filepath> <custom_extraction_dir>
All download options can be viewed using
python setup.py --help
- M1 Mac users may have trouble installing SciPy through pip due to problems with support for BLAS (Basic Linear Algebra Subprograms). There are two options:
  - Remove seaborn==0.11.2 from the dependencies and use matplotlib for visualization instead
  - Manually install OpenBLAS and compile SciPy from source (not recommended; we likely cannot help you debug any issues with this)
Relevant Links
Dataset:
Resources: