University of Waterloo CS 486 F20 Team 35 Code Repository
Contributors: @lliepert, @AvyayAgarwal, @hrak109
This repository contains the code used to predict the 2020 US Presidential Election result via sentiment analysis on Reddit submissions. It was submitted by Team 35 as part of the University of Waterloo's CS 486 F20 project.
This repo has two main components: a scraper program, and a Jupyter notebook containing the analysis and prediction. To reproduce, follow the included instructions to set up and run the scraper, then repeat the analysis and prediction using the notebook as a guide.
The scraping program was designed to be run locally. The resulting .csv files were then loaded into Databricks, where the analysis and prediction took place. The included Jupyter notebook is a copy of the resulting notebook.
- Before running the scraper, populate your own .env file. See .env.template for the required format, and see PRAW's documentation for how to generate the required values. At a high level, you must create a Reddit account and a corresponding application; its credentials are passed to the PRAW instance used to query the Reddit API.
- Before running the scraper, create the directory data/ in the same directory as scraper.py. All scraped data will be placed here. It is recommended to add this directory to your .gitignore, as the generated files are quite large and will likely exceed GitHub's recommended file size.
- To run the scraper, run the command
  python scraper.py --subreddit politics --query trump --start-date 2016-09-01 --end-date 2016-11-01
  with your desired parameters. The scraper will create a new directory in data/ tagged with the current time, and will save both .csv and .pkl copies of the data. Both incremental datasets and a final master dataset are generated. Run python scraper.py -h for more information.
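The setup steps above can be sketched in a few lines of standard-library Python. This is an illustration only, not the code in scraper.py: the environment variable names are hypothetical (the real ones are listed in .env.template), the timestamp format is an assumption, and whether the real scraper treats every flag as required is not stated here. The flag names are taken from the example command.

```python
import argparse
import os
from datetime import date, datetime
from pathlib import Path

# Hypothetical names: check .env.template for the ones the scraper really reads.
REQUIRED_ENV_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def missing_env_vars(env=os.environ):
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


def build_parser():
    """Build a parser for the four flags shown in the example command."""
    parser = argparse.ArgumentParser(description="Sketch of the scraper CLI.")
    parser.add_argument("--subreddit", required=True)
    parser.add_argument("--query", required=True)
    # Dates are parsed from ISO strings such as 2016-09-01.
    parser.add_argument("--start-date", type=date.fromisoformat, required=True)
    parser.add_argument("--end-date", type=date.fromisoformat, required=True)
    return parser


def run_dir_name(now):
    """Format a timestamp as a directory name (the format is an assumption)."""
    return now.strftime("%Y-%m-%d_%H-%M-%S")


def make_run_dir(base="data", now=None):
    """Create and return a timestamped directory under base."""
    run_dir = Path(base) / run_dir_name(now or datetime.now())
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

A caller would typically check that missing_env_vars() returns an empty list before querying the Reddit API, then pass the parsed arguments to the scraping logic and write its output under make_run_dir().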
Required Python Version: Python 3.8+
Required Python Packages:
# General Requirements
pandas
numpy
# Scraper
python-dotenv
praw
psaw
# Analysis and Prediction
pyspark
matplotlib
vaderSentiment
textblob
scikit-learn
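Before running the scraper or the notebook, it can help to confirm that the interpreter and the packages above are available. The following is a minimal sketch, not part of the repository. The module names are the import names, which differ from the install names for a few packages (for example, python-dotenv is imported as dotenv and scikit-learn as sklearn).

```python
import sys
from importlib.util import find_spec

# Import names for the packages listed above.
REQUIRED_MODULES = [
    "pandas", "numpy",                      # general
    "dotenv", "praw", "psaw",               # scraper
    "pyspark", "matplotlib",                # analysis and prediction
    "vaderSentiment", "textblob", "sklearn",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that cannot be found for import."""
    return [name for name in modules if find_spec(name) is None]


def python_is_supported(version=sys.version_info):
    """Check the interpreter against the stated Python 3.8+ requirement."""
    return tuple(version[:2]) >= (3, 8)


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.8 or newer is required.")
    for name in missing_modules():
        print(f"Missing module: {name}")
```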