# stunning-eureka

University of Waterloo CS 486 F20 Team 35 Code Repository
Contributors: @lliepert, @AvyayAgarwal, @hrak109

This repository contains the code used to predict the 2020 US Presidential Election result via sentiment analysis on Reddit submissions. It was submitted by Team 35 as part of the University of Waterloo's CS 486 F20 project.

## Description

This repo has two main components: a scraper program and a Jupyter notebook containing the analysis and prediction. To reproduce the results, follow the instructions below to set up and run the scraper, then repeat the analysis and prediction using the notebook as a guide.
The scraper was designed to be run locally. The resulting .csv files were then loaded into Databricks, where the analysis and prediction took place. The included Jupyter notebook is a copy of the resulting notebook.
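
As a point of reference, loading one of the scraped .csv files into a Spark DataFrame inside Databricks can look like the sketch below; the file path is only a placeholder, not a name produced by the scraper.

```python
# Minimal sketch: load a scraped .csv into a Spark DataFrame for analysis.
# The path below is a placeholder; substitute the file from your scraper run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Databricks already provides `spark`

submissions = spark.read.csv(
    "/FileStore/data/politics_trump.csv",  # placeholder path
    header=True,
    inferSchema=True,
)
submissions.printSchema()
```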

## Scraper Setup and Usage

  1. Before running the scraper, you must first populate your own .env file. See .env.template for the required format, and see PRAW's documentation for how to generate the required credentials. At a high level, you must create a Reddit account and a corresponding application; these credentials are passed to the PRAW instance in order to query the Reddit API. (A minimal sketch of loading these credentials appears after this list.)

  2. Create the directory data/ in the same directory as scraper.py. All scraped data will be placed here. It is recommended to add this directory to your .gitignore, as the generated files are quite large and will likely exceed GitHub's recommended file size limit.

  3. To run the scraper, use the command

    python scraper.py --subreddit politics --query trump --start-date 2016-09-01 --end-date 2016-11-01
    

    with your desired parameters. The scraper will create a new directory in data/ tagged with the current time, and will save both .csv and .pkl copies of the data. Both incremental datasets and a final master dataset are generated. Run python scraper.py -h for more information.
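
For orientation, here is a minimal sketch of how the credentials from step 1 might be loaded and how historical submissions are typically queried through PSAW. It is not the project's scraper.py: the environment variable names are placeholders (see .env.template for the actual names), and the query shown simply mirrors the command-line parameters above.

```python
# Minimal sketch (not the actual scraper.py): load Reddit credentials from .env
# and query historical submissions via PSAW. Environment variable names are
# placeholders; the real names are defined in .env.template.
import datetime as dt
import os

import praw
from dotenv import load_dotenv
from psaw import PushshiftAPI

load_dotenv()  # reads the .env file in the current directory

reddit = praw.Reddit(
    client_id=os.getenv("REDDIT_CLIENT_ID"),          # placeholder variable name
    client_secret=os.getenv("REDDIT_CLIENT_SECRET"),  # placeholder variable name
    user_agent=os.getenv("REDDIT_USER_AGENT"),        # placeholder variable name
)
api = PushshiftAPI(reddit)  # with a Reddit instance, PSAW yields full PRAW submissions

start = int(dt.datetime(2016, 9, 1).timestamp())
end = int(dt.datetime(2016, 11, 1).timestamp())

# Roughly equivalent to: --subreddit politics --query trump --start-date 2016-09-01 --end-date 2016-11-01
for submission in api.search_submissions(
    after=start, before=end, subreddit="politics", q="trump", limit=10
):
    print(submission.title, submission.score)
```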

## Environment Requirements

Required Python Version: Python 3.8+

Required Python Packages:

# General Requirements
pandas
numpy

# Scraper
python-dotenv  # imported as `dotenv`
praw
psaw

# Analysis and Prediction
pyspark
matplotlib
vaderSentiment
textblob
scikit-learn   # imported as `sklearn`
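
For orientation on the analysis side, the snippet below shows how a piece of submission text can be scored with the two sentiment libraries listed above. The project's actual pipeline lives in the included notebook; this is only an illustrative sketch.

```python
# Illustrative sketch only: score a single piece of text with the two
# sentiment libraries listed above. The project's full pipeline is in the
# included Jupyter notebook.
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = "Example Reddit submission title about the election"

vader_scores = SentimentIntensityAnalyzer().polarity_scores(text)  # dict with neg/neu/pos/compound
textblob_polarity = TextBlob(text).sentiment.polarity              # float in [-1.0, 1.0]

print(vader_scores["compound"], textblob_polarity)
```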