This project was created to fulfill the final project requirement for ECE 143 WI'21 at UC San Diego. It uses the 'Fake News' Kaggle dataset (https://www.kaggle.com/c/fake-news/data) to visualize features of the data, including the most commonly used words, names, and country names in fake and real articles.
Authors: Akshay Gopalkrishnan, Bolun Liu, Pu Cheng, and Madison Wilson
- Python 3.7+
To recreate the environment that contains all the modules necessary to run the code, run 'conda env create -f environment.yml' in the terminal.
- Clone the 'ECE-143-Final-Project' repository with git
- Run 'conda env create -f environment.yml' in the terminal
- Run final_project.ipynb
All figures used in our final project presentation can be generated by running the final_project.ipynb file in the 'Code-and-Notebooks' folder. Supporting methods used within this file are described below.
Several custom methods were written to process, plot, and predict the fake news data. These files can be found in the 'Code-and-Notebooks' folder. Descriptions of each method are as follows:
geo.py: This file downloads and modifies a GeoPandas world file to count the number of mentions of each country in fake and real news articles. The returned value is a DataFrame containing country counts and geographic dimensions that can be used to plot a heat map of the world.
ml.py: This file contains all the methods necessary for the machine learning pipeline, including preprocessing text, creating and training the model, and graphing the training and validation accuracy. It also includes an interactive feature where the user can enter an article name to see whether it is real or fake.
most_common_names.py: This file extracts the name drops from the articles and builds bar charts that show how often each name is mentioned.
wordle.py: This file creates a wordle (word cloud) based on the article text from the real/fake news file.
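The counting step that geo.py and most_common_names.py rely on can be illustrated with a minimal sketch. This is not the code from those files: the function name count_mentions, the toy articles, and the short country list are all hypothetical, and the sketch uses only the Python standard library rather than GeoPandas.

```python
import re
from collections import Counter

def count_mentions(articles, terms):
    """Count total whole-word, case-insensitive mentions of each term.

    Hypothetical helper for illustration; not taken from geo.py.
    """
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    counts = Counter()
    for text in articles:
        for term, pattern in patterns.items():
            # findall returns every non-overlapping match in this article
            counts[term] += len(pattern.findall(text))
    return counts

# Toy data (assumed for illustration, not from the Kaggle dataset)
real_articles = [
    "Talks between France and Germany resumed.",
    "Germany's economy grew.",
]
fake_articles = ["Secret base found in France!"]
countries = ["France", "Germany", "Japan"]

real_counts = count_mentions(real_articles, countries)
fake_counts = count_mentions(fake_articles, countries)
print(real_counts.most_common())
print(fake_counts.most_common())
```

Counts like these, computed separately for real and fake articles, are the kind of values that could then be joined to a world map for a heat map or passed to a bar chart.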
All data files used for this project can be found in the 'Datasets' folder. test.csv, test_data_labels.csv, and train.csv were downloaded from Kaggle (https://www.kaggle.com/c/fake-news/data). The rest of the datasets were created by our group for this project. Descriptions of each dataset are as follows:
fake_name.csv: Stores all the name drops from fake articles.
new0.csv: Contains all the real news articles and their labels.
new1.csv: Contains all the fake news articles and their labels.
real_name.csv: Stores all the name drops from real articles.
test.csv: Contains all the real and fake news articles used for testing the machine learning model.
test_data_labels.csv: Contains the ground-truth labels for the test data, indicating whether each article is real or fake.
train.csv: Contains all the real and fake news articles used for training the machine learning model.
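As a rough illustration of how per-label files such as new0.csv and new1.csv might be derived from labeled articles, here is a minimal sketch. It assumes a column named 'label' where 0 marks real articles and 1 marks fake ones (inferred from the file names above); the helper split_by_label and the inline sample rows are hypothetical, and this is not necessarily how our files were produced.

```python
import csv
import io

def split_by_label(rows, label_column="label"):
    """Group rows (dicts) by the value in label_column.

    Hypothetical helper; the column name 'label' is an assumption.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[label_column], []).append(row)
    return groups

# Inline sample standing in for a labeled CSV file (assumed layout)
sample_csv = io.StringIO(
    "id,title,text,label\n"
    "0,Title A,Some reliable text,0\n"
    "1,Title B,Some unreliable text,1\n"
    "2,Title C,More reliable text,0\n"
)

rows = list(csv.DictReader(sample_csv))
groups = split_by_label(rows)
real_rows = groups.get("0", [])  # assumed: label 0 = real
fake_rows = groups.get("1", [])  # assumed: label 1 = fake
print(len(real_rows), len(fake_rows))
```

To work on an actual file, replace the io.StringIO sample with an opened file handle (for example, open with newline='' as the csv module recommends) and write each group out with csv.DictWriter.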