
This was my first NLP team project. We used Twint to scrape Twitter information and used this data to perform Sentiment/Emotion Analysis for the Presidential candidates in the last four elections. Other key resources such as TextBlob, NLTK, MongoDB and Python were also used to complete this analysis.


GR8505/US_Election_NLP


Twitter and US Election Results

by Dillon Hamilton, Kevin Connolly, Sebastian Lopez and Greg Bhola


A study using Natural Language Processing (NLP)


Executive Overview


Based on the results, there is no concrete evidence that Twitter is a good proxy for US election outcomes. This project used Sentiment/Emotion Analysis and Twitter Engagement to test whether this form of social media is effective in predicting US election results.

Sentiment Analysis

Watch the video of the dashboard.

For Power BI users, view the dashboard.


Sentiment Score => The Average Sentiment Score for each candidate, on a scale from -1 (most Negative Sentiment) through 0 (Neutral Sentiment) to 1 (most Positive Sentiment).

Sentiment Analysis => Looks at the percentage of tweets for each candidate that are Positive Sentiment, Negative Sentiment and Neutral Sentiment.

Emotion Analysis => Analyzes the words used in the tweets and links them to particular emotions. Refer to emotions.txt for the list of words and their linked emotions.
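
The Sentiment Score and Sentiment Analysis measures defined above can be sketched in a few lines. This is a minimal illustration rather than the project's notebook code: the function names `label_sentiment` and `summarize`, and the choice of 0 as the cut-off between classes, are assumptions. The polarity values are hard-coded here; in practice they could come from TextBlob's `TextBlob(text).sentiment.polarity`, which returns a float between -1 and 1.

```python
from collections import Counter
from statistics import mean


def label_sentiment(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a class (assumed cut-off: 0)."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


def summarize(polarities: list) -> dict:
    """Return the average score and the percentage of tweets per class."""
    counts = Counter(label_sentiment(p) for p in polarities)
    total = len(polarities)
    return {
        "average_score": mean(polarities),
        "percent": {
            label: 100 * counts[label] / total
            for label in ("Positive", "Negative", "Neutral")
        },
    }


# Hypothetical polarity scores for one candidate's tweets.
scores = [0.5, -0.25, 0.0, 0.75]
summary = summarize(scores)
print(summary)
```

Running `summarize` once per candidate and year would give the figures the dashboard compares.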


Twitter Engagement

This analysis gauges whether popular Twitter metrics such as Likes, Replies and Retweets are reliable indicators of who wins the race to the White House. The Tweet Length feature was created during the Transformation process.
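
As a rough sketch of how the engagement averages could be computed, the snippet below groups tweets by candidate and averages each metric. The rows are made-up sample data, the `candidate` key is a hypothetical label added for illustration, and `average_engagement` is not a function from the project; only the metric names match the features kept during preprocessing.

```python
from statistics import mean

METRICS = ("likes_count", "replies_count", "retweets_count", "tweet_length")

# Made-up rows; "candidate" is a hypothetical grouping key.
tweets = [
    {"candidate": "A", "likes_count": 10, "replies_count": 2,
     "retweets_count": 4, "tweet_length": 12},
    {"candidate": "A", "likes_count": 20, "replies_count": 4,
     "retweets_count": 6, "tweet_length": 18},
    {"candidate": "B", "likes_count": 30, "replies_count": 1,
     "retweets_count": 3, "tweet_length": 9},
]


def average_engagement(rows):
    """Return {candidate: {metric: average}} for the given tweet rows."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["candidate"], []).append(row)
    return {
        candidate: {m: mean(r[m] for r in group) for m in METRICS}
        for candidate, group in grouped.items()
    }


averages = average_engagement(tweets)
print(averages)
```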


Key Insights

  1. In every election, the winner recorded a higher Average Sentiment score than the opponent
  2. Winners registered a larger percentage of positive tweets than their rivals in every year except 2012
  3. Each winning candidate had a lower percentage of negative-sentiment tweets than the opponent
  4. Emotion Analysis showed no clear patterns
  5. Twitter Engagement results were inconclusive:
    • In 2008 and 2016 the losers recorded higher Average Likes scores, whereas in 2012 and 2020 the winners did
    • In all years, the winners registered higher Average Replies scores
    • Average Retweets scores were higher for the winners in every year except 2016
    • Average Tweet Length was higher for the winners in 2016 and 2020 but not in 2008 and 2012

Resources


  • Twint
  • Python
  • MongoDB
  • Power BI

Data Acquisition


  • Scraped Twitter using Twint via Anaconda Environment
  • Used the first and last names of each US Presidential candidate as the search keywords, e.g.
    twint -s "Joe Biden" --since "2020-10-15 17:00:00" --until "2020-10-16 17:00:00" --lang en -o biden3_2020.csv --csv
    twint -s "Donald Trump" --since "2020-10-15 17:00:00" --until "2020-10-16 17:00:00" --lang en -o trump3_2020.csv --csv
  • Scraped data on the day following each of the three mandatory Presidential debates for consistency
  • Scraped for tweets in the English language
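
The Twint commands above follow one pattern: a candidate name, a 24-hour window, English only, CSV output. The sketch below builds the same argument lists in Python so the window is computed rather than typed by hand. `build_twint_command` is a hypothetical helper, not part of Twint or of this project; it uses only the flags that appear in the example commands and prints the commands without running them.

```python
import shlex
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_twint_command(candidate, start, output_file):
    """Build the argument list for one 24-hour Twint search (hypothetical helper)."""
    until = start + timedelta(days=1)
    return [
        "twint", "-s", candidate,
        "--since", start.strftime(TIME_FORMAT),
        "--until", until.strftime(TIME_FORMAT),
        "--lang", "en",
        "-o", output_file,
        "--csv",
    ]


# Window taken from the example commands in this README.
start = datetime(2020, 10, 15, 17, 0, 0)
commands = [
    build_twint_command("Joe Biden", start, "biden3_2020.csv"),
    build_twint_command("Donald Trump", start, "trump3_2020.csv"),
]

for command in commands:
    # shlex.join quotes arguments that contain spaces for safe shell use.
    print(shlex.join(command))
```

Each list could be passed to `subprocess.run` in an environment where Twint is installed.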

Data Storage


  • Datasets were stored in a MongoDB database
  • Created a MongoDB connection using Python to call on each dataset.
    See the Jupyter notebook for more details.

Data Preprocessing


  • Kept the following features
    • tweet
    • likes_count
    • retweets_count
    • replies_count
  • Cleaned tweets using Regular Expressions (Regex)
  • Created the tweet_length feature (the number of words in each tweet) in the Jupyter notebook
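
A minimal sketch of the cleaning and tweet_length steps is shown below. The specific patterns (dropping URLs and @mentions, keeping only letters) are assumptions about what the cleaning might involve, not the expressions used in the project notebook. Computing the length on the cleaned text rather than the raw tweet is also an assumption.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, digits and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


def tweet_length(text):
    """Number of whitespace-separated words in the text."""
    return len(text.split())


raw = "Great debate tonight @JoeBiden! #Debates2020 https://t.co/abc123"
cleaned = clean_tweet(raw)
print(cleaned, tweet_length(cleaned))
```

Note that this letter-only filter also splits contractions (for example "don't" becomes "don t"), which slightly inflates word counts; a production pipeline would likely handle apostrophes separately.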

Machine Learning - Natural Language Processing (NLP)


  • Used the TextBlob library to perform Sentiment Analysis

    • Sentiment and Subjectivity scores were obtained for each tweet
    • Each tweet was ranked as Positive, Negative or Neutral Sentiment based on Sentiment scores
  • Obtained the keywords (minus stop words) to construct a word cloud (watch the video)

  • Used the NLTK library to perform Emotion Analysis
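
The Emotion Analysis step can be pictured as a lexicon lookup: drop stop words, then count the emotion linked to each remaining word. The sketch below is an illustration under stated assumptions. The tiny `EMOTION_LEXICON` and `STOP_WORDS` are made up, standing in for emotions.txt (whose file format is not shown here) and for NLTK's English stop word list (`nltk.corpus.stopwords.words("english")`).

```python
from collections import Counter

# Made-up stand-in for the word-to-emotion pairs in emotions.txt.
EMOTION_LEXICON = {
    "happy": "joy",
    "proud": "joy",
    "angry": "anger",
    "afraid": "fear",
}

# Made-up stand-in for a full stop word list.
STOP_WORDS = {"i", "am", "so", "and", "but", "of", "the", "a"}


def emotion_counts(tweets):
    """Count how often each emotion is triggered across cleaned tweets."""
    counts = Counter()
    for tweet in tweets:
        for word in tweet.lower().split():
            if word in STOP_WORDS:
                continue
            emotion = EMOTION_LEXICON.get(word)
            if emotion is not None:
                counts[emotion] += 1
    return counts


sample = ["i am so proud and happy", "angry but afraid of the result"]
counts = emotion_counts(sample)
print(counts.most_common())
```

Words missing from the lexicon, such as "result" here, are simply ignored, so the counts depend heavily on how complete the word list is.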

Limitations


  • NLP is not 100% accurate in measuring sentiment, as it is unable to read sarcasm or wit.
  • The locations of the tweets were not revealed, so we cannot tell whether American citizens made these comments.
  • Twitter came out in 2006, so there was a dearth of Twitter data in 2008 and 2012. Back then, it was not the social media monster it is today.



