Twitter-Credibility-Test-with-Big-Data

Background: This project examines whether Twitter can be considered a credible source of information that reflects the emergence of important trends or topics in education.

Data Source: Approximately 100 million tweets about education, extracted through the Twitter API from April 2022 to November 2022.

Tools and Platform: Performed on Google Cloud Platform with PySpark (packages: spark.sql and MLlib; the MLlib code will be added in version 2.0) and Python (packages: pandas, seaborn, numpy, scikit-learn, geopandas, etc.).

Structure

Data preparation and cleaning
EDA (Exploratory Data Analysis)
Influence Analysis for User Accounts
Location and Time-series Analysis
Text Duplication Analysis with Jaccard Distance & LSH test on text similarity

Preview

Filtering & Cleanup: We intentionally limit the filtered set to about 7.4 million tweets by selecting topics such as racial equality, literacy, tech & digital, special needs, school curriculum, and higher education, to see whether the tweets reflect trending real-world online discussion*. Only tweets in English (lang='en') are considered.
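The filtering step can be illustrated with a small plain-Python sketch. It is not the project's PySpark code: the keyword lists, the `match_topics` and `filter_tweets` helpers, and the dictionary layout of a tweet are all assumptions made for illustration, since the README does not list the actual filter terms.

```python
import re

# Hypothetical topic keywords; the real filter terms are not given in this README.
TOPIC_KEYWORDS = {
    "racial equality": ["racial equality", "racial equity"],
    "literacy": ["literacy"],
    "higher education": ["higher education", "college", "university"],
}


def match_topics(text):
    """Return the topics whose keywords appear as whole words in the text."""
    lowered = text.lower()
    return [
        topic
        for topic, words in TOPIC_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(w) + r"\b", lowered) for w in words)
    ]


def filter_tweets(tweets):
    """Keep English tweets that match at least one topic, tagging each with its topics."""
    kept = []
    for tweet in tweets:
        if tweet.get("lang") != "en":
            continue  # only English tweets are considered
        topics = match_topics(tweet.get("text", ""))
        if topics:
            kept.append({**tweet, "topics": topics})
    return kept


# Made-up sample records, used only to exercise the functions.
sample = [
    {"lang": "en", "text": "New literacy program launches in our district"},
    {"lang": "es", "text": "Programa de literacy en la escuela"},
    {"lang": "en", "text": "Traffic is terrible today"},
]
filtered = filter_tweets(sample)
```

On the sample above, only the first record survives: the second is not in English and the third matches no topic. At the scale described here, the same logic would more plausibly be expressed as a Spark DataFrame filter.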

Location Analysis: The geographic distribution of tweets within our data-collection period shows features that were very likely caused by the school shooting at Robb Elementary School in Texas on 05/24/2022, which triggered a large wave of online discontent about school gun control.

Time Analysis: One of our findings concerns time and spikes in tweet volume. Recall that we included certain heated topics while filtering the data and during EDA. There is an obvious spike in U.S. tweets in August 2022. The corresponding news is that the Biden Administration announced student debt relief of up to $10,000 per borrower. This was part of a series of news announcements in August, which could explain part of the spike in tweets.
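One simple way to surface such a spike is to count tweets per day and flag days that far exceed a typical day. The sketch below is an assumption about how this could be done, not the method used in the project; the `find_spikes` helper, the median baseline, and the factor of 3 are illustrative choices, and the dates and counts are made up.

```python
from collections import Counter
from datetime import date
from statistics import median


def daily_counts(dates):
    """Count how many tweets fall on each calendar day."""
    return Counter(dates)


def find_spikes(counts, factor=3.0):
    """Return the days whose volume exceeds `factor` times the median daily volume."""
    baseline = median(counts.values())
    return sorted(day for day, n in counts.items() if n > factor * baseline)


# Toy data: 10 tweets per day for a week, plus 50 extra tweets on one day.
dates = [date(2022, 8, d) for d in range(10, 17) for _ in range(10)]
dates += [date(2022, 8, 13)] * 50

counts = daily_counts(dates)
spikes = find_spikes(counts)
```

With this toy data the median daily volume is 10, so the threshold is 30 and only the day with 60 tweets is flagged. A median baseline is used because a mean would be pulled upward by the spike itself.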

LSH Similarity: Given the limit on tweet length and our selection of tweets during data cleaning, this is a text similarity test rather than topic modeling (LDA/LSA). All organization types had a high percentage of unique tweets, which indicates that they are posting original content. News & Media had the lowest percentage of unique content, which indicates that they might be sharing similar education information. Schools and NGOs also had lower percentages of unique content.
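The idea behind the similarity test can be sketched in plain Python: Jaccard distance compares the sets of word shingles of two tweets, and MinHash with banding (a common form of LSH) finds likely near-duplicates without comparing every pair. This is a minimal illustration under assumptions, not the project's implementation; the helper names, the 2-word shingles, the 32 hash functions, and the 8 bands are all illustrative choices. Spark MLlib offers a `MinHashLSH` estimator for the same purpose at scale, though the README does not state which implementation was used.

```python
import hashlib


def shingles(text, k=2):
    """Split text into a set of k-word shingles (lowercased)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical sets, 1.0 means disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def minhash_signature(items, num_hashes=32):
    """For each seeded hash function, keep the minimum hash value over the set."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in items)
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(signatures, bands=8):
    """Bucket signatures band by band; items sharing any bucket become candidates."""
    rows = len(signatures[0]) // bands
    buckets = {}
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


# Made-up tweets: the first two are exact duplicates, the third is unrelated.
tweets = [
    "School funding debate continues in Texas",
    "School funding debate continues in Texas",
    "Literacy rates improved across rural districts",
]
sets = [shingles(t) for t in tweets]
signatures = [minhash_signature(s) for s in sets]
candidates = lsh_candidate_pairs(signatures)
```

Candidate pairs returned by the banding step would normally be checked with `jaccard_distance` against a chosen threshold; the share of tweets with no near-duplicate then gives a uniqueness percentage per account type.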

hjiangAnthony/Twitter-Credibility-Test-with-Big-Data
