Author: Jae Yeon Kim (jkim638@jhu.edu)
Paper: https://osf.io/preprints/socarxiv/dvm7r/ (accepted at Perspectives on Politics)
Session information
- Programming languages
  - R version 4.0.4 (2021-02-15)
  - Python 3.8.8
  - Bash 5.1.4(1)-release
- Operating system
  - Platform: x86_64-pc-linux-gnu (64-bit)
  - Running under: Ubuntu 21.04
Raw data: tweet_ids
The data source is the large-scale COVID-19 Twitter chatter dataset (v.15) created by Panacealab. The original dataset provides only tweet IDs, not tweets, in keeping with Twitter's developer terms. I turned these tweet IDs back into a JSON file of tweets using Twarc. This process, called hydrating, is very time-consuming. To ease it, I created an R package, called tidytweetjson, that efficiently parses this large JSON file into a tidyverse-ready data frame. To aid replication, I also saved the IDs of the hydrated tweets by running the following command in the terminal: grep "INFO archived" twarc.log | awk '{print $5}' > tweet_ids
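The hydrating and ID-extraction steps above can be sketched as follows. This is a minimal illustration, not the repository's 00_setup.sh: the sample log line is fabricated, but it mimics the "INFO archived" entries that twarc writes to twarc.log, and the commented twarc invocation assumes twarc has already been configured with Twitter API credentials.

```shell
# Hydrate tweet IDs back into full tweets (requires configured API keys):
#   twarc hydrate tweet_ids > tweets.jsonl

# Recover the archived tweet IDs from the twarc log.
# Hypothetical sample log line standing in for a real twarc.log:
printf '2021-03-01 12:00:00,000 INFO archived 1366823843276212000\n' > twarc.log

# Field 5 of each matching log line is the tweet ID.
grep "INFO archived" twarc.log | awk '{print $5}' > tweet_ids
cat tweet_ids  # -> 1366823843276212000
```

The grep/awk pipeline is the same one quoted above; it simply filters the log to successfully archived tweets and keeps the ID column.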
Replication code

- 00_setup.sh: Shell script for collecting tweets and their related metadata based on tweet IDs
- 01_google_trends.r: R script for collecting Google Trends search data
- 01_sample.Rmd: R Markdown file for sampling the Twitter data
- 02_parse.r: R script for parsing the Twitter data. It produces a cleaned and wrangled data frame named 'parsed.rds'. That file is not included in this repository, both to comply with Twitter's Developer Terms and because of its large size (1.4 GB).
- 03_explore.Rmd: R Markdown file for further wrangling and exploring the data. This file creates Figure 2 (overall_trend.png).
- 04_01_hashtags.R: R script for creating a word cloud of hashtags. This file creates Figure 1 (hash_cloud.png).
- 04_clean.ipynb: Python notebook for cleaning the tweet texts
- 05_topic_modeling.Rmd: R Markdown file for the topic modeling analysis. This file creates Figure 3 (dynamic_topic_day.png).