This repository contains documentation and scripts for the Reddit Politosphere, a pseudonymized, large-scale text and network resource of online political discourse based on the Pushshift Reddit Dataset.
The Reddit Politosphere covers 605 political subreddits between 2008 and 2019. For each year, it contains:
- all comments posted in the political subreddits together with metadata such as creation time
- networks with the political subreddits as nodes and edges computed on the basis of user overlap
We also release metadata for subreddits and users.
We provide scripts for easy data access:
- `load_comments.py`, `load_comments.sh`: load comments for specific years and subreddits
- `load_networks.py`: load networks for specific years
The comment files `comments_YYYY-MM.bz2` contain all comments posted in the
political subreddits between 2008 and 2019. The data fields are identical to those of the
Pushshift Reddit Dataset, except that author names are replaced with random five-character pseudonyms. We add the following two data fields:
- `body_cleaned`: a tokenized, lower-cased, and cleaned version of the comment body
- `language`: the language of the comment as detected by CLD2
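
As a rough illustration of reading one monthly file directly, the sketch below streams comments and optionally filters them. It assumes each file holds one JSON object per line (the usual layout of Pushshift dumps) and that `language` stores short codes such as `en`; neither is confirmed here, and the function name `iter_comments` is hypothetical. The provided `load_comments.py` script remains the intended entry point.

```python
import bz2
import json
from typing import Iterator, Optional, Set


def iter_comments(path: str,
                  subreddits: Optional[Set[str]] = None,
                  language: Optional[str] = None) -> Iterator[dict]:
    """Yield comments from one monthly file, optionally filtered.

    Assumption: the file is bz2-compressed, newline-delimited JSON.
    """
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            comment = json.loads(line)
            # `subreddit` is a standard Pushshift field
            if subreddits is not None and comment.get("subreddit") not in subreddits:
                continue
            # `language` is one of the two fields added by this resource
            if language is not None and comment.get("language") != language:
                continue
            yield comment
```

A call such as `iter_comments("comments_2016-11.bz2", subreddits={"politics"})` would then yield one dictionary per matching comment without loading the whole month into memory.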
The network files `networks_YYYY.csv` contain the weighted and unweighted
networks between 2008 and 2019. The weighted networks
have edge weights corresponding to the number of users that posted at least 10 comments
in both subreddits, excluding bots and automoderators. The unweighted networks
are created by applying statistical network backboning,
specifically the noise-corrected filter, to the
weighted networks. Intuitively, a large weight between
two large subreddits is less indicative of latent associations between the subreddits
than a large weight between two small subreddits.
The noise-corrected filter takes such effects into account when
converting the weighted into an unweighted network.
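
To make this intuition concrete, the toy calculation below compares an observed weight with the weight one would expect if users were shared purely in proportion to subreddit size. It illustrates the idea only and is not the noise-corrected filter itself, which also accounts for the uncertainty of each edge. The function name `lift`, the simplified null model, and all numbers are invented for the example.

```python
def lift(weight: float, strength_u: float, strength_v: float,
         total_strength: float) -> float:
    """Ratio of the observed edge weight to a size-proportional expectation.

    strength_u, strength_v: summed edge weights of the two subreddits.
    total_strength: summed strength over all subreddits (toy normalizer).
    """
    expected = strength_u * strength_v / total_strength
    return weight / expected


# Two large subreddits sharing many users (made-up numbers)
big_pair = lift(weight=500, strength_u=4000, strength_v=3000, total_strength=10000)

# Two small subreddits sharing fewer users in absolute terms
small_pair = lift(weight=50, strength_u=100, strength_v=120, total_strength=10000)

# The smaller raw weight is far more surprising relative to its expectation
print(small_pair > big_pair)  # prints True
```

Under these made-up numbers the edge between the two small subreddits exceeds its expectation by a much larger factor, even though its raw weight is ten times smaller.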
The network files have the following data fields:
- `node_1`, `node_2`: nodes incident to the undirected edge
- `weighted`: edge weight in the weighted network
- `unweighted`: whether or not the edge exists in the unweighted network
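
The fields above map naturally onto an adjacency structure. The following sketch reads one yearly file with the standard library only. It assumes a comma-separated file with a header row naming the four fields, and it assumes that `unweighted` is encoded as something like `1`/`0` or `True`/`False`; check the actual files (or use `load_networks.py`) before relying on it. The function name `load_network` is hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict

# Assumed encodings of a "true" value in the `unweighted` column
TRUTHY = {"1", "1.0", "true", "yes"}


def load_network(path: str, backbone_only: bool = False) -> Dict[str, Dict[str, float]]:
    """Return {subreddit: {neighbor: weight}} for one yearly network file.

    With backbone_only=True, keep only edges present in the unweighted network.
    """
    adjacency: Dict[str, Dict[str, float]] = defaultdict(dict)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            in_backbone = row["unweighted"].strip().lower() in TRUTHY
            if backbone_only and not in_backbone:
                continue
            u, v = row["node_1"], row["node_2"]
            weight = float(row["weighted"])
            # Edges are undirected, so store both directions
            adjacency[u][v] = weight
            adjacency[v][u] = weight
    return dict(adjacency)
```

Storing both directions keeps neighbor lookups symmetric, at the cost of holding each edge twice.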
The subreddit metadata file `subreddits_metadata.json` lists selected properties of the
political subreddits. Specifically, it has the following data fields:
- `subreddit`: name of subreddit
- `banned`: whether or not subreddit has been banned by 2022
- `gun`: subreddit with focus on gun control
- `party`: explicit affiliation with Democratic (`dem`) or Republican (`rep`) party
- `politician`: subreddit devoted to a politician
- `region`: Canada (`ca`), Europe (`eu`), Middle East (`me`), UK (`uk`), US states (`us`), or other regions (`world`)
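
A minimal sketch of selecting subreddits by these properties follows. The exact layout of the JSON file is not stated here, so the loader accepts either a single JSON array or one JSON object per line; both `load_records` and `select` are hypothetical helper names, and how missing values are represented is left open.

```python
import json
from typing import Any, Dict, List


def load_records(path: str) -> List[Dict[str, Any]]:
    """Load metadata records from a JSON array or a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to one JSON object per line
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def select(records: List[Dict[str, Any]], **criteria: Any) -> List[Dict[str, Any]]:
    """Keep records whose fields equal all given criteria."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]
```

For instance, `select(load_records("subreddits_metadata.json"), party="dem", region="us")` would return the records carrying both of those codes, if any.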
The user metadata file `users_metadata.json` lists selected properties of the
users (who otherwise are fully pseudonymized). Specifically, it has the following data fields:
- `author`: pseudonymized username
- `automoderator`: whether or not user is automoderator (for filtering)
- `bot`: whether or not user is bot (for filtering)
- `gender`: username containing male (`m`) or female (`f`) given name
It further provides information about the presence of frequent classes of lexical elements in the usernames:
- `angry`: negative attitude (angry, rogue, troll, wtf)
- `anti`: overt negation (anti, downvote, fuck, stop)
- `astro`: astro theme (astro, cosm, rocket, space)
- `dangerous`: dangerous animal (beast, gorilla, shark, tiger, wolf)
- `doom`: doom theme (dead, death, doom, evil, zombie)
- `military`: military title (c(a)pt, colonel, commander, major, sgt)
- `nobility`: title of nobility (duke, emperor, king, lord, sir)
- `trump`: reference to Donald Trump (trump)
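
Since the `bot` and `automoderator` fields are intended for filtering, one possible use is sketched below: collect the flagged pseudonyms and drop their comments. The sketch takes already-parsed records (lists of dictionaries) as input and assumes that flags are stored either as booleans or as strings like `"1"` or `"true"`; the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def is_flagged(value: Any) -> bool:
    """Interpret a flag stored as a bool, number, or string (assumed encodings)."""
    if isinstance(value, str):
        return value.strip().lower() in {"1", "true", "yes"}
    return bool(value)


def excluded_authors(user_records: Iterable[Dict[str, Any]]) -> Set[str]:
    """Pseudonyms of users marked as bot or automoderator."""
    return {r["author"] for r in user_records
            if is_flagged(r.get("bot")) or is_flagged(r.get("automoderator"))}


def drop_automated(comments: Iterable[Dict[str, Any]],
                   excluded: Set[str]) -> Iterator[Dict[str, Any]]:
    """Yield only comments whose author is not in the excluded set."""
    return (c for c in comments if c.get("author") not in excluded)
```

Building the excluded set once and reusing it keeps the per-comment check to a single set lookup.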
Please cite the following paper when using data from the Reddit Politosphere:
@inproceedings{hofmann2022politosphere,
title = {The {R}eddit {P}olitosphere: A Large-Scale Text and Network Resource of Online Political Discourse},
author = {Hofmann, Valentin and Sch{\"u}tze, Hinrich and Pierrehumbert, Janet},
booktitle = {Proceedings of the International AAAI Conference on Web and Social Media 16},
year = {2022}
}