County-level word and topic loadings derived from a 10% Twitter sample from 2009-2015. Anonymized linguistic features extracted from over 1.5 billion English tweets mapped to U.S. counties.
Read the full publication here.
Available both as CSV files and as a MySQL dump. All tables are in sparse (long) format.
Approximately 24,000 most frequent unigrams. All URLs are replaced with <URL> and all @-mentions are replaced with <USER>.
- group_id: County FIPS code
- feat: Unigram
- value: Number of times the unigram was used by the county
- group_norm: Average number of times the feature was used by the county (value / number of users in county)
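Because the tables are in sparse (long) format, each row holds one (county, feature) pair, and pairs that do not appear are conventionally read as zero. The following is a minimal sketch of loading such a table with Python's standard library. It assumes the CSV header uses the four field names listed above; the sample rows and the helper names `load_sparse` and `lookup` are invented for illustration and are not part of the data release.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the real files are assumed to share this header.
SAMPLE_CSV = """group_id,feat,value,group_norm
01001,hello,12,0.0004
01001,<URL>,30,0.0011
36061,hello,250,0.0006
"""

def load_sparse(handle):
    """Return {county FIPS: {feature: group_norm}} from a long-format table."""
    table = defaultdict(dict)
    for row in csv.DictReader(handle):
        # Keep FIPS codes as strings so leading zeros survive.
        table[row["group_id"]][row["feat"]] = float(row["group_norm"])
    return dict(table)

def lookup(table, county, feat):
    """Pairs missing from a sparse table are treated as zero."""
    return table.get(county, {}).get(feat, 0.0)

counties = load_sparse(io.StringIO(SAMPLE_CSV))
print(lookup(counties, "01001", "hello"))
print(lookup(counties, "36061", "<URL>"))
```

To read a real file, pass an open file object (for example `open(path, newline="", encoding="utf-8")`) in place of the `io.StringIO` wrapper.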
Topic loadings per county using a set of 2000 topics derived via Latent Dirichlet Allocation (LDA) from over 14 million Facebook status updates (see full details on topic derivation). Topics, words per topic, and conditional probabilities are available here.
- group_id: County FIPS code
- feat: Topic id
- value: Number of times a word in the topic was used by the county
- group_norm: Relative frequency of topic use by county
Twitter data was processed using the following rules:
- Each tweet was mapped to a U.S. County using tweet level latitude / longitude information and user level profile free text (full details here).
- Filtered for English using langid.
- Users with fewer than 30 tweets were removed.
- Counties with fewer than 100 users were removed.
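The two thresholds in the rules above can be sketched as a small filter. This is an illustration only, not the project's pipeline: it assumes county mapping and language filtering have already happened, that the user filter is applied before the county filter (the order in which the rules are listed), and that tweets arrive as hypothetical `(user_id, county_fips)` pairs.

```python
from collections import Counter, defaultdict

MIN_TWEETS_PER_USER = 30    # threshold stated in the rules above
MIN_USERS_PER_COUNTY = 100  # threshold stated in the rules above

def filter_users_and_counties(tweets,
                              min_tweets=MIN_TWEETS_PER_USER,
                              min_users=MIN_USERS_PER_COUNTY):
    """tweets: iterable of (user_id, county_fips) pairs, one pair per tweet.

    Returns {county_fips: set of retained user_ids}.
    """
    tweet_counts = Counter(tweets)  # tweets per (user, county) pair
    users_by_county = defaultdict(set)
    for (user, county), n in tweet_counts.items():
        if n >= min_tweets:  # drop users with too few tweets
            users_by_county[county].add(user)
    # Drop counties that are left with too few users.
    return {c: u for c, u in users_by_county.items() if len(u) >= min_users}

# Toy data with lowered thresholds so the behavior is easy to check by hand.
toy = [("u1", "A")] * 3 + [("u2", "A")] * 2 + [("u3", "B")] * 3
kept = filter_users_and_counties(toy, min_tweets=3, min_users=1)
print(kept)
```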
Linguistic features were extracted as follows:
- Unigram relative frequencies are extracted for each user.
- User-level relative frequencies are averaged within each county.
- Topic loadings calculated using county level unigram relative frequencies.
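The steps above can be sketched in a few lines of Python. The first two functions follow the list directly. The third is an assumption: the text says only that topic loadings are calculated from county-level unigram relative frequencies, and this sketch takes the loading of a topic to be the sum over words of p(topic | word) times the county's relative frequency for that word. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def user_relative_freqs(tokens):
    """Relative frequency of each unigram in one user's tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def county_average(user_freqs):
    """Average per-user relative frequencies; an unused word counts as 0."""
    sums = defaultdict(float)
    for freqs in user_freqs:
        for w, f in freqs.items():
            sums[w] += f
    n = len(user_freqs)
    return {w: s / n for w, s in sums.items()}

def topic_loadings(county_freqs, topic_given_word):
    """Assumed weighting: loading(t) = sum over w of p(t | w) * freq(w)."""
    loadings = defaultdict(float)
    for w, f in county_freqs.items():
        for t, p in topic_given_word.get(w, {}).items():
            loadings[t] += p * f
    return dict(loadings)

# Toy county with two users and two hypothetical topics.
users = [["good", "good", "day"], ["day"]]
county = county_average([user_relative_freqs(u) for u in users])
p_topic_given_word = {"good": {"t1": 0.5, "t2": 0.5}, "day": {"t1": 1.0}}
loadings = topic_loadings(county, p_topic_given_word)
print(county)
print(loadings)
```

Averaging per-user frequencies, rather than pooling all tweets in a county, gives each user equal weight regardless of how often they post.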
Please cite the following paper if you use this data.
@inproceedings{giorgi2018remarkable,
title={The Remarkable Benefit of User-Level Aggregation for Lexical-based Population-Level Predictions},
author={Giorgi, Salvatore and Preotiuc-Pietro, Daniel and Buffone, Anneke and Rieman, Daniel and Ungar, Lyle H. and Schwartz, H. Andrew},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
year={2018}
}
Licensed under a GNU General Public License v3 (GPLv3).
Developed by the World Well-Being Project based out of the University of Pennsylvania and Stony Brook University.