In 2013, I wrote an article called "Tweets Loud and Quiet" in which I argued that, although Twitter is sometimes characterized as a powerfully egalitarian, democratizing force, it actually resembles a traditional broadcast medium with a handful of very influential celebrities on one side and many millions of people with just a couple of followers on the other.
In order to write that article, I scraped metadata for about 400,000 Twitter accounts. Since then, it's been cited in a handful of scholarly papers, and I've gotten regular requests from researchers for access to my original dataset. I'm pleased to share my full dataset here.
This dataset was created between September 17, 2013, and October 19, 2013. Since I built it, Twitter has changed the way it assigns IDs, and a repeat of this project is now impossible (see below for why). Although this data is no longer a contemporary snapshot of Twitter usage patterns, it's still interesting as a way to understand social dynamics.
Until roughly 2015, Twitter gave every account a unique ID from a reasonably compact namespace. When I compiled this data in 2013, existing accounts had unique ID numbers that fell between roughly 0 and 1.9 billion.
I generated this dataset by polling Twitter's API for random account IDs within that range. It turns out that about 63% of integers between 0 and 1.9 billion were associated with accounts; Twitter's ID-assignment process skipped some numbers. Nevertheless, Twitter's ID assignment was strictly monotonic: given two accounts, the one with the higher ID number was always opened later than the other.
One other quirk: Twitter appears to have changed its account ID assignment algorithm around July 2012; before then, it assigned IDs at a density of 86%, and after then at 49%, skipping a large block of potential ID numbers in between.
Around 2015, Twitter switched from its quasi-sequential ID assignment scheme to one that resembles UUID. Now that Twitter accounts are assigned a 64-bit ID generated in part by hashing a timestamp, the namespace you'd need to search in order to repeat this project is 264 = 1.8 x 1019, or roughly as many possible IDs as there are insects on earth. I try to avoid writing things like this, but here I'm sure enough that it'd exceed some fundamental constraints of modern computing: it can't be done.
In this repo you'll find both a CSV file and an R dataset. They are identical.
Field | Remarks |
---|---|
account_id | Unique account ID |
handle | Twitter handle, e.g., @JonBruner |
name | Full account name, e.g., "Jon Bruner" |
description | Long account description, e.g., "Person who loves revelatory datasets that speak to eternal social characteristics." |
url | URL associated with account description |
language | ISO language code selected by user |
location | Free-text location provided by user |
account_created_at | Date and time of account creation |
account_created_at_interpolated | See below |
crawled_at | Date and time when account was scraped for inclusion in this dataset |
missing | 0: account exists 1: account never existed with this ID 2: account closed (by user, or by Twitter for abuse) |
protected | 0: not protected 1: protected |
followers_count | Number of Twitter accounts following this account |
following_count | Number of Twitter accounts that this account follows |
statuses_count | Number of public posts created by this account |
listed_count | Number of lists on which this account appears |
last_post_id | Unique ID of this account's most recent post, as of crawled_at |
last_post_text | Text of this account's most recent post |
last_post_lat | Latitude, if available, for this account's most recent post |
last_post_lon | Longitude, if available, for this account's most recent post |
last_post_place_id | Twitter place ID, if available, for this account's most recent post |
last_post_created_at | Date and time of this account's most recent post |
time_since_last_post | Convenience variable: the difference in hours between crawled_at and last_post_created_at |
For any account_id
that I polled, there could be one of three results:
- An account associated with the
account_id
exists - An account associated with the
account_id
existed at one time, but was closed, either voluntarily by the owner or (more commonly, I imagine) by Twitter for abuse - The
account_id
has never been assigned to an account
In case 1, I was able to retrieve an account_created_at
value. In case 3, the creation datetime is moot. In case 2, it might be interesting to know when the now-closed account was created, but Twitter's API doesn't return a creation datetime. So, I interpolated a creation datetime for each missing account on the assumption that Twitter's ID assignment scheme was roughly linear over small intervals. So, you'll see something like this:
account_id | account_created_at | account_created_at_interpolated |
---|---|---|
60453 | 2006-12-12 13:25:33 | 2006-12-12 13:25:33 |
63729 | - | 2006-12-13 11:39:21 |
72373 | 2006-12-15 22:18:44 | 2006-12-15 22:18:44 |
Twitter's developer terms now prohibit the use of its API "to monitor or measure the...usage statistics or results of Twitter Services," including for the compilation of "aggregate Twitter user metrics such as total number of active users, accounts, user engagements, or account engagements." However, they did not include this prohibition when I built the user dataset in 2013.
You may use this dataset for any purpose. If you do so, kindly cite Jon Bruner as the author of the data.