| Text source | Description |
| --- | --- |
| "Alice in Wonderland" | Alice in Wonderland (Ch. 1) |
| "Romeo and Juliet" | Romeo and Juliet |
| "Bhagavad Gita" | Bhagavad Gita |
| "Memento screenplay" | Memento screenplay |
| "100K tweets" | 100,000 tweets from the Sentiment140 dataset (training data) |
| "20K tweets" | 20,000 tweets from Gender Classifier Data |
| "MASC tweets" | MASC tweets (cleaned of HTML markup) |
| "MASC spoken" | MASC spoken transcripts (phone and face-to-face: 25,783 words) |
| "COCA blogs" | Corpus of Contemporary American English blog samples |
| "Google website" | Google homepage (accessed 10/20/2020) |
| "Software languages" | "Tower of Hanoi" (programming languages A-Z from Rosetta Code) |
| "Monkey text" | Ian Douglas's English-generated monkey0-7.txt corpus |
| "Coder text" | Ian Douglas's software-generated coder0-7.txt corpus |
| "iweb cleaned corpus" | First 150,000 lines of Shai Coleman's iweb-corpus-samples-cleaned.txt |

Reference for Monkey and Coder texts: Douglas, Ian. (2021, March 28). Keyboard Layout Analysis: Creating the Corpus, Bigram Chains, and Shakespeare's Monkeys (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.4642460