Text Mining South Park

South Park follows four fourth-grade boys (Stan, Kyle, Cartman and Kenny) and an extensive ensemble cast of recurring characters. This analysis reviews their speech to determine which words and phrases are distinctive for each character. Since the series relies heavily on running gags, recurring phrases should be easy to find.

The programming language R and the packages tm, RWeka and stringr were used to read South Park episode transcripts from a repository, attribute the lines to the characters who speak them, break the text into ngrams, calculate the log likelihood for each ngram/character pair, and rank the pairs to create a list of the most characteristic words and phrases for each character. The results were visualized using ggplot2, wordcloud and RColorBrewer.
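The attribute-and-split step described above can be sketched in a few lines. This is an illustration in Python rather than the R code used for the analysis. The function names, the tokenization rule, the `max_n=3` limit and the sample lines are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams_by_speaker(lines, max_n=3):
    """Map each speaker to a Counter of that speaker's 1- to max_n-grams.

    `lines` is an iterable of (speaker, text) pairs. ngrams are built per
    line, so they never span two different lines of dialogue.
    """
    counts = defaultdict(Counter)
    for speaker, text in lines:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            counts[speaker].update(ngrams(tokens, n))
    return counts

# Hypothetical sample lines for illustration, not rows from the real data set.
sample_lines = [
    ("Stan", "Oh my God, they killed Kenny!"),
    ("Kyle", "You bastards!"),
    ("Stan", "They killed Kenny again."),
]

counts = count_ngrams_by_speaker(sample_lines)
print(counts["Stan"].most_common(5))
```

The per-speaker counters produced here are the raw input a scoring step needs: how often an ngram occurs in one character's speech and how often it occurs everywhere else.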

Data

Complete transcripts (70,000 lines amounting to 5.5 MB) were downloaded from BobAdamsEE's GitHub repository SouthParkData, whose original source is the South Park Wikia.

Log Likelihood

Each character's corpus was analyzed to determine the most characteristic words for that speaker. Frequent words and characteristic words are not the same thing; otherwise, words like "I", "school", and "you" would rise to the top instead of unique words and phrases like "professor chaos", "hippies" and "you killed kenny."

Log likelihood was used to measure the uniqueness of the ngrams by character. Log likelihood compares the occurrence of a word in a particular corpus (the body of a character's speech) to its occurrence in another corpus (all of the remaining South Park text) to determine whether it shows up more or less often than expected. The returned value is a test statistic, similar in spirit to a t-test: the larger it is, the less plausible it is that the two corpora use the word at the same underlying rate, as they would if both were drawn from the same, larger corpus.
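To make the comparison concrete, here is a minimal Python sketch of one common formulation of the log-likelihood statistic from corpus linguistics. It is not taken from the analysis itself, and the full report may use a different variant. The function name, the argument names and the convention of returning a negative score for underused ngrams are assumptions made for this example.

```python
import math

def log_likelihood(a, b, c, d):
    """Signed log-likelihood score for one ngram.

    a: occurrences of the ngram in the character's corpus
    b: occurrences of the ngram in the remaining text
    c: total number of ngrams in the character's corpus
    d: total number of ngrams in the remaining text
    """
    # Expected counts if both corpora used the ngram at the same rate.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    ll = 0.0
    # A zero observed count contributes nothing (x * ln x tends to 0 as x -> 0).
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    ll *= 2

    # Assumed sign convention: positive when the character uses the ngram
    # relatively more often than everyone else, negative when less often.
    return ll if a * d >= b * c else -ll

# Hypothetical counts for illustration, not figures from the transcripts.
overused = log_likelihood(30, 5, 1000, 9000)     # much more common for the character
neutral = log_likelihood(10, 10, 1000, 1000)     # same rate in both corpora
underused = log_likelihood(1, 50, 1000, 1000)    # much rarer for the character
print(overused, neutral, underused)
```

Ranking a character's ngrams by this score in descending order puts the overused ones at the top. A very frequent word such as "you" scores near zero whenever every character uses it at about the same rate.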

Read the full report

