- Unzip `emails.zip` to get a folder of all the Sarah Palin emails (thanks Sunlight Labs!).
- Run `1-produce-lda-file.rb` to produce a file in the format expected by the LDA algorithm (thanks Stanford!). Specifically, this collects all the emails and parses them into a single file ready to be used by `2-training.scala` and `3-infer.scala`.
- Run `2-training.scala` (using the `tmt-0.3.3.jar` file) to train the LDA model on Sarah Palin's emails, using the output of the previous step. `java -Xmx1024m -jar tmt-0.3.3.jar -Dscalanlp.distributed.hub=socket://42-149-58-18.rev.home.ne.jp:53686/hub -Dscalanlp.distributed.id=/tmt/8 edu.stanford.nlp.tmt.TMTMain "2-training.scala"` should work. This outputs a folder which contains, among other things, a trained LDA model.
- Run `3-infer.scala` (using the `tmt-0.3.3.jar` file) to perform topic inference. `java -Xmx1024m -jar tmt-0.3.3.jar -Dscalanlp.distributed.hub=socket://42-149-58-18.rev.home.ne.jp:53686/hub -Dscalanlp.distributed.id=/tmt/8 edu.stanford.nlp.tmt.TMTMain "3-infer.scala"` should work. Specifically, this takes the topic model learned in the previous step, applies it to the file produced by `1-produce-lda-file.rb`, and outputs a folder containing information on the topics and the topic distribution of each document.

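To give a feel for what the `1-produce-lda-file.rb` step does, here is a minimal Python sketch of collecting a folder of emails into one file. It is not the actual Ruby script: the `.txt` extension, the CSV layout (index, file name, text) and the function name `build_lda_input` are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def build_lda_input(email_dir, out_path):
    """Collect every .txt file under email_dir into one CSV (index, name, text).

    The column layout and the .txt extension are assumptions, not the
    format used by the real 1-produce-lda-file.rb script.
    """
    rows = []
    for i, path in enumerate(sorted(Path(email_dir).rglob("*.txt"))):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Collapse all whitespace so each email fits on a single CSV row.
        rows.append((i, path.name, " ".join(text.split())))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

Calling `build_lda_input("emails", "emails.csv")` would write one row per email and return the number of emails found; the folder and output names here are placeholders.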
Earlier this month, several thousand emails from Sarah Palin's time as governor of Alaska were released. The emails weren't organized in any fashion, though, so to make them easier to browse, I've been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.
It's still in the works, but I threw up a simple demo app (which I hope to improve on, once I find the time) to view the organized documents here.
Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.
For example, given the sentence "I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car", an LDA model might represent this sentence as 75% about music (a topic which, say, emits the word Bieber with 10% probability, Gaga with 5% probability, radio with 1% probability, and so on) and 25% about cars (a topic which might emit driving with 15% probability and cars with 10% probability).
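To make the mixture idea concrete, here is a small Python sketch using the made-up numbers from the example above. Under this kind of model, the probability of seeing a word in a document is the sum, over topics, of the document's weight on that topic times the topic's probability of emitting the word. The dictionaries and the function name `word_probability` are illustrative only, not part of any toolkit.

```python
# Illustrative numbers from the example sentence, not learned values.
doc_topics = {"music": 0.75, "cars": 0.25}
topic_words = {
    "music": {"bieber": 0.10, "gaga": 0.05, "radio": 0.01},
    "cars": {"driving": 0.15, "cars": 0.10},
}


def word_probability(word, doc_topics, topic_words):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )
```

With these numbers, `word_probability("bieber", doc_topics, topic_words)` comes out at about 0.075: the music topic contributes 0.75 × 0.10 and the cars topic contributes nothing.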
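If you're familiar with latent semantic analysis, you can think of LDA as a probabilistic, generative counterpart: instead of factoring a term-document matrix, it posits a process by which topics generate the words of each document.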
Here's a sample of the topics learnt by the model, as well as the top words for each topic. (Names, of course, are based on my own interpretation.)
- Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, denby, fishing, …
- Energy/Fuel/Oil/Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
- Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
- Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
- Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
- Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …
And here's a sample email from the wildlife topic:
I also thought the classification for this email was really neat: the LDA model labeled it as 10% in the Presidential Campaign/Elections topic and 90% in the Wildlife topic, and it's precisely a wildlife-based protest against Palin as a choice for VP:
In a future post, I'll perhaps see if we can glean any interesting patterns from the email topics. For example, for a quick graph now, if we look at the percentage of emails in the Trig/Family/Inspiration topic across time, we see that there's a spike in April 2008 -- exactly (and unsurprisingly) the month in which Trig was born.
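A topic-over-time figure like this can be computed in a few lines once each email has a date and a topic distribution. The sketch below is one possible approach, not necessarily how the graph was produced: it assumes each email is assigned to its single highest-weighted topic, and the function name `monthly_topic_share` and the input shape are hypothetical.

```python
from collections import defaultdict


def monthly_topic_share(emails, topic):
    """Fraction of each month's emails whose top topic is `topic`.

    emails: iterable of (month, distribution) pairs, where month is a
    'YYYY-MM' string and distribution maps topic name -> weight.
    Assigning an email to its highest-weighted topic is an assumption.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for month, dist in emails:
        totals[month] += 1
        if max(dist, key=dist.get) == topic:
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}
```

An alternative would be to average each email's weight on the topic rather than counting only dominant-topic emails, which is smoother but gives small weights a say in the total.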