Vowpal Wabbit LDA

arahuja edited this page Feb 19, 2014 · 2 revisions

Installing Vowpal Wabbit

Try running the commands below; you may run into issues if your dev environment is missing some Boost libraries.

If you already have brew installed, it can help to install the following libraries in advance.

brew install libtool
brew install automake
brew install boost

Then we can get the Vowpal Wabbit source and install it.

git clone git://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make install

Getting the data

Move the train.tsv file into your src/lesson15 directory:

cd GADS7/src/lesson15
wget https://www.dropbox.com/s/wuagevrmu2jzq2h/train.tsv

Transforming the data

We need to transform the raw text into something vw can understand.

python parse_to_vw.py train.tsv
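
VW expects each document on its own line as a bag of words after a `|`. The course's `parse_to_vw.py` does the real conversion; the function below is only an illustrative sketch of the target format, assuming a simple lowercase alphanumeric tokenizer.

```python
import re

def to_vw_line(text):
    """Convert raw text to an unlabeled VW example: '| tok tok ...'.

    VW LDA input is a bag of words after the '|'. Tokens must not
    contain ':', '|', or whitespace, which are special to VW; the
    regex below keeps only lowercase alphanumeric runs.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return "| " + " ".join(tokens)

print(to_vw_line("The quick brown fox, the lazy dog."))
# | the quick brown fox the lazy dog
```

Repeated tokens can also be collapsed into a `token:count` pair instead of being written out multiple times.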

Running LDA

Vowpal Wabbit has implementations of logistic and linear regression, but also a very fast online LDA implementation. The output is difficult to parse, but the model produces good topics very quickly.

# Note: --lda <K> enables LDA and sets the number of topics; 20 below is
# only an example value. --lda_alpha and --lda_rho are hyperparameters for
# the prior distributions, --lda_D is the total number of documents,
# -p writes topic-document distributions, and --readable_model writes
# topic-word distributions.
vw \
  -d data \
  --lda 20 \
  --lda_alpha 0.1 \
  --lda_rho 0.1 \
  --lda_D 1980686 \
  --minibatch 256 \
  --power_t 0.5 \
  --initial_t 1 \
  -b 16 \
  -p vw-predictions.dat \
  --readable_model vw-topics.dat
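
Each line of vw-predictions.dat carries the topic weights for one document, unnormalized. A minimal sketch of turning one line into topic proportions, assuming the line is nothing but whitespace-separated weights (check the exact output format of your VW version):

```python
def doc_topic_proportions(line):
    """Normalize one row of vw-predictions.dat into topic proportions.

    Assumes the line contains only whitespace-separated topic weights
    for a single document (an assumption; verify against your VW build).
    """
    weights = [float(w) for w in line.split()]
    total = sum(weights)
    return [w / total for w in weights]

print(doc_topic_proportions("2.0 6.0 2.0"))
# [0.2, 0.6, 0.2]
```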

Parsing the results

In vw-topics.dat we have the unnormalized topic-word distributions. Each row represents a word and each column a topic; to find the top words for a topic, we simply sort that topic's column by weight, highest first.
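
The column sort can be sketched as below. Note that the rows of the readable model are keyed by feature hash rather than by word, so this assumes you have already mapped hashes back to words (for example via VW's `--invert_hash` option); the `rows` data here is hypothetical.

```python
def top_words(rows, topic, n=10):
    """rows: list of (word, [weight per topic]) pairs.

    Returns the n words with the highest unnormalized weight
    in the given topic column.
    """
    ranked = sorted(rows, key=lambda r: r[1][topic], reverse=True)
    return [word for word, _ in ranked[:n]]

# Hypothetical word/weight rows for a 2-topic model.
rows = [("cat", [5.0, 0.1]), ("dog", [4.0, 0.2]), ("tax", [0.1, 7.0])]
print(top_words(rows, topic=1, n=2))
# ['tax', 'dog']
```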

One way to evaluate a topic model is to determine if the top words in a topic form a coherent collection.