One subset of unsupervised learning tasks is topic extraction, where the aim is to find common groupings of items across a collection of items. One method of doing so is Latent Dirichlet Allocation (LDA), a way to model how topics are distributed over a corpus and how words are distributed over a set of topics.
In broad strokes, LDA extracts hidden (latent) topics via the following steps:1, 2
- Arbitrarily decide that there are 10 topics.
- Select one document and randomly assign each word in the document to one of the 10 topics.
- Repeat step 2 for all the other documents. This results in the same word being assigned to multiple topics.
- Compute the following:
- How many topics are in each document?
- How many topic assignments are due to a given word?
- Take one word in one document and reassign it to a new topic and then repeat step 4.
- Repeat step 5 until the model stabilizes such that reassigned topics do not change distributions.
LDA yields a set of words associated to each topic (see step 4, part 2) and the mixture of topics associated to each document (see step 4, part 1).
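To make those steps concrete, here is a minimal toy sketch of the reassignment loop in the spirit of collapsed Gibbs sampling. The corpus, counts, and scoring rule are simplified illustrations invented for this sketch; sklearn's implementation used below relies on variational inference rather than this procedure.

import numpy as np

rng = np.random.default_rng(42)

# A toy corpus: each document is a list of word ids from a 6-word vocabulary
docs = [[0, 1, 2, 1], [3, 4, 5, 4], [0, 2, 5, 3]]
num_topics = 2

# Steps 2-3: randomly assign every word occurrence to a topic
assignments = [rng.integers(num_topics, size=len(doc)) for doc in docs]

# Step 4: count topics per document and topic assignments per word
doc_topic = np.zeros((len(docs), num_topics))
word_topic = np.zeros((6, num_topics))
for d, (doc, z) in enumerate(zip(docs, assignments)):
    for w, t in zip(doc, z):
        doc_topic[d, t] += 1
        word_topic[w, t] += 1

# Steps 5-6: repeatedly revisit each word and move it to a topic that is
# plausible given its document's topic counts and the word's topic counts
for _ in range(50):
    for d, (doc, z) in enumerate(zip(docs, assignments)):
        for i, w in enumerate(doc):
            old = z[i]
            doc_topic[d, old] -= 1
            word_topic[w, old] -= 1
            scores = (doc_topic[d] + 1) * (word_topic[w] + 1)
            new = rng.choice(num_topics, p=scores / scores.sum())
            z[i] = new
            doc_topic[d, new] += 1
            word_topic[w, new] += 1

print(doc_topic)   # topic mixture counts per document
print(word_topic)  # word counts per topic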
This image, inspired by Christine Doig's PyTexas 2015 "Introduction to Topic Modeling" presentation, can help explain the process:
Let's do topic modeling with sklearn. One of the best things about sklearn is the simplicity of its syntax. To do topic modeling with sklearn, follow these five steps (the function names remain the same, regardless of the algorithm you use!):
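As an illustration of that uniformity, here is the same pattern with a different decomposition algorithm, NMF. This is only a sketch for comparison, not part of the walkthrough, and it assumes the tf document-term matrix built in the previous section.

from sklearn.decomposition import NMF

nmf = NMF(n_components=2, random_state=42)   # instantiate with parameters
nmf.fit(tf)                                  # fit to the document-term matrix
nmf_doc_topic = nmf.transform(tf)            # per-document topic weights
nmf_topic_word = nmf.components_             # per-topic word weights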
In this example, we will be using the Latent Dirichlet Allocation algorithm.
from sklearn.decomposition import LatentDirichletAllocation
When creating an instance of sklearn's LatentDirichletAllocation algorithm to run on our data, we need to set parameters. n_components is the number of topics in the dataset, and we set random_state to 42 so that this notebook is reproducible. Since the sentences happen to already have labels (either news or romance), let's see if LDA can also find those separations by setting the number of topics to 2.
num_topics = 2
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
Using the lda object we set up above, we now apply (fit) the LDA algorithm to the bag of words we extracted from our sentences and stored in the tf sparse matrix.
lda.fit(tf)
The result will look something like this:
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
evaluate_every=-1, learning_decay=0.7, learning_method=None,
learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
mean_change_tol=0.001, n_components=2, n_jobs=1,
n_topics=None, perp_tol=0.1, random_state=42,
topic_word_prior=None, total_samples=1000000.0, verbose=0)
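The None values for doc_topic_prior and topic_word_prior mean sklearn falls back to its default of 1 / n_components; after fitting, the values actually used are stored on the model and can be inspected (the printed values in the comment assume n_components=2):

# Both priors default to 1 / n_components when left as None
print(lda.doc_topic_prior_, lda.topic_word_prior_)  # 0.5 0.5 for a 2-topic model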
We now want to model the documents in our corpus in terms of the topics discovered by the model. This is done using the .transform method of LDA, which yields the distribution of topics across the documents. The document_topic array contains the proportion of each topic found in each document.
document_topic = lda.transform(tf)
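A quick sanity check on what transform returns (the values in the comments are illustrative, not the exact output):

# One row per document, one column per topic; each row is a distribution
print(document_topic.shape)     # (number of documents, 2)
print(document_topic[0])        # e.g. [0.87 0.13] -- exact values will vary
print(document_topic[0].sum())  # ~1.0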
Then we visualize how much of each document belongs to each topic, for example that a given document is 10% topic A and 90% topic B. We choose an area chart because each band of the chart maps to a different category (in this case a unique topic). The thickness of each band in relation to the others illustrates how much of the document is thought to be about that topic relative to the others.
%matplotlib inline
import matplotlib.pyplot as plt
from cycler import cycler
import numpy as np
colors = ['tab:green', 'tab:pink']
topics = np.arange(num_topics)
num_docs = document_topic.shape[0]
fig, ax = plt.subplots(figsize=(15,5))
_ = ax.stackplot(range(num_docs), document_topic.T, labels=topics, colors=colors)
_ = ax.set_xlim(0, num_docs)
_ = ax.set_ylim(0,1)
_ = ax.set_yticks([])
_ = ax.set_xlabel("document")
_ = ax.legend(title="topic", bbox_to_anchor=(1.06, 1), borderaxespad=0)
fig.savefig("images/doc_topic.png", bbox_inches = 'tight', pad_inches = 0)
lda.components_ is an array where each row is a topic and each column roughly contains the number of times that word was assigned to that topic; normalized, these values give the probability of the word appearing in the topic. To figure out which word is in which column, we use the get_feature_names() function from CountVectorizer. The argsort function is used to return the indexes of the columns with the highest values, which we then map into our collection of words. Here we print the top 10 words in each topic.
num_words = 10
topic_word = lda.components_
words = np.array(tf_vectorizer.get_feature_names())
for i, topic in enumerate(topic_word):
    # argsort sorts in ascending order, so ::-1 reverses it to descending
    sorted_idx = topic.argsort()[::-1]
    print(i, words[sorted_idx][:num_words])
0 ['said' 'like' 'time' 'just' 'll' 'way' 'didn' 'new' 'president' 'thought']
1 ['mrs' 'said' 'home' 'little' 'year' 'day' 'good' 'new' 'got' 'right']
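If you want actual probabilities rather than counts, each row of lda.components_ can be normalized (topic_word_dist is a name introduced here just for illustration):

# components_ holds (pseudo-)counts, not probabilities; normalizing each row
# gives the distribution of words within a topic
topic_word_dist = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
print(topic_word_dist.sum(axis=1))  # each row now sums to 1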
We can also visualize these topics as lists sized by the frequency of the word and colored by the topic, as proposed by Allen Riddell in Text Analysis with Topic Models for the Humanities and Social Sciences:
# Set the base font size from the word with the largest share in the corpus
fontsize_base = 40 / np.max(topic_word)
fig, ax = plt.subplots(figsize=(15, 2), constrained_layout=True)
for i, topic in enumerate(topic_word):
top_idx = topic.argsort()[::-1][:num_words]
top_words = words[top_idx]
top_share = topic[top_idx]
for j, (word, share) in enumerate(zip(top_words, top_share)):
ax.text(j, i/4, word, fontsize=fontsize_base*share, color=colors[i])
# Stretch the x-axis to accommodate the words
ax.set_xlim(0, num_words)
ax.set_ylim(-.2, i/4+.2)
ax.axis('off')
fig.savefig("images/word_topic.png", bbox_inches = 'tight', pad_inches = 0)
One method of evaluating a model is to compute the chance (probability) of the data we observed showing up in a dataset generated by the model. First we start with the modeled probability density function, which is the theoretical distribution of all topics in our model. We then use the log likelihood and the perplexity functions to evaluate the average odds of our observations occurring in the modeled distribution of words and topics.
Evaluate how well the model fits the data by computing the
- score: approximate log likelihood—the higher the better
- perplexity: exponent of the negative log likelihood—the lower the better
print(f'Approximate Log Likelihood: {lda.score(tf)}')
print(f'Perplexity: {lda.perplexity(tf)}')
This should generate the following output:
Approximate Log Likelihood: -657835.3569726176
Perplexity: 8218.638504839773
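The two numbers are linked: sklearn's perplexity is the exponent of the negative per-word log likelihood, so it can be reconstructed from the score and the total word count in the tf matrix (a sanity check, assuming the tf matrix from above):

import numpy as np

total_words = tf.sum()
print(np.exp(-lda.score(tf) / total_words))  # should closely match lda.perplexity(tf)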
We can compare the results of our topic modeling to the labels we already have for the data. First we need to assign a label to each document based on which topic is most prevalent, which we can do using the argmax function since it returns the index (which maps directly to the topic) of the cell with the highest value. We then compare these topic-based classes to the labels in our dataset. Given the top words for each topic, we will make the assumption that topic 0 is news and topic 1 is romance.
# Get the index of the topic with the highest value in each row (document)
topic_class = document_topic.argmax(axis=1)
topic_labels = np.empty(topic_class.shape, dtype=object)
topic_labels[topic_class==0] = 'news'
topic_labels[topic_class==1] = 'romance'
topic_labels
This should generate the following output:
array(['news', 'news', 'news', ..., 'news', 'news', 'romance'], dtype=object)
We can now use a confusion matrix to see how much overlap there is between the topics and the labels. A confusion matrix counts the true positive, false positive, false negative, and true negative labelings. It can be visualized as a table with the actual labels on the rows and the predicted labels on the columns:
| | predicted news | predicted romance |
|---|---|---|
| actual news | | |
| actual romance | | |
from sklearn.metrics import confusion_matrix
confusion_matrix(df['label'], topic_labels)
The result is the following:
| | predicted news | predicted romance |
|---|---|---|
| actual news | 2480 | 2143 |
| actual romance | 2540 | 1891 |
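To put a single number on this overlap, we can compute the fraction of sentences whose dominant topic agrees with the label, using the topic-to-label mapping we assumed above:

import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(df['label'], topic_labels)
# Diagonal cells are sentences where the dominant topic matches the label
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # roughly 0.48 for the matrix above, i.e. about chance level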
Unfortunately LDA doesn't seem to work all that well for this dataset. And nothing about the topics indicates a distinction between the romance and news texts... but we already saw that they didn't seem to be all that separable. Can we get better results by expanding the corpus to include more texts of other types? Or by expanding each document so that it is longer than a sentence?
Since topic modeling works better with longer texts, what topics do you get if you try to model:
- Moby Dick
- Pride and Prejudice
- Both together?
- A contemporary text like The Hunger Games
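As a starting point for these exercises, here is one way you might chunk a longer text into documents for LDA, assuming nltk's gutenberg corpus has been downloaded; the chunk size and number of topics are arbitrary choices to experiment with:

from nltk.corpus import gutenberg
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Split Moby Dick into ~200-word chunks so each "document" is longer than a sentence
tokens = gutenberg.words('melville-moby_dick.txt')
chunks = [' '.join(tokens[i:i + 200]) for i in range(0, len(tokens), 200)]

chunk_vectorizer = CountVectorizer(stop_words='english', max_features=1000)
chunk_tf = chunk_vectorizer.fit_transform(chunks)

chunk_lda = LatentDirichletAllocation(n_components=5, random_state=42)
chunk_lda.fit(chunk_tf)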
1 Introduction to Latent Dirichlet Allocation by Edward Chen.
2 The LDA Buffet is Now Open by Matthew Jockers.