
[WIP][DNM] Visualize topic model difference (need feedback) #1243

Closed
wants to merge 31 commits

Conversation

menshikh-iv
Contributor

@menshikh-iv commented Mar 26, 2017

I think the gensim library does not offer enough for visualizing topic models, and this gap motivated me to start working in this direction.
I see two important use cases here (both need only a dictionary and a topic-word matrix, so they apply to any topic model):

Case 1: Difference between two models

I have two topic models, and the question is "What is the difference between them?". I treat a model as a matrix Theta (Topics x Dictionary). We need to quantify how "similar" the two models are and what exactly "similarity" and "difference" mean here. The difference between models is well described by the differences between their topics. Using this idea, I build a topic x topic matrix for the two models, where matrix[topic_i][topic_j] describes the difference between those two topics. For this purpose I use several "distance functions", such as KL divergence, Hellinger distance, and Jaccard distance. To annotate matrix[topic_i][topic_j], I use the intersection and the symmetric difference of the top_n words from each topic.

This approach lets you see how different the models are, and it also shows the specific words behind every topic pair.
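A rough sketch of this idea, assuming two trained gensim LdaModel instances over the same dictionary and using Jaccard over the top_n words; topic_diff_matrix is an illustrative name, not the final API of this PR:

    import numpy as np

    def topic_diff_matrix(m1, m2, top_n=10):
        """Pairwise Jaccard distance between the topics of two models (same dictionary)."""
        tops1 = [set(w for w, _ in m1.show_topic(i, topn=top_n)) for i in range(m1.num_topics)]
        tops2 = [set(w for w, _ in m2.show_topic(j, topn=top_n)) for j in range(m2.num_topics)]

        z = np.zeros((m1.num_topics, m2.num_topics))
        annotation = [[None] * m2.num_topics for _ in range(m1.num_topics)]
        for i, t1 in enumerate(tops1):
            for j, t2 in enumerate(tops2):
                # distance = 1 - |intersection| / |union| of the top_n word sets
                z[i, j] = 1.0 - len(t1 & t2) / float(len(t1 | t2))
                # annotation: shared words and the symmetric difference of the top_n words
                annotation[i][j] = (t1 & t2, t1 ^ t2)
        return z, annotation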

Case 2: One-by-one difference between models during training

Training a model can take a very long time, so I save the model to disk every N documents. The questions are: "How do I tell whether I need to keep training, or whether the model has already converged and further training is pointless?" and "How can I see what happens to the model during training?"

To solve this, I train an LDA model and dump a checkpoint every N documents. I then construct a matrix of shape (num_topics, models_count), ordered by training time, where matrix[topic_i][time_j] represents the difference for topic_i between the previous and the next checkpoint at training time time_j. This view shows clearly what happens during training: we can see whether the model converges (or not) and spot "trash topics" (topics that are constantly changing).

I also plot sum(diff_between_two_models) over the topics of this matrix (one value per checkpoint step) and compare the curve with perplexity and coherence (u_mass).
As a result, I noticed that this approach handles anomalous situations better than perplexity (something goes wrong with the model, but perplexity does not change), and that it correlates with coherence while being much faster and simpler to compute.
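A rough sketch of this step, assuming the checkpoints were saved with LdaModel.save(); checkpoint_paths is a hypothetical list of file paths, and the per-topic distance is the same Jaccard-over-top-words as in the sketch above:

    import numpy as np
    from gensim.models import LdaModel

    def convergence_matrix(checkpoint_paths, top_n=10):
        """Rows = topics, columns = consecutive checkpoint pairs, ordered by training time."""
        models = [LdaModel.load(path) for path in checkpoint_paths]
        num_topics = models[0].num_topics
        diff = np.zeros((num_topics, len(models) - 1))
        for j in range(len(models) - 1):
            prev, cur = models[j], models[j + 1]
            for i in range(num_topics):
                before = set(w for w, _ in prev.show_topic(i, topn=top_n))
                after = set(w for w, _ in cur.show_topic(i, topn=top_n))
                # how much topic i changed between checkpoint j and checkpoint j + 1
                diff[i, j] = 1.0 - len(before & after) / float(len(before | after))
        # one number per step: total change across all topics, i.e. the curve
        # that is compared against perplexity and coherence above
        return diff, diff.sum(axis=0)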

You can see this solution in the current commits (warning: the plots may not be displayed on GitHub, so you should open the HTML version of the notebook).

I would like to see your suggestions and comments (@piskvorky and @tmylk)

P.S.

The next step is to work on the model code itself (BaseModel or something else) to collect the necessary data from the models during training and compute these statistics on the fly.

In addition, the plans include:

  • Deeper introspection of models (using an external corpus, as pyLDAvis or Termite do)
  • More work on visualization (perhaps a web application or something similar)

@tmylk
Contributor

tmylk commented Mar 28, 2017

btw there is a termite integration in https://github.com/baali/TopicModellingExperiments

@menshikh-iv
Contributor Author

@tmylk thank you

@@ -0,0 +1,80 @@
from random import sample
import numpy as np
Contributor

@tmylk May 8, 2017

Please add this to LDAModel or BaseTopicModel class

@tmylk
Contributor

tmylk commented May 10, 2017

Please add more explanatory text and split the topic diff visualisation into two (a sketch of the split follows the list):

  1. topic i vs topic j: the upper triangle of the matrix (without the diagonal). These should be different.

  2. topic i vs topic i: just the diagonal. These should be the same.
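Something like this (a sketch only, assuming a square topic-vs-topic diff matrix mdiff with the same topic ids on both axes; the function name is illustrative):

    import numpy as np

    def split_diff(mdiff):
        """Separate the "should differ" and "should match" parts of a square diff matrix."""
        mdiff = np.asarray(mdiff, dtype=float)
        # 1. topic i vs topic j (i < j): strict upper triangle, diagonal masked out.
        #    Large distances are good here: distinct topics should stay distinct.
        mask = np.triu(np.ones(mdiff.shape, dtype=bool), k=1)
        upper = np.where(mask, mdiff, np.nan)
        # 2. topic i vs topic i: just the diagonal. Small distances are good here:
        #    the same topic should stay (nearly) the same.
        diagonal = np.diag(mdiff)
        return upper, diagonal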

@HarryBaker

I am working on a similar project that I think ties in with topic2topic_difference. I am trying to validate that my LDA models are reproducible: I want to show that if I were to train another model under the same parameters, it would be almost identical to the original. This is to show that my topics are not a result of chance, but accurate representations of the training corpus. Could I apply the output of topic2topic_difference to this goal?

@menshikh-iv
Contributor Author

menshikh-iv commented May 12, 2017

@HarryBaker yeah, you can use topic2topic_difference for this purpose. You can choose whichever metric is most suitable (for example distance="jaccard", or another one).

But remember that topics can change places (for example, the first model has t1, t2, t3 and the second has t3, t1, t2).
In that case the difference between the topic-word matrices will look significant even though the topics may be identical.

If you fix the topic order, you will have no problems with this approach; otherwise you need to work with permutations (over topics) of the topic-word matrix.
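One way to handle that permutation (not part of this PR) is an optimal one-to-one matching over the pairwise distance matrix; here is a sketch using SciPy's Hungarian solver, for illustration only:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def align_topics(dist_matrix):
        """Match each topic of model 1 to its closest counterpart in model 2.

        dist_matrix[i, j] is the distance between topic i of model 1 and topic j
        of model 2 (e.g. Jaccard over top-n words)."""
        dist = np.asarray(dist_matrix, dtype=float)
        rows, cols = linear_sum_assignment(dist)           # minimises the total matched distance
        mapping = dict(zip(rows.tolist(), cols.tolist()))  # topic of model 1 -> topic of model 2
        return mapping, dist[rows, cols].mean()            # mean distance over the matched pairs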

@HarryBaker

My plan was to design a script that would compare two models and then try to match corresponding topics based on their similarity score. That way you would identify the most obvious matches first and then calculate the degree of dissimilarity of the remaining, junkier topics. If the models are actually similar, I would expect to see fewer dissimilar junk topics.

@menshikh-iv
Contributor Author

@HarryBaker Nice, I will write this method for LdaModel this week.

@HarryBaker

I'm going to work on it over the next few days. Can I send in a pull request if I make significant progress?

@menshikh-iv
Contributor Author

@HarryBaker yes

@HarryBaker

Ok, I have a few questions about topic2topic_difference().

In line 61 you have the chained assignment:

        z[topic1][topic2] = z[topic2][topic1] = distance_func(d1[topic1], d2[topic2])

I'm having trouble understanding why you assign z[topic2][topic1] = distance_func(d1[topic1], d2[topic2]). Given two distinct LDA models, distance_func(d1[topic1], d2[topic2]) would generally differ from distance_func(d1[topic2], d2[topic1]). That is, topic 4 in model 1 might be identical to topic 6 in model 2, but that does not mean topic 6 in model 1 is identical to topic 4 in model 2.

Is this what you meant by "topics can change places"?
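For two genuinely distinct models the matrix would not be symmetric, so I would compute each cell separately; a sketch (with d1, d2 and distance_func standing for the same objects as in the snippet above):

    import numpy as np

    def full_diff(d1, d2, distance_func):
        """Fill every cell separately instead of mirroring across the diagonal,
        since distance_func(d1[i], d2[j]) != distance_func(d1[j], d2[i]) in general."""
        z = np.zeros((len(d1), len(d2)))
        for topic1 in range(len(d1)):
            for topic2 in range(len(d2)):
                z[topic1, topic2] = distance_func(d1[topic1], d2[topic2])
        return z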

@tmylk
Contributor

tmylk commented May 15, 2017

@HarryBaker In the next version of the code, only the upper triangle (topic1 > topic2) will be shown, to avoid this confusion.

@HarryBaker

I might just be misunderstanding what this code is intended for. I think I'm using it for a slightly different purpose than it was designed for, because I'm using it to compare two completely distinct LDA models. My goal is to create N duplicate LDA models under the same parameters and then use topic2topic_difference to show how similar they are, to prove that the LDA models I produce are similar and thus reproducible. I'm working in biomedical research, so quantitatively demonstrating reproducibility is very important.

However, from what I understand, it sounds like this code was intended to compare the same topics of a single LDA model across different iterations of training. Am I correct?

@menshikh-iv
Contributor Author

@HarryBaker Yes, I work with models from different iterations.

@HarryBaker

Would you consider adding a method that is intended to compare two distinct models? I think it would be very helpful for certain projects. It would let you both validate models (as I am currently doing) and compare similar models that are not identical. For instance, in a previous project I was studying multiple models created from the same dataset but over different periods of time; a method to compare two distinct models would have been helpful for matching topics across time periods. I was working on something similar, but your code is much simpler and more compact.

@tmylk
Contributor

tmylk commented May 16, 2017

@HarryBaker If you wish to write a new method to compare two models trained on exactly the same data but with a different random seed, that would be welcome. Please create another issue for that. However, IMHO it will need some kind of alignment, say on the top 10 words, to suggest that topic 5 became topic 10 with another random initialization.

@HarryBaker

Ok, I will do that.

What do you mean by alignment? I have been using the Jaccard distance between the top 15 words (using the code in model_difference.py) and have been getting good results. They are very similar to the results I've gotten using KL divergence, but Jaccard works slightly better.
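For reference, roughly how I compare the two kinds of metric (a sketch only, assuming two gensim LdaModel instances; this is not the exact code from model_difference.py):

    from gensim.matutils import hellinger, kullback_leibler

    def compare_metrics(model1, model2, topic1, topic2, top_n=15):
        """Contrast set-based Jaccard (top words only) with distribution-based metrics."""
        # Jaccard over the top_n word sets: insensitive to the exact probabilities
        words1 = set(w for w, _ in model1.show_topic(topic1, topn=top_n))
        words2 = set(w for w, _ in model2.show_topic(topic2, topn=top_n))
        jaccard = 1.0 - len(words1 & words2) / float(len(words1 | words2))

        # Hellinger / KL over the full (normalised) topic-word distributions
        lam1, lam2 = model1.state.get_lambda(), model2.state.get_lambda()
        p = lam1[topic1] / lam1[topic1].sum()
        q = lam2[topic2] / lam2[topic2].sum()
        return jaccard, hellinger(p, q), kullback_leibler(p, q)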

@tmylk
Contributor

tmylk commented May 16, 2017

@HarryBaker replied in #1328

@menshikh-iv
Contributor Author

@HarryBaker In my experience, Jaccard is more stable and robust than KL or Hellinger, but it is not sensitive enough for some tasks.

@HarryBaker

I agree. The big issue I am trying to eliminate is word chaining in topics, where two distinct groups of words are assigned to the same topic because they have one word in common. For instance, in the corpus I'm studying I've noticed that words about breast cancer and words about pregnancy are often assigned to the same topic, because they share "woman" as a word. KL isn't appropriate for comparing topics here because highly probable words might get wrongfully chained into a topic. Jaccard does a much better job for this specific task.

@menshikh-iv
Contributor Author

@HarryBaker Could you try the new code from the 'develop' branch (PR #1334)?

@HarryBaker

I think that my application of your code is different from how you intended it. I have a modified version of your code in my fork. Is your code used during the training of an LDA model?

@menshikh-iv
Contributor Author

@HarryBaker

Is your code used during the training of an LDA model?

If you mean the second case (one-by-one difference between models during training), it will be the next step, coming soon (:

I think that my application of your code is different from how you intended it

Check the current version, I rewrote the code a bit to address some of your wishes.

@tmylk
Contributor

tmylk commented May 23, 2017

@menshikh-iv why is it still empty in http://nbviewer.jupyter.org/github/menshikh-iv/gensim/blob/de1c667a9702fddeef166c8ff6b8c14cb4206cdc/docs/notebooks/model_difference.ipynb ?

Could you please add a PNG image to the notebook AND also link to the HTML version of the latest notebook, so that people can see the awesome viz?

Then we will merge.

@HarryBaker

Yup, those are the changes that would have been necessary for me.

I'm not sure if it helps you, but in the code in my fork (which I link to here: #1328) I wrote some functions to match topics between two models and then find the average similarity. It also gives the option to enforce a bijection between topics.

@piskvorky
Owner

This sounds like an awesome feature! What's the status here?

@menshikh-iv
Contributor Author

The first case is now in 2.2.0 (#1374 and #1334); the second case will be part of #1396 and #1399.
