
Calculate Similarity of Distinct LDA Models #1328

Closed
HarryBaker opened this issue May 16, 2017 · 12 comments

@HarryBaker

This is a slightly modified version of topic2topic_difference found here: #1243

Rather than comparing the similarity of a single LDA model across training iterations, I want to compare the similarity of two distinct LDA models after training. The idea behind this is to calculate the similarity of two distinct LDA models trained on the exact same data with the exact same parameters. If their similarity is very high, this should indicate that the models are reproducible, and that another person could train a new model on the same data with the same parameters and be confident that their model is the same as ours.

However, I think this has an application beyond testing identically configured models trained with different seeds. If you use the Jaccard distance of the top N words of each topic, then I think you can compare topics across models trained on different datasets. For instance, if you have two models trained on similar datasets over different periods of time, you can match topics across models and study how they've changed over time. I'm going to do more research into this once I'm confident that the topic matching works.

A quick warning: this is my first time ever contributing open source code, so I apologize in advance if I do anything wrong in terms of style or workflow. I'm currently working on solving the problem of model reproducibility for my company, and thought that my code might be useful for the gensim community.
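For concreteness, here is a minimal sketch of the Jaccard-of-top-N comparison described above. It assumes two already-trained gensim LdaModel instances (the names lda_1 and lda_2 are placeholders) and uses the standard show_topic call; it illustrates the idea rather than the final topic2topic_difference code:

```python
import numpy as np

def jaccard_distance(words_a, words_b):
    """1 - (intersection / union) between two sets of top words."""
    return 1.0 - len(words_a & words_b) / float(len(words_a | words_b))

def top_n_word_sets(lda, topn=10):
    """Top-N word set for every topic of a trained gensim LdaModel."""
    return [set(word for word, _ in lda.show_topic(topic_id, topn=topn))
            for topic_id in range(lda.num_topics)]

def all_vs_all_jaccard(lda_1, lda_2, topn=10):
    """All-vs-all Jaccard distance matrix between the topics of two models."""
    sets_1 = top_n_word_sets(lda_1, topn)
    sets_2 = top_n_word_sets(lda_2, topn)
    return np.array([[jaccard_distance(a, b) for b in sets_2] for a in sets_1])
```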

@tmylk
Contributor

tmylk commented May 16, 2017

Hi, visualisations are the top priority on our roadmap, so this would be very welcome.
Just painting the all-vs-all distance matrix is a good start, but a single "best alignment distance" number would also be nice; it's a very similar idea to Word Mover's Distance.
How do you suggest matching the topics between models?

@HarryBaker
Author

I am going to write a script that uses a modified topic2topic_difference to return the all-vs-all distance matrix of two topic models. I will use this to pair each topic in model 1 with its most similar topic in model 2, along with its distance score. I then sort this list by distance and start assigning the most similar topics as matches. If I run into a collision (that is, a topic that has already been matched), I find the next most similar pairing for the topic in model 1, re-sort the list, and continue until I have a one-to-one relationship between all topics. This gives me a bijection between topics, and I can then average each pair's distance score to assign a score to the two models. From my tests so far the matching appears to be working reasonably well, though I need to do further testing of the average similarity score.
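A sketch of that greedy matching, assuming the all-vs-all distance matrix has model 1's topics as rows and model 2's topics as columns; the function name and collision-handling details are illustrative rather than the exact fork code:

```python
import heapq
import numpy as np

def greedy_topic_matching(distance_matrix):
    """Greedily pair each model-1 topic (row) with a distinct model-2 topic
    (column), resolving collisions by falling back to the row's next-closest
    unmatched column."""
    dist = np.asarray(distance_matrix, dtype=float)
    n_rows, n_cols = dist.shape
    # For each model-1 topic, an iterator over columns from closest to farthest.
    candidates = {i: iter(np.argsort(dist[i])) for i in range(n_rows)}
    # Seed the heap with every row's single closest column.
    heap = [(dist[i, j], i, j)
            for i, j in ((i, next(candidates[i])) for i in range(n_rows))]
    heapq.heapify(heap)
    pairs, taken_cols = {}, set()
    while heap and len(pairs) < min(n_rows, n_cols):
        d, i, j = heapq.heappop(heap)
        if i in pairs:
            continue
        if j in taken_cols:
            # Collision: push this row's next-closest column that is still free.
            for j_next in candidates[i]:
                if j_next not in taken_cols:
                    heapq.heappush(heap, (dist[i, j_next], i, j_next))
                    break
            continue
        pairs[i] = j
        taken_cols.add(j)
    avg_distance = float(np.mean([dist[i, j] for i, j in pairs.items()]))
    return pairs, avg_distance
```

As an aside, scipy.optimize.linear_sum_assignment (the Hungarian algorithm) computes the globally optimal one-to-one assignment for the same matrix, which is one way to obtain the single "best alignment distance" number mentioned above.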

@tmylk
Contributor

tmylk commented May 17, 2017

Please have a look at the Word Mover's Distance code in gensim referenced above: the "minimal transport search" algorithm can be re-used from that package.
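One straightforward way to try that out (a sketch under assumptions, not the fork's actual code) is to treat each topic's top-N word list as a tiny document and call KeyedVectors.wmdistance on the two lists. This assumes a set of pretrained word vectors (the file path below is a placeholder) and, in gensim of that era, the pyemd package:

```python
from gensim.models import KeyedVectors

# Placeholder path: any pretrained word2vec-format vectors will do.
word_vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def topic_wmd(lda_1, lda_2, topic_1, topic_2, topn=10):
    """Word Mover's Distance between the top-N word lists of two topics,
    treating each list as a small 'document'."""
    words_1 = [word for word, _ in lda_1.show_topic(topic_1, topn=topn)]
    words_2 = [word for word, _ in lda_2.show_topic(topic_2, topn=topn)]
    return word_vectors.wmdistance(words_1, words_2)
```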

@HarryBaker
Author

Ok, I will check it out. Thanks!

@HarryBaker
Author

HarryBaker commented May 18, 2017

Word Mover's Distance does seem to work better than the other metrics so far. For the most similar topics it aligns identically with Jaccard and KL divergence, but for the "fuzzier" topics I think it does a better job of matching them.

Here is my fork of the project. It's still a work in progress; I need to add sanity checks and make it fit gensim's style, but it shows how my code works. It's in the topic2topic_seperate_models branch:

https://github.com/HarryBaker/gensim

@tmylk
Contributor

tmylk commented May 18, 2017

Thanks for looking into it. Could you please add an ipynb illustrating this point?

@HarryBaker
Author

I can't publish the data I'm studying now, but tomorrow I will try to find public data to demonstrate with.

@HarryBaker
Author

Do you know if there have been any papers written on measuring the reproducibility of LDA models? I've tried to find papers on the subject, and it doesn't seem like it has been studied. This is surprising, because I would think that guaranteeing reproducibility would be a major part of academic research. If it hasn't been studied, my department might look into putting out a research paper on the subject.

@HarryBaker
Author

https://pdfs.semanticscholar.org/d6d4/3ee873e40c3186f6313028ef1a4c08225c96.pdf

Seems like it's covering a similar issue. Weirdly, this is the only paper I've found on measuring the stability of LDA topics.

@tmylk
Contributor

tmylk commented May 30, 2017

@HarryBaker Stability under different random seeds is indeed an important issue. There is also inherent non-determinism from multithreading in MulticoreLDA vs the single-core LDAModel that would be nice to measure.

BTW see two model comparison graph in #1374
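As a hypothetical illustration of such a stability check using the diff method from #1243 (assuming an existing bag-of-words corpus and dictionary; exact keyword arguments may differ between gensim versions):

```python
from gensim.models import LdaModel, LdaMulticore

# `corpus` and `dictionary` are assumed to exist already (a bag-of-words
# corpus plus a gensim Dictionary); num_topics is arbitrary for illustration.
lda_a = LdaModel(corpus, id2word=dictionary, num_topics=20, random_state=0)
lda_b = LdaModel(corpus, id2word=dictionary, num_topics=20, random_state=1)
lda_mc = LdaMulticore(corpus, id2word=dictionary, num_topics=20,
                      workers=3, random_state=0)

# diff() returns an all-vs-all difference matrix plus word-level annotations.
mdiff_seeds, _ = lda_a.diff(lda_b, distance="jaccard", num_words=10)
mdiff_multicore, _ = lda_a.diff(lda_mc, distance="jaccard", num_words=10)
```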

@macks22
Contributor

macks22 commented Jun 15, 2017

@tmylk @HarryBaker
This paper evaluates a variety of techniques for topic similarity, ranking average PMI between top-N representations of the topics as one of the best techniques, along with Explicit Semantic Analysis (ESA). The average PMI approach can easily be implemented using the same code as the CoherenceModel.

Jaccard similarity between topic-term distributions is also analyzed and shown to have low agreement with human annotators. I believe this is the metric used by the topic2topic_difference code being referred to. That isn't to say it's not worth having; just that it may be an inferior technique to others that are also easy to integrate.
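For reference, a rough illustrative version of the average-PMI idea, estimating co-occurrence from document-level counts in a tokenized reference corpus; this is a stand-in sketch, not the CoherenceModel code itself:

```python
from itertools import product
import numpy as np

def average_pmi(top_words_1, top_words_2, texts, eps=1e-12):
    """Average PMI over all cross-pairs of two topics' top-N words, with
    probabilities estimated from document-level occurrence counts in `texts`
    (a list of tokenized documents)."""
    doc_sets = [set(tokens) for tokens in texts]
    n_docs = float(len(doc_sets))

    def prob(word):
        return sum(word in doc for doc in doc_sets) / n_docs

    def joint_prob(w1, w2):
        return sum(w1 in doc and w2 in doc for doc in doc_sets) / n_docs

    pmis = [np.log((joint_prob(w1, w2) + eps) / (prob(w1) * prob(w2) + eps))
            for w1, w2 in product(top_words_1, top_words_2)]
    return float(np.mean(pmis))
```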

@menshikh-iv
Contributor

Resolved in #1374.
