Calculate Similarity of Distinct LDA Models #1328
Comments
Hi, visualisations are the top priority on our roadmap, so it would be very welcome.
I am going to write a script that uses a modified topic2topic_difference to return the all-vs-all distance matrix between two topic models. I will use this to pair each topic in model 1 with its most similar topic in model 2, along with its distance score. I then sort this list by distance and start assigning the most similar topics as matches. If I run into a collision (that is, a topic that has already been matched), then I find the next most similar pairing for the topic in model 1. I then re-sort the list and continue until I have a 1-to-1 relationship between all topics. This gives me a bijection between topics, and I can then take the average of each pair's distance score to assign a similarity score to the two models. From my tests so far the matching appears to be working reasonably well, though I need to do further testing of the average similarity score.
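To make the matching step concrete, here is a minimal sketch of the greedy 1-to-1 assignment described above, assuming the all-vs-all distance matrix has already been computed. The function name and structure are illustrative only, not the fork's actual API:

```python
def match_topics(dist):
    """Greedy 1-to-1 matching of model-1 topics to model-2 topics.

    `dist[i][j]` is the distance between topic i of model 1 and topic j
    of model 2 (a square all-vs-all matrix). Committing the globally
    closest still-unmatched pair on each pass resolves collisions the
    same way as the sort-and-reassign procedure described above.
    """
    n_topics = len(dist)
    unmatched = set(range(n_topics))   # model-1 topics still to place
    taken = set()                      # model-2 topics already claimed
    matches = []

    while unmatched:
        # Closest remaining (model-1, model-2) pair among free topics.
        i, j = min(
            ((a, b) for a in unmatched for b in range(n_topics) if b not in taken),
            key=lambda ab: dist[ab[0]][ab[1]],
        )
        matches.append((i, j, dist[i][j]))
        unmatched.remove(i)
        taken.add(j)

    # Average matched distance as a single similarity score for the pair of models.
    avg_distance = sum(d for _, _, d in matches) / len(matches)
    return matches, avg_distance
```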
Please have a look at the Word Mover's Distance code in gensim referenced above; the "minimal transport search" algorithm can be re-used from that package.
Ok, I will check it out. Thanks!
Word Mover's Distance does seem to work better than the other metrics so far. For the most similar topics it aligns identically with jaccard and kl, but for the "fuzzier" topics I think it does a better job of matching them. Here is my fork of the project. It's still a work in progress; I need to add sanity checks and make it fit gensim's style, but it shows how my code works. It's in the branch topic2topic_seperate_models.
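For anyone following along, one way to apply WMD to a pair of topics is to treat each topic's top words as a small "document" and use a word-embedding model's wmdistance. This is only a sketch of that idea, not the code in the fork: the vectors file, the num_words cut-off, and the unweighted word bags are all assumptions, and pyemd must be installed for wmdistance to work.

```python
from gensim.models import KeyedVectors

# Placeholder path: any pre-trained word vectors loaded as KeyedVectors will do.
word_vectors = KeyedVectors.load("word_vectors.kv")

def topic_wmd(lda_a, lda_b, topic_a, topic_b, num_words=20):
    """Word Mover's Distance between two topics, treating each topic's
    top words as an unweighted bag of words."""
    words_a = [w for w, _ in lda_a.show_topic(topic_a, topn=num_words)]
    words_b = [w for w, _ in lda_b.show_topic(topic_b, topn=num_words)]
    return word_vectors.wmdistance(words_a, words_b)
```

Note that wmdistance weights words by their counts in each "document", so this ignores the topic probabilities; a probability-weighted variant would need the underlying EMD solver directly.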
Thanks for looking into it. Could you please add an ipynb illustrating this point?
I can't publish the data I'm studying now, but tomorrow I will try to find public data to demonstrate with.
Do you know if there have been any papers written on measuring the reproducibility of LDA models? I've tried to find papers on the subject, and it doesn't seem like it has been studied. This is surprising, because I would think that guaranteeing reproducibility would be a major part of academic research. If this hasn't been studied, my department might look into putting out a research paper on the subject.
https://pdfs.semanticscholar.org/d6d4/3ee873e40c3186f6313028ef1a4c08225c96.pdf seems to cover a similar issue. Weirdly, this is the only paper I've found on measuring the stability of LDA topics.
@HarryBaker Stability under different random seeds is indeed an important issue. There is also inherent non-determinism from multithreading in MulticoreLDA vs the single-core LDAModel that would be nice to measure. BTW, see the two-model comparison graph in #1374.
@tymlk @HarryBaker The technique of computing Jaccard similarity between topic-term distributions is analyzed there and shown to have low agreement with human annotators. I believe this is the metric used by the topic2topic_difference code being referred to. That isn't to say it's not worth having; just that it may be an inferior technique to others that are also easy to integrate.
Resolved in #1374.
This is a slightly modified version of topic2topic_difference found here: #1243
Rather than comparing the similarity of a single LDA model across training iterations, I want to compare the similarity of two distinct LDA models after training: two models trained on exactly the same data with exactly the same parameters. If their similarity is very high, this should indicate that the models are reproducible, and that another person could train a new model on the same data with the same parameters and be confident that their model is essentially the same as ours.
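As a rough, self-contained sketch of that reproducibility check: the toy corpus, number of topics, and choice of the jaccard distance below are placeholders for illustration, and it assumes gensim's LdaModel.diff (the merged form of the topic2topic_difference work) accepts a second, separately trained model.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus just so the snippet runs end to end; in practice this would
# be the real training data used for both models.
texts = [
    ["human", "machine", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "trees", "minors", "graph", "survey"],
    ["eps", "user", "interface", "system"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Two models with identical data and parameters, differing only in the seed.
lda_1 = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)
lda_2 = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=2)

# All-vs-all topic distance matrix between the two distinct models.
mdiff, _annotation = lda_1.diff(lda_2, distance="jaccard", num_words=10)

# A crude one-number summary before any 1-to-1 matching: average over each
# model-1 topic's distance to its closest model-2 topic.
closest = np.min(np.array(mdiff), axis=1)
print("mean closest-topic distance:", float(closest.mean()))
```

The 1-to-1 matching described earlier in the thread would then be run on mdiff instead of the crude per-row minimum used here.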
However, I think this has an application beyond testing identical models trained with different seeds. If you use the Jaccard distance of the top N words of each topic, then I think you can compare topics across models trained on different datasets. For instance, if you have two models trained on similar datasets over different periods of time, you can match topics across models and study how they've changed over time. I'm going to do more research into this when I'm confident that the topic matching works.
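For illustration, a minimal sketch of that top-N-word Jaccard distance; the cut-off of 10 words and the helper names are arbitrary choices, not settled API:

```python
def topic_top_words(lda, topic_id, topn=10):
    """Set of the top-N words for one topic of one model."""
    return {word for word, _ in lda.show_topic(topic_id, topn=topn)}

def topic_jaccard_distance(lda_a, topic_a, lda_b, topic_b, topn=10):
    """Jaccard distance between the top-N word sets of two topics,
    possibly from models trained on different corpora or time periods."""
    set_a = topic_top_words(lda_a, topic_a, topn)
    set_b = topic_top_words(lda_b, topic_b, topn)
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)
```

Because it compares word strings rather than term ids, this works even when the two models were built over different dictionaries.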
A quick warning: this is my first time ever contributing open-source code, so I apologize in advance if I do anything wrong in terms of style or workflow. I'm currently working on solving the problem of model reproducibility for my company, and thought that my code might be useful for the gensim community.