Word2vec coherence #1530
Conversation
… to allow passing in pre-trained, pre-loaded word embeddings, and adjust the similarity measure to handle missing terms in the vocabulary. Add a `with_std` option to all confirmation measures that allows the caller to get the standard deviation between the topic segment sets as well as the means.
…ures, and add test case to sanity check `word2vec_similarity`.
…st coverage for this, and update the `CoherenceModel` to use this for getting topics from models.
…bility distributions for the probabilistic topic models.
… will uncache the accumulator and the topics will be shrunk/expanded accordingly.
…f the executables are not installed, instead of passing them inappropriately.
…c_coherence
# Conflicts:
#	gensim/test/test_coherencemodel.py
#	gensim/topic_coherence/direct_confirmation_measure.py
#	gensim/topic_coherence/indirect_confirmation_measure.py
…ew Word2Vec-based coherence metric "c_w2v".
…t of models or top-N lists efficiently. Update the notebook to use the helper methods. Add `TextDirectoryCorpus` import in `corpora.__init__` so it can be imported from package level. Update notebook to use `corpora.TextDirectoryCorpus` instead of redefining it.
@menshikh-iv I believe this PR is now ready to go. I responded to the change requests in the original PR. Please let me know if there are any additional changes needed. Thanks!
gensim/models/coherencemodel.py
Outdated
```python
@@ -261,6 +260,10 @@ def for_topics(cls, topics_as_topn_terms, **kwargs):
        for topic in topic_list:
            topn = max(topn, len(topic))

        if 'topn' in kwargs:
            topn = min(kwargs.get('topn'), topn)
            del kwargs['topn']
```
Have a look at `dict.pop()`.
@piskvorky @menshikh-iv all merge conflicts have been resolved and all requested changes made. Please let me know if there is anything else to be done here. Thanks!
Very nice PR, thank you @macks22 🔥 !!
Great update for notebook (easy to read), and general idea looks good.
Please resolve my small comments and I'll merge it!
```json
        "Wall time: 43.7 s\n"
        "Dictionary(24593 unique tokens: [u'woods', u'hanging', u'woody', u'localized', u'gaa']...)\n",
        "CPU times: user 20 s, sys: 797 ms, total: 20.8 s\n",
        "Wall time: 21.1 s\n"
       ]
      }
     ],
     "source": [
      "%%time\n",
      "\n",
      "corpus = NewsgroupCorpus(data_path)\n",
```
What's the reason to use `NewsgroupCorpus` instead of the sklearn data interface? For comfort?
This was a comfort thing -- I already had code for this. I've updated the notebook to use the `sklearn.datasets.fetch_20newsgroups` function for retrieving the data.
```json
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:gensim.topic_coherence.indirect_confirmation_measure:at least one topic word not in word2vec model\n",
```
Please suppress this warning here (and below)
done
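One way to suppress such a warning in a notebook cell (a sketch; the logger name is taken directly from the warning message shown above) is to raise that logger's threshold:

```python
import logging

# Silence WARNING-level messages such as
# "at least one topic word not in word2vec model" from this module.
logger = logging.getLogger('gensim.topic_coherence.indirect_confirmation_measure')
logger.setLevel(logging.ERROR)  # only ERROR and above will be emitted
```

This leaves other gensim loggers untouched, unlike disabling warnings globally.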
```python
def cosine_similarity(segmented_topics, accumulator, topics, measure='nlr', gamma=1,
                      with_std=False):
```
Don't use vertical indent (the line here is not very long); please put the definition on one line.
I added a new argument here, which made it longer, so I used the hanging indent instead.
…er to get 20 newsgroups corpus. Add `with_support` option to the confirmation measures to determine how many words were ignored during calculation. Add `flatten` function to `utils` that recursively flattens an iterable into a list. Improve the robustness of coherence model comparison by using nanmean and mean value imputation when looping over the grid of top-N values to compute coherence for a model. Fix too-long logging statement lines in `text_analysis`.
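The `flatten` utility mentioned in the commit message could look something like the following sketch (an illustration of the idea, not the exact gensim implementation):

```python
def flatten(nested):
    """Recursively flatten an iterable of (possibly nested) iterables
    into a flat list. Strings are treated as atomic values rather than
    iterables of characters."""
    result = []
    for item in nested:
        if hasattr(item, '__iter__') and not isinstance(item, str):
            result.extend(flatten(item))  # descend into nested iterables
        else:
            result.append(item)
    return result
```

For example, `flatten([1, [2, [3, 4]], 'ab'])` yields `[1, 2, 3, 4, 'ab']`.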
@menshikh-iv thank you for the review; I've addressed your requested changes. In updating the notebook, I also noticed and fixed a few other small issues to improve the robustness of this code. Please let me know if there is anything else you'd like modified. Thanks!
Thanks for your work @macks22, you are a TOP contributor 🔥
@macks22 this functionality is not documented - would be great to add the docs 👍
@ydennisy May I interest you in making a PR? ;)
Resolves #1380.
This PR adds a new coherence measure to `CoherenceModel`, called "c_w2v". This uses word2vec word vectors for computing similarity between terms to calculate coherence. This implementation adds a new accumulator in the `text_analysis` module that either uses pre-trained `KeyedVectors` or trains a `Word2Vec` model on the input corpus to derive `KeyedVectors`. A new keyword argument `keyed_vectors` is added to the `CoherenceModel` to pass in pre-trained vectors.
Rather than requiring parity between the corpus dictionary and the keyed vectors vocabulary, the coherence calculation ignores terms that are missing from the vocabulary (with a warning). This PR also adds a new `with_std` argument to the `get_coherence_per_topic` method to calculate the standard deviation between the various segments for each topic. This can serve as some indication of topics that have good overall coherence but a few notable outlier terms or subsets.
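A rough sketch of the two ideas just described -- skipping missing vocabulary terms with a warning, and optionally reporting the standard deviation alongside the mean. The toy vectors and the helper name `topic_coherence` are made up for illustration; this is not gensim's actual implementation:

```python
import math
import statistics
import warnings

# Toy word vectors standing in for trained KeyedVectors.
VECTORS = {'cat': [1.0, 0.0], 'dog': [0.9, 0.1], 'car': [0.0, 1.0]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def topic_coherence(word_pairs, with_std=False):
    """Mean pairwise similarity over a topic's segment pairs, skipping
    pairs whose terms are missing from the vocabulary (with a warning)."""
    sims = []
    for w1, w2 in word_pairs:
        if w1 not in VECTORS or w2 not in VECTORS:
            warnings.warn("at least one topic word not in word2vec model")
            continue  # ignore missing terms instead of failing
        sims.append(cosine(VECTORS[w1], VECTORS[w2]))
    mean = sum(sims) / len(sims)
    if with_std:
        # High std flags topics whose overall coherence hides outlier terms.
        return mean, statistics.pstdev(sims)
    return mean
```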
Finally, I noticed that the `CoherenceModel`'s method of getting topics from models was implemented in a switch-style manner, and refactored it to use polymorphism. Specifically, I introduced a new `get_topics` method on all topic models that returns the topic-term distributions (except for LSI, which just returns the raw weights, since it's not a true probability distribution).
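The refactoring described above can be sketched as follows (class names and the list-based topic matrix are simplified stand-ins, not the actual gensim classes):

```python
class BaseTopicModel:
    def get_topics(self):
        """Return the topic-term matrix for this model."""
        raise NotImplementedError

class LdaLike(BaseTopicModel):
    """Probabilistic model: rows are normalized into distributions."""
    def __init__(self, weights):
        self.weights = weights

    def get_topics(self):
        # Normalize each topic's term weights into a probability distribution.
        return [[w / sum(row) for w in row] for row in self.weights]

class LsiLike(BaseTopicModel):
    """LSI weights are not probabilities, so they are returned as-is."""
    def __init__(self, weights):
        self.weights = weights

    def get_topics(self):
        return self.weights

# CoherenceModel can now call model.get_topics() uniformly,
# instead of switching on the model's concrete type.
```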