Word2vec coherence #1530
Conversation
… to allow passing in pre-trained, pre-loaded word embeddings, and adjust the similarity measure to handle missing terms in the vocabulary. Add a `with_std` option to all confirmation measures that allows the caller to get the standard deviation between the topic segment sets as well as the means.
…ures, and add test case to sanity check `word2vec_similarity`.
…st coverage for this, and update the `CoherenceModel` to use this for getting topics from models.
…bility distributions for the probabilistic topic models.
… will uncache the accumulator and the topics will be shrunk/expanded accordingly.
…f the executables are not installed, instead of passing them inappropriately.
…c_coherence
# Conflicts:
#	gensim/test/test_coherencemodel.py
#	gensim/topic_coherence/direct_confirmation_measure.py
#	gensim/topic_coherence/indirect_confirmation_measure.py
…ew Word2Vec-based coherence metric "c_w2v".
…t of models or top-N lists efficiently. Update the notebook to use the helper methods. Add `TextDirectoryCorpus` import in `corpora.__init__` so it can be imported from package level. Update notebook to use `corpora.TextDirectoryCorpus` instead of redefining it.
@menshikh-iv I believe this PR is now ready to go. I responded to the change requests in the original PR. Please let me know if there are any additional changes needed. Thanks!
gensim/models/coherencemodel.py
Outdated
```python
@@ -261,6 +260,10 @@ def for_topics(cls, topics_as_topn_terms, **kwargs):
        for topic in topic_list:
            topn = max(topn, len(topic))

        if 'topn' in kwargs:
            topn = min(kwargs.get('topn'), topn)
            del kwargs['topn']
```
Have a look at `dict.pop()`.
@piskvorky @menshikh-iv all merge conflicts have been resolved and all requested changes made. Please let me know if there is anything else to be done here. Thanks!
Very nice PR, thank you @macks22 🔥 !!
Great update for notebook (easy to read), and general idea looks good.
Please resolve my small comments and I'll merge it!
```json
        "Wall time: 43.7 s\n"
        "Dictionary(24593 unique tokens: [u'woods', u'hanging', u'woody', u'localized', u'gaa']...)\n",
        "CPU times: user 20 s, sys: 797 ms, total: 20.8 s\n",
        "Wall time: 21.1 s\n"
       ]
      }
     ],
     "source": [
      "%%time\n",
      "\n",
      "corpus = NewsgroupCorpus(data_path)\n",
```
What's the reason to use `NewsgroupCorpus` instead of the sklearn data interface? For comfort?
This was a comfort thing -- I already had code for this. I've updated the notebook to use the `sklearn.datasets.fetch_20newsgroups` function for retrieving the data.
```json
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING:gensim.topic_coherence.indirect_confirmation_measure:at least one topic word not in word2vec model\n",
```
Please suppress this warning here (and below)
done
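One way to suppress such a warning in a notebook cell (a sketch; the logger name is taken directly from the warning message shown above) is to raise that logger's threshold:

```python
import logging

# Silence WARNING-level messages such as
# "at least one topic word not in word2vec model" from this module.
logger = logging.getLogger('gensim.topic_coherence.indirect_confirmation_measure')
logger.setLevel(logging.ERROR)  # only ERROR and above will be emitted
```

This leaves other gensim loggers untouched, unlike disabling warnings globally.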
```python
def cosine_similarity(segmented_topics, accumulator, topics, measure='nlr', gamma=1,
                      with_std=False):
```
Don't use vertical indent (the line here is not very long); please put the definition on one line.
I added a new argument here, which made it longer, so I used the hanging indent instead.
…er to get 20 newsgroups corpus. Add `with_support` option to the confirmation measures to determine how many words were ignored during calculation. Add `flatten` function to `utils` that recursively flattens an iterable into a list. Improve the robustness of coherence model comparison by using nanmean and mean value imputation when looping over the grid of top-N values to compute coherence for a model. Fix too-long logging statement lines in `text_analysis`.
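The `flatten` utility mentioned in the commit message could look something like the following sketch (an illustration of the idea, not the exact gensim implementation):

```python
def flatten(nested):
    """Recursively flatten an iterable of (possibly nested) iterables
    into a flat list. Strings are treated as atomic values rather than
    iterables of characters."""
    result = []
    for item in nested:
        if hasattr(item, '__iter__') and not isinstance(item, str):
            result.extend(flatten(item))  # descend into nested iterables
        else:
            result.append(item)
    return result
```

For example, `flatten([1, [2, [3, 4]], 'ab'])` yields `[1, 2, 3, 4, 'ab']`.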
@menshikh-iv thank you for the review; I've addressed your requested changes. In updating the notebook, I also noticed and fixed a few other small issues to improve the robustness of this code. Please let me know if there is anything else you'd like modified. Thanks!
Thanks for your work @macks22, you are a TOP contributor 🔥
@macks22 this functionality is not documented - would be great to add the docs 👍
@ydennisy May I interest you in making a PR? ;)
Resolves #1380.
This PR adds a new coherence measure to `CoherenceModel`, called "c_w2v". This uses word2vec word vectors for computing similarity between terms to calculate coherence. This implementation adds a new accumulator in the `text_analysis` module that either uses pre-trained `KeyedVectors` or trains a `Word2Vec` model on the input corpus to derive `KeyedVectors`. A new keyword argument `keyed_vectors` is added to the `CoherenceModel` to pass in pre-trained vectors.
Rather than requiring parity between the corpus dictionary and the keyed vectors vocabulary, the coherence calculation ignores terms that are missing from the vocabulary (with a warning). This PR also adds a new `with_std` argument to the `get_coherence_per_topic` method to calculate the standard deviation between the various segments for each topic. This can serve as some indication of topics that have good overall coherence but a few notable outlier terms or subsets.
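A rough sketch of the two ideas just described -- skipping missing vocabulary terms with a warning, and optionally reporting the standard deviation alongside the mean. The toy vectors and the helper name `topic_coherence` are made up for illustration; this is not gensim's actual implementation:

```python
import math
import statistics
import warnings

# Toy word vectors standing in for trained KeyedVectors.
VECTORS = {'cat': [1.0, 0.0], 'dog': [0.9, 0.1], 'car': [0.0, 1.0]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def topic_coherence(word_pairs, with_std=False):
    """Mean pairwise similarity over a topic's segment pairs, skipping
    pairs whose terms are missing from the vocabulary (with a warning)."""
    sims = []
    for w1, w2 in word_pairs:
        if w1 not in VECTORS or w2 not in VECTORS:
            warnings.warn("at least one topic word not in word2vec model")
            continue  # ignore missing terms instead of failing
        sims.append(cosine(VECTORS[w1], VECTORS[w2]))
    mean = sum(sims) / len(sims)
    if with_std:
        # High std flags topics whose overall coherence hides outlier terms.
        return mean, statistics.pstdev(sims)
    return mean
```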
Finally, I noticed that the `CoherenceModel`'s method of getting topics from models was implemented in a switch-style manner, and refactored it to use polymorphism. Specifically, I introduced a new `get_topics` method on all topic models that returns the topic-term distributions (except for LSI, which just returns the raw weights, since it's not a true probability distribution).
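The refactoring described above can be sketched as follows (class names and the list-based topic matrix are simplified stand-ins, not the actual gensim classes):

```python
class BaseTopicModel:
    def get_topics(self):
        """Return the topic-term matrix for this model."""
        raise NotImplementedError

class LdaLike(BaseTopicModel):
    """Probabilistic model: rows are normalized into distributions."""
    def __init__(self, weights):
        self.weights = weights

    def get_topics(self):
        # Normalize each topic's term weights into a probability distribution.
        return [[w / sum(row) for w in row] for row in self.weights]

class LsiLike(BaseTopicModel):
    """LSI weights are not probabilities, so they are returned as-is."""
    def __init__(self, weights):
        self.weights = weights

    def get_topics(self):
        return self.weights

# CoherenceModel can now call model.get_topics() uniformly,
# instead of switching on the model's concrete type.
```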