Added EnsembleLda for stable LDA topics #2980
Merged
211 commits
7b73db9
added EnsembleLda
241c17e
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
51945e4
Merge branch 'master' of https://github.com/rare-technologies/gensim …
a67d5db
improvements to add_model, various small changes to comments and code
e27be0a
pandas -> numpy: group by label and mean
83de2dd
pandas -> numpy: generate_stable_topics
2af1658
pandas -> numpy: distance matrix creation
100bbf0
pandas -> numpy: CBDBSCAN
aff3287
fixes for automated checks
a545ddf
improvements on logs, comments and variable naming. Changed save func…
d1a6854
minor fix in log message format
3650895
added tests
00a06e9
fixed test
f5f1c9c
removed some dead leftover pandas code from test
c32ddad
removed pathlib from test
dab067f
tests work in python2 locally now
dcc77ef
Merge branch 'master' of https://github.com/rare-technologies/gensim …
eb9ea27
updated ensemble test reference model
6b0dc77
passing tox8
6dc6001
improved determinism of methods
3ec31e7
improved order of assertions
7afd192
trying to achieve higher precision with float64 to avoid some sorting…
16d0357
better approach for comparing with pretrained model
01b68e4
potentially fixing the tests on windows
9314cb4
potentially fixing the tests on windows
b282393
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
0b7febc
changed citation of opinosis
60a717d
tox8 test passing after small change on opinosis comments/citation
2ff60ca
Moving max_random_state inside the model as a private variable.
aloosley d36fe43
removed whitespace
aloosley 1507adf
docstring width
7577aca
sphinx udpate
aloosley 301feac
fixed urls to sphinx notation
64b157e
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
0f4a6b8
changed doc strings, number --> int + some sphinx
aloosley b85fd95
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
f915f50
Removed hanging indents.
aloosley a3161cd
improved topic_model_kind type checking
c63c889
merge
52c239b
Sphinx and docstring updates.
aloosley cb362b5
Merge branch 'EnsembleLda_ReviewJune2019' of github.com:DataReply/gen…
aloosley ffc8e10
review stuff
e54e78c
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
2d269a5
removed unneccessary comments
0612c4a
Update gensim/models/ensemblelda.py
24b34b1
removed paranthesis
d96e1a1
review
a1e3d95
refactor private, hanging indent
5d48c8d
typo
e556e8c
Clarifications to ttda in docstrings and in method docstrings.
aloosley 4ac43d0
solved merge conflict
aloosley 7603045
merge conflict fixed
aloosley dc566f3
docstrings, masks explained and mask warning removed
9d57533
created internal variable for cosine distance calculations
d27cd59
cbdbscan docstring
42aa7ad
moved validate_core outside
eaf62d6
added citation note
aloosley 2e2eb16
moved more stuff outside of _generate_stable_topics
5d23f1c
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
0a6c1f6
typos
0002982
explained CBDBSCAN
aloosley 954659a
merged to remote --> CBDBSCAN explanation
aloosley 9675016
added extra explanation:
aloosley b53704a
using none instead of nan for unchecked core
06c5659
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
4f3de96
updated docs
aloosley 71c083c
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley 2ed2cfe
refactored kind to class, fixed check how to proceed with topic_model…
f39d09c
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
2abce75
reverted change that accidentally broke things
511eaa5
fixed tests locally
96d6fbd
fix code style
819a05f
added _is_easy_valid_cluster
e3025f6
updated thesis reference
95e1b79
updated notebook example, typo
cd7934d
Merge branch 'master' of https://github.com/RaRe-Technologies/gensim …
3d11649
docstring styles, renaming, cleanup, stuff I need to discuss first
71b3825
tox
2d184a4
fixed stuff in CBDBSCAN
b0c5155
removed unused results column and only CB-Distance to other cores
70a0660
tox whitespace
c8b23ab
cleaned obsolete stuff from cbdbscan
6f98624
idk
799c112
updated doc-strings to be clearer and better reflect the truth
aloosley 9866ef6
make flake8 happy
mpenkov ad97ead
fix trailing whitespace
mpenkov 765b912
reverted some changes
32ff401
comma, newline, comment
b00e010
whitespace
c58c30e
citation, reference, authors
637640c
potential fix for utils saveload when a class is in __dict__
d7efe3a
commented out eLDA tests, tox8
644f2e4
saving the topic_model_class using a string instead
a7bbfc0
reverted utils
2abdbf7
saving the topic_model_class using a string instead fixes
c559de6
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
254ca60
tox
aec58c9
quotes in logger.error
f7de190
multiline string
dfc97a0
python 3.5 format strings
88b338b
ModuleNotFoundError: No module named 'numpy.random._pickle'
b96c041
ModuleNotFoundError: No module named 'numpy.random._pickle' x2
584cd70
fixed inference
c1cd036
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
766f562
removed print asdf
46b7cf6
added spec for inference
6689eae
tox
b1f596d
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
80f4f04
lazy loading topic_model_class
b05bf11
tox
39a9b62
removed debug thing
09ded13
Documents now compile
aloosley 9685b00
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley 4bd7cf7
escape sequence thing indent fix
49253df
Better document rendering and added opinosiscorpus to apirefs
aloosley 50fa88b
docstring styling on opinosiscorpus.py
5669f58
citation opinosis
f63eb03
Merge remote-tracking branch 'remotes/original/develop' into EnsembleLda
2aabe8f
missing opinosiscorpus.rst file committed
aloosley f0600dd
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley 9c25cf5
p names refactored to be descriptive, now using append for appending …
aloosley 4079c27
Changing to hanging indents where they were not used before
aloosley f5379ff
Adding :meth: and `` `` styling for RST
aloosley 591cf77
a bunch of reviews
1fbafcb
merge
7366495
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
f1aba3e
* Changed ensemblelda default to use ldamulticore instead of old lda …
aloosley 9833e44
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley a9428de
More docstring polish
aloosley 3bdeaf2
removing some camel-case vars for pep8 compliance.
aloosley 806952e
a bunch of reviews
e1344bd
merge
3d24f62
fixed linter
dd5cd6e
Merge remote-tracking branch 'remotes/original/develop' into EnsembleLda
e3f5021
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
4b92849
Merge remote-tracking branch 'remotes/original/master' into EnsembleLda
cf24141
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
619922d
less precision for windows
a6ae08e
Merge remote-tracking branch 'remotes/origin/EnsembleLda' into Ensemb…
3f00414
hanging indents
a15bb42
somehow recognized some old versions of unrelated files as current ch…
17e3e11
fixed autoformat of IDEA
1bb0194
same thing for tox.ini
9baa79f
hanging indents in opinosis notebook
1fa3729
I hate windows
d3a6151
test for LdaMulcitore ensemble similarity
216ae64
no
e238fc9
fixed wrong max calculation in loop
68d5814
improved comment for sorting clusters
d299f07
added ensemblelda tutorial
1493396
added test that starts with an empty ensemble
87a11c5
some logging of auto parameters
beedbd3
update auto_examples
e84e8c2
changed eps in example
0b5aaf2
added to tutorials
3bd838e
rebuilt
3057927
update docs
847874a
attempt at fixing that pickle problem
c073fff
Merge remote-tracking branch 'original/develop' into EnsembleLda
f8634ef
merge
f7bc8b2
idk
270f4a8
review
c39848c
stream documents instead of loading all into memory
af39b37
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
05bd52e
streaming docs instead of loading all into memory
aloosley 6addfc3
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
aloosley 115cd15
.format replaced with f-string
aloosley f56b60d
docstring
aloosley 18a86a3
Merge branch 'develop' into EnsembleLda
piskvorky 329675f
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
piskvorky 28739ce
tuple variable expansion
aloosley e11e77c
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
aloosley 3369975
removed duplicate string in module docstring
ed91a3f
removed obsolete parent connection comment
sezanzeb 42ac449
reliable -> stable
sezanzeb bbd6c2f
topic model intro v1
aloosley c032b20
topic model intro v2
sezanzeb 8c99b51
topic model intro v3
sezanzeb 41e3591
elda introduction with references
aloosley 8b4a3f9
merge develop
aloosley 3ed4522
Update gensim/corpora/opinosiscorpus.py
sezanzeb c1d2d57
Update gensim/models/ensemblelda.py
sezanzeb 128c9dc
update docstrings
2ee6d6f
Merge branch 'EnsembleLda' of https://github.com/sezanzeb/gensim into…
e07906a
static functions
8647aea
module-level constant
5109799
assert and no return
0dd9f29
static _calculate_asymmetric_distance_matrix_chunk
78a16b4
better variable names and data types
aloosley 8c10975
refactoring
dbb7581
pythonic varnames + pytest style asserts
aloosley 0081b6b
merge feature branch
aloosley 06cc33a
better var names
aloosley ec4b487
better var names
aloosley c4a46e9
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
aloosley 531ac6a
simplified some function calls to use attributes instead of parameters
59270dc
Merge branch 'EnsembleLda' of https://github.com/sezanzeb/gensim into…
0311d94
sort key function
aloosley 3e00f19
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
aloosley 9617e8f
more efficient tests with better case names
aloosley 07f5148
new reference model
6dc4bcb
updated opinosis example
1e5108c
tox
9ac3439
using dataclasses
773ce17
updated type syntax for docstring
4d674f9
unused import
c35fb01
update sbt install step
mpenkov 71b33dd
minor refactoring
mpenkov 444c190
roll back change to docs/src/Makefile
mpenkov f00aca8
re-raise caught exception instead of raising a new one
mpenkov cac6819
add docstring
mpenkov
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,177 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"scrolled": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import logging\n", | ||
"from gensim.models import EnsembleLda, LdaMulticore\n", | ||
"from gensim.corpora import OpinosisCorpus\n", | ||
"import os" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Enable the ensemble logger to show what it is currently doing." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"elda_logger = logging.getLogger(EnsembleLda.__module__)\n", | ||
"elda_logger.setLevel(logging.INFO)\n", | ||
"elda_logger.addHandler(logging.StreamHandler())" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"def pretty_print_topics():\n", | ||
" # note that the words are stemmed so they appear chopped off\n", | ||
" for t in elda.print_topics(num_words=7):\n", | ||
" print('-', t[1].replace('*',' ').replace('\"','').replace(' +',','), '\\n')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Experiments on the Opinosis Dataset\n", | ||
"\n", | ||
"Opinosis [1] is a small (but redundant) corpus that contains 289 product reviews for 51 products. Since it's so small, the results are rather unstable.\n", | ||
"\n", | ||
"[1] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, _Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions [online],_ Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 340–348. Available from: https://kavita-ganesan.com/opinosis/" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Preparing the corpus\n", | ||
"\n", | ||
"First, download the Opinosis dataset. On Linux, for example, this can be done as follows:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!mkdir -p ~/opinosis\n", | ||
"!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip\n", | ||
"!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"path = os.path.expanduser('~/opinosis/')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The corpus and the id2word mapping can be created using the OpinosisCorpus class provided in the package.\n", | ||
"It preprocesses the data using the PorterStemmer and stopwords from the nltk package.\n", | ||
"\n", | ||
"The constructor takes the path to the folder into which the zip file was extracted. That folder contains a 'summaries-gold' subfolder." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"opinosis = OpinosisCorpus(path)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Training" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"**parameters**\n", | ||
"\n", | ||
"**topic_model_class**: 'ldamulticore' is highly recommended for EnsembleLda. **ensemble_workers** and **distance_workers** reduce the time needed to train the models, as does the **masking_method** 'rank'. ldamulticore cannot fully utilize all cores on this small corpus, so **ensemble_workers** can be set to 3 to reach 95 - 100% CPU usage on my i5 3470.\n", | ||
"\n", | ||
"Since the corpus is so small, a high value of **num_models** is needed to extract stable topics. The Opinosis corpus contains 51 categories; however, some of them are quite similar. For example, there are 3 categories about the batteries of portable products, and there are multiple categories about cars. So I chose 20 for **num_topics**, which is smaller than the number of categories." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"elda = EnsembleLda(\n", | ||
" corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,\n", | ||
" passes=20, iterations=100, ensemble_workers=3, distance_workers=4,\n", | ||
" topic_model_class='ldamulticore', masking_method='rank',\n", | ||
")\n", | ||
"pretty_print_topics()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The default for **min_samples** would be 64, half the number of models, and the default for **eps** would be 0.1. Adjust both until you find a sweet spot that fits your needs." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"elda.recluster(min_samples=55, eps=0.14)\n", | ||
"pretty_print_topics()" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
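The `pretty_print_topics` helper in the notebook above chains three string replacements to tidy each topic string before printing. A minimal sketch of what that chain does, assuming topic strings have the usual gensim `weight*"word" + ...` shape; the sample string and the name `format_topic` are made up for illustration and are not part of the PR:

```python
def format_topic(topic: str) -> str:
    """Apply the same three replacements as the notebook's pretty_print_topics."""
    # '*' separates weight and word, '"' wraps each word, ' +' joins the terms.
    return topic.replace('*', ' ').replace('"', '').replace(' +', ',')


# Hypothetical topic string; in the notebook these come from
# elda.print_topics(num_words=7), and the words are stemmed.
sample = '0.050*"batteri" + 0.030*"life" + 0.020*"charg"'
formatted = format_topic(sample)
print('-', formatted)
```

Keeping the formatting in a function that takes the string as an argument, rather than reading the global `elda` as the notebook helper does, makes it possible to check the output without training an ensemble first.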
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
512a76ce743dd12482d21784a76b60fe | ||
96cefb1417d54ac8010e38cc739d5ff1 |
Binary file added
BIN
+26.2 KB
docs/src/auto_examples/tutorials/images/thumb/sphx_glr_run_ensemblelda_thumb.png
Why is this here?
I don't know, @piskvorky
@piskvorky I've rolled this back, since it seems erroneous. Please let me know if I've misunderstood.