Added EnsembleLda for stable LDA topics #2980

Merged
merged 211 commits into from
Jul 22, 2021
Changes from 169 commits
211 commits
7b73db9
added EnsembleLda
Dec 1, 2018
241c17e
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Dec 1, 2018
51945e4
Merge branch 'master' of https://github.com/rare-technologies/gensim …
Mar 12, 2019
a67d5db
improvements to add_model, various small changes to comments and code
Apr 5, 2019
e27be0a
pandas -> numpy: group by label and mean
Apr 5, 2019
83de2dd
pandas -> numpy: generate_stable_topics
Apr 6, 2019
2af1658
pandas -> numpy: distance matrix creation
Apr 7, 2019
100bbf0
pandas -> numpy: CBDBSCAN
Apr 7, 2019
aff3287
fixes for automated checks
Apr 7, 2019
a545ddf
improvements on logs, comments and variable naming. Changed save func…
Apr 8, 2019
d1a6854
minor fix in log message format
Apr 8, 2019
3650895
added tests
Apr 9, 2019
00a06e9
fixed test
Apr 9, 2019
f5f1c9c
removed some dead leftover pandas code from test
Apr 9, 2019
c32ddad
removed pathlib from test
Apr 9, 2019
dab067f
tests work in python2 locally now
Apr 12, 2019
dcc77ef
Merge branch 'master' of https://github.com/rare-technologies/gensim …
Apr 12, 2019
eb9ea27
updated ensemble test reference model
Apr 12, 2019
6b0dc77
passing tox8
Apr 12, 2019
6dc6001
improved determinism of methods
Apr 13, 2019
3ec31e7
improved order of assertions
Apr 13, 2019
7afd192
trying to achieve higher precision with float64 to avoid some sorting…
Apr 13, 2019
16d0357
better approach for comparing with pretrained model
Apr 14, 2019
01b68e4
potentially fixing the tests on windows
Apr 14, 2019
9314cb4
potentially fixing the tests on windows
Apr 14, 2019
b282393
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Apr 20, 2019
0b7febc
changed citation of opinosis
Apr 21, 2019
60a717d
tox8 test passing after small change on opinosis comments/citation
Apr 22, 2019
2ff60ca
Moving max_random_state inside the model as a private variable.
aloosley Jun 25, 2019
d36fe43
removed whitespace
aloosley Jun 25, 2019
1507adf
docstring width
Jun 25, 2019
7577aca
sphinx udpate
aloosley Jun 25, 2019
301feac
fixed urls to sphinx notation
Jun 25, 2019
64b157e
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
Jun 25, 2019
0f4a6b8
changed doc strings, number --> int + some sphinx
aloosley Jun 25, 2019
b85fd95
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
Jun 25, 2019
f915f50
Removed hanging indents.
aloosley Jun 25, 2019
a3161cd
improved topic_model_kind type checking
Jun 25, 2019
c63c889
merge
Jun 25, 2019
52c239b
Sphinx and docstring updates.
aloosley Jun 25, 2019
cb362b5
Merge branch 'EnsembleLda_ReviewJune2019' of github.com:DataReply/gen…
aloosley Jun 25, 2019
ffc8e10
review stuff
Jun 25, 2019
e54e78c
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
Jun 25, 2019
2d269a5
removed unneccessary comments
Jun 25, 2019
0612c4a
Update gensim/models/ensemblelda.py
Jun 25, 2019
24b34b1
removed paranthesis
Jun 25, 2019
d96e1a1
review
Jun 25, 2019
a1e3d95
refactor private, hanging indent
Jun 25, 2019
5d48c8d
typo
Jun 25, 2019
e556e8c
Clarifications to ttda in docstrings and in method docstrings.
aloosley Jun 25, 2019
4ac43d0
solved merge conflict
aloosley Jun 25, 2019
7603045
merge conflict fixed
aloosley Jul 29, 2019
dc566f3
docstrings, masks explained and mask warning removed
Jul 29, 2019
9d57533
created internal variable for cosine distance calculations
Jul 29, 2019
d27cd59
cbdbscan docstring
Aug 28, 2019
42aa7ad
moved validate_core outside
Aug 28, 2019
eaf62d6
added citation note
aloosley Aug 28, 2019
2e2eb16
moved more stuff outside of _generate_stable_topics
Aug 28, 2019
5d23f1c
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
Aug 28, 2019
0a6c1f6
typos
Aug 28, 2019
0002982
explained CBDBSCAN
aloosley Aug 28, 2019
954659a
merged to remote --> CBDBSCAN explanation
aloosley Aug 28, 2019
9675016
added extra explanation:
aloosley Aug 28, 2019
b53704a
using none instead of nan for unchecked core
Aug 28, 2019
06c5659
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
Aug 28, 2019
4f3de96
updated docs
aloosley Aug 28, 2019
71c083c
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley Aug 28, 2019
2ed2cfe
refactored kind to class, fixed check how to proceed with topic_model…
Aug 28, 2019
f39d09c
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
Aug 28, 2019
2abce75
reverted change that accidentally broke things
Aug 28, 2019
511eaa5
fixed tests locally
Sep 8, 2019
96d6fbd
fix code style
Sep 8, 2019
819a05f
added _is_easy_valid_cluster
Sep 11, 2019
e3025f6
updated thesis reference
Sep 11, 2019
95e1b79
updated notebook example, typo
Sep 12, 2019
cd7934d
Merge branch 'master' of https://github.com/RaRe-Technologies/gensim …
Sep 14, 2019
3d11649
docstring styles, renaming, cleanup, stuff I need to discuss first
Sep 14, 2019
71b3825
tox
Sep 14, 2019
2d184a4
fixed stuff in CBDBSCAN
Sep 15, 2019
b0c5155
removed unused results column and only CB-Distance to other cores
Sep 15, 2019
70a0660
tox whitespace
Sep 15, 2019
c8b23ab
cleaned obsolete stuff from cbdbscan
Sep 15, 2019
6f98624
idk
Oct 27, 2019
799c112
updated doc-strings to be clearer and better reflect the truth
aloosley Oct 28, 2019
9866ef6
make flake8 happy
mpenkov Nov 9, 2019
ad97ead
fix trailing whitespace
mpenkov Nov 9, 2019
765b912
reverted some changes
Nov 11, 2019
32ff401
comma, newline, comment
Nov 11, 2019
b00e010
whitespace
Nov 11, 2019
c58c30e
citation, reference, authors
Nov 11, 2019
637640c
potential fix for utils saveload when a class is in __dict__
Nov 11, 2019
d7efe3a
commented out eLDA tests, tox8
Nov 11, 2019
644f2e4
saving the topic_model_class using a string instead
Nov 23, 2019
a7bbfc0
reverted utils
Nov 23, 2019
2abdbf7
saving the topic_model_class using a string instead fixes
Nov 23, 2019
c559de6
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Nov 23, 2019
254ca60
tox
Nov 23, 2019
aec58c9
quotes in logger.error
Nov 23, 2019
f7de190
multiline string
Nov 23, 2019
dfc97a0
python 3.5 format strings
Nov 23, 2019
88b338b
ModuleNotFoundError: No module named 'numpy.random._pickle'
Nov 23, 2019
b96c041
ModuleNotFoundError: No module named 'numpy.random._pickle' x2
Nov 23, 2019
584cd70
fixed inference
Dec 14, 2019
c1cd036
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
Dec 14, 2019
766f562
removed print asdf
Dec 14, 2019
46b7cf6
added spec for inference
Dec 14, 2019
6689eae
tox
Dec 14, 2019
b1f596d
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Jan 1, 2020
80f4f04
lazy loading topic_model_class
Jan 23, 2020
b05bf11
tox
Jan 23, 2020
39a9b62
removed debug thing
Jan 23, 2020
09ded13
Documents now compile
aloosley Jan 23, 2020
9685b00
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley Jan 23, 2020
4bd7cf7
escape sequence thing indent fix
Jan 23, 2020
49253df
Better document rendering and added opinosiscorpus to apirefs
aloosley Jan 23, 2020
50fa88b
docstring styling on opinosiscorpus.py
Jan 23, 2020
5669f58
citation opinosis
Jan 23, 2020
f63eb03
Merge remote-tracking branch 'remotes/original/develop' into EnsembleLda
Jan 23, 2020
2aabe8f
missing opinosiscorpus.rst file committed
aloosley Jan 24, 2020
f0600dd
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley Jan 24, 2020
9c25cf5
p names refactored to be descriptive, now using append for appending …
aloosley Feb 6, 2020
4079c27
Changing to hanging indents where they were not used before
aloosley Feb 6, 2020
f5379ff
Adding :meth: and `` `` styling for RST
aloosley Feb 6, 2020
591cf77
a bunch of reviews
Feb 6, 2020
1fbafcb
merge
Feb 6, 2020
7366495
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Feb 6, 2020
f1aba3e
* Changed ensemblelda default to use ldamulticore instead of old lda …
aloosley Feb 6, 2020
9833e44
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley Feb 6, 2020
a9428de
More docstring polish
aloosley Feb 6, 2020
3bdeaf2
removing some camel-case vars for pep8 compliance.
aloosley Feb 6, 2020
806952e
a bunch of reviews
Feb 6, 2020
e1344bd
merge
Feb 6, 2020
3d24f62
fixed linter
Feb 9, 2020
dd5cd6e
Merge remote-tracking branch 'remotes/original/develop' into EnsembleLda
Sep 26, 2020
e3f5021
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Oct 14, 2020
4b92849
Merge remote-tracking branch 'remotes/original/master' into EnsembleLda
Oct 24, 2020
cf24141
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Oct 24, 2020
619922d
less precision for windows
Oct 24, 2020
a6ae08e
Merge remote-tracking branch 'remotes/origin/EnsembleLda' into Ensemb…
Oct 24, 2020
3f00414
hanging indents
Oct 24, 2020
a15bb42
somehow recognized some old versions of unrelated files as current ch…
Oct 24, 2020
17e3e11
fixed autoformat of IDEA
Oct 24, 2020
1bb0194
same thing for tox.ini
Oct 24, 2020
9baa79f
hanging indents in opinosis notebook
Oct 24, 2020
1fa3729
I hate windows
Oct 24, 2020
d3a6151
test for LdaMulcitore ensemble similarity
Oct 24, 2020
216ae64
no
Oct 24, 2020
e238fc9
fixed wrong max calculation in loop
Oct 24, 2020
68d5814
improved comment for sorting clusters
Oct 24, 2020
d299f07
added ensemblelda tutorial
Oct 24, 2020
1493396
added test that starts with an empty ensemble
Oct 24, 2020
87a11c5
some logging of auto parameters
Oct 24, 2020
beedbd3
update auto_examples
Oct 24, 2020
e84e8c2
changed eps in example
Oct 24, 2020
0b5aaf2
added to tutorials
Oct 25, 2020
3bd838e
rebuilt
Oct 25, 2020
3057927
update docs
Jan 23, 2021
847874a
attempt at fixing that pickle problem
Jan 23, 2021
c073fff
Merge remote-tracking branch 'original/develop' into EnsembleLda
Jan 24, 2021
f8634ef
merge
Jan 24, 2021
f7bc8b2
idk
Jan 24, 2021
270f4a8
review
May 5, 2021
c39848c
stream documents instead of loading all into memory
May 5, 2021
af39b37
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
May 5, 2021
05bd52e
streaming docs instead of loading all into memory
aloosley May 5, 2021
6addfc3
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
aloosley May 5, 2021
115cd15
.format replaced with f-string
aloosley May 5, 2021
f56b60d
docstring
aloosley May 5, 2021
18a86a3
Merge branch 'develop' into EnsembleLda
piskvorky May 5, 2021
329675f
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
piskvorky May 5, 2021
28739ce
tuple variable expansion
aloosley May 5, 2021
e11e77c
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
aloosley May 5, 2021
3369975
removed duplicate string in module docstring
May 5, 2021
ed91a3f
removed obsolete parent connection comment
sezanzeb May 5, 2021
42ac449
reliable -> stable
sezanzeb May 5, 2021
bbd6c2f
topic model intro v1
aloosley May 31, 2021
c032b20
topic model intro v2
sezanzeb May 31, 2021
8c99b51
topic model intro v3
sezanzeb May 31, 2021
41e3591
elda introduction with references
aloosley May 31, 2021
8b4a3f9
merge develop
aloosley Jun 9, 2021
3ed4522
Update gensim/corpora/opinosiscorpus.py
sezanzeb Jun 23, 2021
c1d2d57
Update gensim/models/ensemblelda.py
sezanzeb Jun 23, 2021
128c9dc
update docstrings
Jun 23, 2021
2ee6d6f
Merge branch 'EnsembleLda' of https://github.com/sezanzeb/gensim into…
Jun 23, 2021
e07906a
static functions
Jun 23, 2021
8647aea
module-level constant
Jun 23, 2021
5109799
assert and no return
Jun 23, 2021
0dd9f29
static _calculate_asymmetric_distance_matrix_chunk
Jun 23, 2021
78a16b4
better variable names and data types
aloosley Jun 30, 2021
8c10975
refactoring
Jun 30, 2021
dbb7581
pythonic varnames + pytest style asserts
aloosley Jun 30, 2021
0081b6b
merge feature branch
aloosley Jun 30, 2021
06cc33a
better var names
aloosley Jun 30, 2021
ec4b487
better var names
aloosley Jun 30, 2021
c4a46e9
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
aloosley Jun 30, 2021
531ac6a
simplified some function calls to use attributes instead of parameters
Jun 30, 2021
59270dc
Merge branch 'EnsembleLda' of https://github.com/sezanzeb/gensim into…
Jun 30, 2021
0311d94
sort key function
aloosley Jun 30, 2021
3e00f19
Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…
aloosley Jun 30, 2021
9617e8f
more efficient tests with better case names
aloosley Jun 30, 2021
07f5148
new reference model
Jul 4, 2021
6dc4bcb
updated opinosis example
Jul 4, 2021
1e5108c
tox
Jul 4, 2021
9ac3439
using dataclasses
Jul 5, 2021
773ce17
updated type syntax for docstring
Jul 5, 2021
4d674f9
unused import
Jul 5, 2021
c35fb01
update sbt install step
mpenkov Jul 18, 2021
71b33dd
minor refactoring
mpenkov Jul 18, 2021
444c190
roll back change to docs/src/Makefile
mpenkov Jul 18, 2021
f00aca8
re-raise caught exception instead of raising a new one
mpenkov Jul 18, 2021
cac6819
add docstring
mpenkov Jul 22, 2021
177 changes: 177 additions & 0 deletions docs/notebooks/ensemble_lda_with_opinosis.ipynb
@@ -0,0 +1,177 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"import logging\n",
"from gensim.models import EnsembleLda, LdaMulticore\n",
"from gensim.corpora import OpinosisCorpus\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Enable the ensemble logger to show what it is currently doing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elda_logger = logging.getLogger(EnsembleLda.__module__)\n",
"elda_logger.setLevel(logging.INFO)\n",
"elda_logger.addHandler(logging.StreamHandler())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def pretty_print_topics():\n",
"    # note that the words are stemmed so they appear chopped off\n",
"    for t in elda.print_topics(num_words=7):\n",
"        print('-', t[1].replace('*',' ').replace('\"','').replace(' +',','), '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Experiments on the Opinosis Dataset\n",
"\n",
"Opinosis [1] is a small (but redundant) corpus that contains 289 product reviews for 51 products. Since it's so small, the results are rather unstable.\n",
"\n",
"[1] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, _Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions [online],_ Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 340–348. Available from: https://kavita-ganesan.com/opinosis/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing the corpus\n",
"\n",
"First, download the Opinosis dataset. On Linux, for example, this can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir ~/opinosis\n",
"!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip\n",
"!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = os.path.expanduser('~/opinosis/')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The corpus and the id2word mapping can be created using the OpinosisCorpus class provided in the package.\n",
"It preprocesses the data using the PorterStemmer and stopwords from the nltk package.\n",
"\n",
"The parameter of the class is the path to the folder into which the zip file was extracted before. That folder contains a 'summaries-gold' subfolder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"opinosis = OpinosisCorpus(path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**parameters**\n",
"\n",
"**topic_model_class** 'ldamulticore' is highly recommended for EnsembleLda. **ensemble_workers** and **distance_workers** are used to reduce the time needed to train the models, as is the **masking_method** 'rank'. ldamulticore is not able to fully utilize all cores on this small corpus, so **ensemble_workers** can be set to 3 to get 95-100% CPU usage on my i5 3470.\n",
"\n",
"Since the corpus is so small, a high value for **num_models** is needed to extract stable topics. The Opinosis corpus contains 51 categories; however, some of them are quite similar. For example, there are 3 categories about the batteries of portable products. There are also multiple categories about cars. So I chose 20 for **num_topics**, which is smaller than the number of categories."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elda = EnsembleLda(\n",
"    corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,\n",
"    passes=20, iterations=100, ensemble_workers=3, distance_workers=4,\n",
"    topic_model_class='ldamulticore', masking_method='rank',\n",
")\n",
"pretty_print_topics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default for **min_samples** would be 64, half of the number of models, and the default for **eps** would be 0.1. You basically play around with them until you find a sweet spot that fits your needs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elda.recluster(min_samples=55, eps=0.14)\n",
"pretty_print_topics()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
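The notebook's `pretty_print_topics` helper tidies each topic string with three chained `replace` calls. A minimal sketch of what those calls do is shown below, assuming the topic strings have the `weight*"word" + weight*"word"` shape; the function name `prettify_topic` and the sample string (including its weights) are invented for illustration and are not output from a trained model.

```python
def prettify_topic(topic: str) -> str:
    """Apply the same three replacements as the notebook's helper."""
    # '*' -> ' ' separates weight and word, '"' is dropped, ' +' becomes ','
    return topic.replace('*', ' ').replace('"', '').replace(' +', ',')


# Invented sample in the assumed `weight*"word" + ...` shape (hypothetical data).
sample = '0.020*"batteri" + 0.015*"life" + 0.010*"charg"'
print(prettify_topic(sample))
```

Because the words are stemmed, tokens such as `batteri` appear chopped off; the replacements only change the separators, not the words themselves.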
2 changes: 2 additions & 0 deletions docs/src/apiref.rst
@@ -22,13 +22,15 @@ Modules:
corpora/malletcorpus
corpora/mmcorpus
corpora/_mmreader
corpora/opinosiscorpus
corpora/sharded_corpus
corpora/svmlightcorpus
corpora/textcorpus
corpora/ucicorpus
corpora/wikicorpus
models/ldamodel
models/ldamulticore
models/ensemblelda
models/nmf
models/lsimodel
models/ldaseqmodel
10 changes: 5 additions & 5 deletions docs/src/auto_examples/howtos/run_doc.ipynb
@@ -15,14 +15,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"\nHow to Author Gensim Documentation\n==================================\n\nHow to author documentation for Gensim.\n\n"
"\n# How to Author Gensim Documentation\n\nHow to author documentation for Gensim.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Background\n----------\n\nGensim is a large project with a wide range of functionality.\nUnfortunately, not all of this functionality is documented **well**, and some of it is not documented at all.\nWithout good documentation, users are unable to unlock Gensim's full potential.\nTherefore, authoring new documentation and improving existing documentation is of great value to the Gensim project.\n\nIf you implement new functionality in Gensim, please include **helpful** documentation.\nBy \"helpful\", we mean that your documentation answers questions that Gensim users may have.\nFor example:\n\n- What is this new functionality?\n- **Why** is it important?\n- **How** is it relevant to Gensim?\n- **What** can I do with it? What are some real-world applications?\n- **How** do I use it to achieve those things?\n- ... and others (if you can think of them, please add them here)\n\nBefore you author documentation, I suggest reading\n`\"What nobody tells you about documentation\" <https://www.divio.com/blog/documentation/>`__\nor watching its `accompanying video <https://www.youtube.com/watch?v=t4vKPhjcMZg>`__\n(or even both, if you're really keen).\n\nThe summary of the above presentation is: there are four distinct kinds of documentation, and you really need them all:\n\n1. Tutorials\n2. Howto guides\n3. Explanations\n4. References\n\nEach kind has its own intended audience, purpose, and writing style.\nWhen you make a PR with new functionality, please consider authoring each kind of documentation.\nAt the very least, you will (indirectly) author reference documentation through module, class and function docstrings.\n\nMechanisms\n----------\n\nWe keep our documentation as individual Python scripts.\nThese scripts live under :file:`docs/src/gallery` in one of several subdirectories:\n\n- core: core tutorials. 
We try to keep this part small, avoid putting stuff here.\n- tutorials: tutorials.\n- howtos: howto guides.\n\nPick a subdirectory and save your script under it.\nPrefix the name of the script with ``run_``: this way, the the documentation builder will run your script each time it builds our docs.\n\nThe contents of the script are straightforward.\nAt the very top, you need a docstring describing what your script does.\n\n"
"## Background\n\nGensim is a large project with a wide range of functionality.\nUnfortunately, not all of this functionality is documented **well**, and some of it is not documented at all.\nWithout good documentation, users are unable to unlock Gensim's full potential.\nTherefore, authoring new documentation and improving existing documentation is of great value to the Gensim project.\n\nIf you implement new functionality in Gensim, please include **helpful** documentation.\nBy \"helpful\", we mean that your documentation answers questions that Gensim users may have.\nFor example:\n\n- What is this new functionality?\n- **Why** is it important?\n- **How** is it relevant to Gensim?\n- **What** can I do with it? What are some real-world applications?\n- **How** do I use it to achieve those things?\n- ... and others (if you can think of them, please add them here)\n\nBefore you author documentation, I suggest reading\n`\"What nobody tells you about documentation\" <https://www.divio.com/blog/documentation/>`__\nor watching its `accompanying video <https://www.youtube.com/watch?v=t4vKPhjcMZg>`__\n(or even both, if you're really keen).\n\nThe summary of the above presentation is: there are four distinct kinds of documentation, and you really need them all:\n\n1. Tutorials\n2. Howto guides\n3. Explanations\n4. References\n\nEach kind has its own intended audience, purpose, and writing style.\nWhen you make a PR with new functionality, please consider authoring each kind of documentation.\nAt the very least, you will (indirectly) author reference documentation through module, class and function docstrings.\n\n## Mechanisms\n\nWe keep our documentation as individual Python scripts.\nThese scripts live under :file:`docs/src/gallery` in one of several subdirectories:\n\n- core: core tutorials. 
We try to keep this part small, avoid putting stuff here.\n- tutorials: tutorials.\n- howtos: howto guides.\n\nPick a subdirectory and save your script under it.\nPrefix the name of the script with ``run_``: this way, the the documentation builder will run your script each time it builds our docs.\n\nThe contents of the script are straightforward.\nAt the very top, you need a docstring describing what your script does.\n\n"
]
},
{
@@ -54,14 +54,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Authoring Workflow\n------------------\n\nThere are several ways to author documentation.\nThe simplest and most straightforward is to author your ``script.py`` from scratch.\nYou'll have the following cycle:\n\n1. Make changes\n2. Run ``python script.py``\n3. Check standard output, standard error and return code\n4. If everything works well, stop.\n5. Otherwise, go back to step 1).\n\nIf the above is not your cup of tea, you can also author your documentation as a Jupyter notebook.\nThis is a more flexible approach that enables you to tweak parts of the documentation and re-run them as necessary.\n\nOnce you're happy with the notebook, convert it to a script.py.\nThere's a helpful `script <https://github.com/mpenkov/gensim/blob/numfocus/docs/src/tools/to_python.py>`__ that will do it for you.\nTo use it::\n\n python to_python.py < notebook.ipynb > script.py\n\nYou may have to touch up the resulting ``script.py``.\nMore specifically:\n\n- Update the title\n- Update the description\n- Fix any issues that the markdown-to-RST converter could not deal with\n\nOnce your script.py works, put it in a suitable subdirectory.\nPlease don't include your original Jupyter notebook in the repository - we won't be using it.\n\n"
"## Authoring Workflow\n\nThere are several ways to author documentation.\nThe simplest and most straightforward is to author your ``script.py`` from scratch.\nYou'll have the following cycle:\n\n1. Make changes\n2. Run ``python script.py``\n3. Check standard output, standard error and return code\n4. If everything works well, stop.\n5. Otherwise, go back to step 1).\n\nIf the above is not your cup of tea, you can also author your documentation as a Jupyter notebook.\nThis is a more flexible approach that enables you to tweak parts of the documentation and re-run them as necessary.\n\nOnce you're happy with the notebook, convert it to a script.py.\nThere's a helpful `script <https://github.com/RaRe-Technologies/gensim/blob/develop/docs/src/tools/to_python.py>`__ that will do it for you.\nTo use it::\n\n python to_python.py < notebook.ipynb > script.py\n\nYou may have to touch up the resulting ``script.py``.\nMore specifically:\n\n- Update the title\n- Update the description\n- Fix any issues that the markdown-to-RST converter could not deal with\n\nOnce your script.py works, put it in a suitable subdirectory.\nPlease don't include your original Jupyter notebook in the repository - we won't be using it.\n\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Correctness\n-----------\n\nIncorrect documentation can be worse than no documentation at all.\nTake the following steps to ensure correctness:\n\n- Run Python's doctest module on your docstrings\n- Run your documentation scripts from scratch, removing any temporary files/results\n\nUsing data in your documentation\n--------------------------------\n\nSome parts of the documentation require real-world data to be useful.\nFor example, you may need more than just a toy example to demonstrate the benefits of one model over another.\nThis subsection provides some tips for including data in your documentation.\n\nIf possible, use data available via Gensim's\n`downloader API <https://radimrehurek.com/gensim/gensim_numfocus/auto_examples/010_tutorials/run_downloader_api.html>`__.\nThis will reduce the risk of your documentation becoming obsolete because required data is no longer available.\n\nUse the smallest possible dataset: avoid making people unnecessarily load large datasets and models.\nThis will make your documentation faster to run and easier for people to use (they can modify your examples and re-run them quickly).\n\nFinalizing your contribution\n----------------------------\n\nFirst, get Sphinx Gallery to build your documentation::\n\n make -C docs/src html\n\nThis can take a while if your documentation uses a large dataset, or if you've changed many other tutorials or guides.\nOnce this completes successfully, open ``docs/auto_examples/index.html`` in your browser.\nYou should see your new tutorial or guide in the gallery.\n\nOnce your documentation script is working correctly, it's time to add it to the git repository::\n\n git add docs/src/gallery/tutorials/run_example.py\n git add docs/src/auto_examples/tutorials/run_example.{py,py.md5,rst,ipynb}\n git add docs/src/auto_examples/howtos/sg_execution_times.rst\n git commit -m \"enter a helpful commit message here\"\n git push origin branchname\n\n.. 
Note::\n You may be wondering what all those other files are.\n Sphinx Gallery puts a copy of your Python script in ``auto_examples/tutorials``.\n The .md5 contains MD5 hash of the script to enable easy detection of modifications.\n Gallery also generates .rst (RST for Sphinx) and .ipynb (Jupyter notebook) files from the script.\n Finally, ``sg_execution_times.rst`` contains the time taken to run each example.\n\nFinally, make a PR on `github <https://github.com/RaRe-Technologies/gensim>`__.\nOne of our friendly maintainers will review it, make suggestions, and eventually merge it.\nYour documentation will then appear in the gallery alongside the rest of the example.\nAt that stage, give yourself a pat on the back: you're done!\n\n"
"## Correctness\n\nIncorrect documentation can be worse than no documentation at all.\nTake the following steps to ensure correctness:\n\n- Run Python's doctest module on your docstrings\n- Run your documentation scripts from scratch, removing any temporary files/results\n\n## Using data in your documentation\n\nSome parts of the documentation require real-world data to be useful.\nFor example, you may need more than just a toy example to demonstrate the benefits of one model over another.\nThis subsection provides some tips for including data in your documentation.\n\nIf possible, use data available via Gensim's\n`downloader API <https://radimrehurek.com/gensim/gensim_numfocus/auto_examples/010_tutorials/run_downloader_api.html>`__.\nThis will reduce the risk of your documentation becoming obsolete because required data is no longer available.\n\nUse the smallest possible dataset: avoid making people unnecessarily load large datasets and models.\nThis will make your documentation faster to run and easier for people to use (they can modify your examples and re-run them quickly).\n\n## Finalizing your contribution\n\nFirst, get Sphinx Gallery to build your documentation::\n\n make -C docs/src html\n\nThis can take a while if your documentation uses a large dataset, or if you've changed many other tutorials or guides.\nOnce this completes successfully, open ``docs/auto_examples/index.html`` in your browser.\nYou should see your new tutorial or guide in the gallery.\n\nOnce your documentation script is working correctly, it's time to add it to the git repository::\n\n git add docs/src/gallery/tutorials/run_example.py\n git add docs/src/auto_examples/tutorials/run_example.{py,py.md5,rst,ipynb}\n git add docs/src/auto_examples/howtos/sg_execution_times.rst\n git commit -m \"enter a helpful commit message here\"\n git push origin branchname\n\n.. 
Note::\n You may be wondering what all those other files are.\n Sphinx Gallery puts a copy of your Python script in ``auto_examples/tutorials``.\n The .md5 contains MD5 hash of the script to enable easy detection of modifications.\n Gallery also generates .rst (RST for Sphinx) and .ipynb (Jupyter notebook) files from the script.\n Finally, ``sg_execution_times.rst`` contains the time taken to run each example.\n\nFinally, make a PR on `github <https://github.com/RaRe-Technologies/gensim>`__.\nOne of our friendly maintainers will review it, make suggestions, and eventually merge it.\nYour documentation will then appear in the gallery alongside the rest of the example.\nAt that stage, give yourself a pat on the back: you're done!\n\n"
]
}
],
@@ -81,7 +81,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.9.1"
}
},
"nbformat": 4,
2 changes: 1 addition & 1 deletion docs/src/auto_examples/howtos/run_doc.py
@@ -111,7 +111,7 @@
# This is a more flexible approach that enables you to tweak parts of the documentation and re-run them as necessary.
#
# Once you're happy with the notebook, convert it to a script.py.
# There's a helpful `script <https://github.com/mpenkov/gensim/blob/numfocus/docs/src/tools/to_python.py>`__ that will do it for you.
# There's a helpful `script <https://github.com/RaRe-Technologies/gensim/blob/develop/docs/src/tools/to_python.py>`__ that will do it for you.
# To use it::
#
# python to_python.py < notebook.ipynb > script.py
2 changes: 1 addition & 1 deletion docs/src/auto_examples/howtos/run_doc.py.md5
@@ -1 +1 @@
b3db0b66859316de13e1a36fa6181657
512a76ce743dd12482d21784a76b60fe
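The `.md5` file changes here because the script it tracks changed. As a rough illustration of how such a checksum can flag a modified script, the sketch below hashes a file's bytes with Python's `hashlib`. It assumes the checksum is a plain MD5 over the file contents, which this diff does not confirm; `file_md5` and the temporary `run_example.py` are hypothetical names used only for the demonstration.

```python
import hashlib
import os
import tempfile


def file_md5(path: str) -> str:
    """Return the hex MD5 digest of a file's raw bytes."""
    with open(path, "rb") as fin:
        return hashlib.md5(fin.read()).hexdigest()


# Demonstrate on a throwaway file: editing the content changes the digest.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "run_example.py")  # hypothetical script name
    with open(script, "w") as fout:
        fout.write("print('hello')\n")
    before = file_md5(script)

    with open(script, "a") as fout:
        fout.write("# edited\n")
    after = file_md5(script)

print(before != after)
```

Comparing a stored digest with a freshly computed one is a cheap way to decide whether a script needs to be re-run.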