-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix docstrings for gensim.models.AuthorTopicModel
#1907
Conversation
The PR aims to fix the docstrings for Author-topic model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start, please fix current problems & write docstrings for the next part of this file
gensim/models/atmodel.py
Outdated
|
||
Parameters | ||
---------- | ||
doc2author: dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict of ???
here and everywhere
gensim/models/atmodel.py
Outdated
|
||
Parameters | ||
---------- | ||
corpus: list of list of str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe iterable of ...
?
gensim/models/atmodel.py
Outdated
---------- | ||
eta: float | ||
Dirichlet topic parameter for sparsity. | ||
lambda_shape: float |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incorrect type
@@ -107,96 +134,69 @@ def construct_author2doc(doc2author): | |||
|
|||
|
|||
class AuthorTopicModel(LdaModel): | |||
""" | |||
The constructor estimates the author-topic model parameters based |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need end2end example of usage in docstring (i.e. I can copy-paste it and this should works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done at the end of the docstring as a separate section.
gensim/models/atmodel.py
Outdated
---------- | ||
num_topic: int, optional | ||
Number of topics to be extracted from the training corpus. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No needed empty lines between parameter definition (here and everywhere)
@@ -68,6 +68,18 @@ class AuthorTopicState(LdaState): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing docstring for module (examples of usage, liks to related papers, etc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a usage example in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's about related papers/links/etc? add one more example - almost always a good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the paper on Author topic model to the top of the script.
@souravsingh hi, have you some schedule for this file (some date when you plan to finish)? |
I will be making some final updates by today |
You have many "TODO" now, good luck, I hope you will succeed today @souravsingh 👍 |
I have pushed the changes for the rest of the functions. I would need a review for the changes before the patch can be merged. |
gensim/models/atmodel.py
Outdated
Example | ||
------- | ||
>>> import numpy as np | ||
>>> from gensim.models import AuthorTopicModel | ||
>>> model = AuthorTopicModel(corpus, num_topics=100, author2doc=author2doc, id2word=id2word) # train model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this example doesn't work (try to copy-paste it to console and run), this must work.
|
||
Parameters | ||
---------- | ||
expElogthetad: numpy.ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is parameter description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasnt sure about what to write for the description, which is why I left it blank.
gensim/models/atmodel.py
Outdated
@@ -379,6 +381,26 @@ def inference(self, chunk, author2doc, doc2author, rhot, collect_sstats=False, c | |||
Avoids computing the `phi` variational parameter directly using the | |||
optimization presented in **Lee, Seung: Algorithms for non-negative matrix factorization, NIPS 2001**. | |||
|
|||
Parameters | |||
---------- | |||
chunk: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunk: int
-> chunk : int
(here and everywhere)
gensim/models/atmodel.py
Outdated
---------- | ||
chunk: int | ||
The chunk numer of the sparse document vector on which inference needs to be done. | ||
author2doc: dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict of ???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a dict of str and str?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dict of (type, type)
, check by tests, what's types should be here
gensim/models/atmodel.py
Outdated
@@ -332,6 +323,11 @@ def extend_corpus(self, corpus): | |||
are added in the process. If serialization is not used, the corpus, as a list | |||
of documents, is simply extended. | |||
|
|||
Parameters | |||
---------- | |||
corpus: list of list of str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure?
@@ -68,6 +68,18 @@ class AuthorTopicState(LdaState): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's about related papers/links/etc? add one more example - almost always a good idea
@souravsingh ping me when you'll be ready with PR please |
@menshikh-iv The patch is ready for review. I will fix the flake8 issues once I take care of the comments. |
gensim/models/atmodel.py
Outdated
@@ -58,7 +55,7 @@ | |||
class AuthorTopicState(LdaState): | |||
""" | |||
NOTE: distributed mode not available yet in the author-topic model. This AuthorTopicState | |||
object is kept so that when the time comes to imlement it, it will be easier. | |||
object is kept so that when the time comes to implement it, it will be easier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to use
Notes
--------
....
Instead of NOTE:
and similar things
gensim/models/atmodel.py
Outdated
|
||
Do not call this method directly, instead use `model[author_names]`. | ||
|
||
Parameters | ||
---------- | ||
author_names : str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incorrect type
gensim/models/atmodel.py
Outdated
@@ -950,10 +1085,22 @@ def __getitem__(self, author_names, eps=None): | |||
Return topic distribution for input author as a list of | |||
(topic_id, topic_probabiity) 2-tuples. | |||
|
|||
Ingores topics with probaility less than `eps`. | |||
Ignores topics with probaility less than `eps`. | |||
|
|||
Do not call this method directly, instead use `model[author_names]`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be wrapped in Warnings
section
gensim/models/atmodel.py
Outdated
|
||
Example | ||
------- | ||
>>> author_vecs = [model.get_author_topics(author) for author in model.id2author.values()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example doesn't work, you should define all variable & make all imports first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simple test for example: I copy-paste it to REPL, run and this works successfully
gensim/models/atmodel.py
Outdated
|
||
Returns | ||
------- | ||
list of 2-tuples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be list of (some_type_1, some_type_2)
, here and everywhere
gensim/models/atmodel.py
Outdated
---------- | ||
chunk : int | ||
The chunk numer of the sparse document vector on which inference needs to be done. | ||
author2doc : dict of {str: list of ints} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use ()
instead of {}
in type definition, here and everywhere
gensim/models/atmodel.py
Outdated
For other parameter settings, see :class:`AuthorTopicModel` constructor. | ||
Parameters | ||
---------- | ||
corpus : iterable of iterable of (int, int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use iterable of list of (int, number)
, here and everywhere
gensim/models/atmodel.py
Outdated
Parameters | ||
---------- | ||
expElogthetad: numpy.ndarray | ||
Value of variational distribution q(theta|gamma). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better use latex for this stuff (this will be rendered in documentation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the representation for this in LaTex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@souravsingh look into original paper + :math: in sphinx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be like this- q(\theta|\gamma)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@souravsingh maybe, I'm not sure, check original paper.
gensim/models/atmodel.py
Outdated
>>> model = AuthorTopicModel(corpus, num_topics=100, author2doc=author2doc, id2word=id2word) # train model | ||
>>> model.update(corpus2) # update the author-topic model with additional documents | ||
>>> model.update(corpus, author2doc) # update the author-topic model with additional documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example must work, corpus
, author2doc
, id2word
undefined here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, "update with additional", but you use exactly same corpus
here, this confused
@souravsingh please ping me when you'll be ready |
@souravsingh how is going? when you plan to finish PR? |
@menshikh-iv Ready for review |
gensim/models/atmodel.py
Outdated
@@ -19,15 +19,12 @@ | |||
The model is closely related to Latent Dirichlet Allocation. The AuthorTopicModel class | |||
inherits the LdaModel class, and its usage is thus similar. | |||
|
|||
Distributed computation and multiprocessing is not implemented at the moment, but may be | |||
coming in the future. | |||
The model was introduced by Rosen-Zvi and co-authors in 2004 and is described in [1]_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't use [1]
references (this doesn't works properly with one of our sphinx plugin), intead of, you should use something like, here and everywhere
The model was introduced by Rosen-Zvi and co-authors in 2004 and is described in `The Author-Topic Model for Authors and Documents <https://arxiv.org/abs/1207.4169>`_
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it's a good idea to add example of usage in current docstring (head of file)
gensim/models/atmodel.py
Outdated
---------- | ||
eta: float | ||
Dirichlet topic parameter for sparsity. | ||
lambda_shape: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect types (this is (int, int)
, 2d arrays)
gensim/models/atmodel.py
Outdated
|
||
Parameters | ||
---------- | ||
corpus: list of list of str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iterable of list of str
gensim/models/atmodel.py
Outdated
---------- | ||
corpus: list of list of str | ||
Corpus of documents. | ||
author2doc: dict of (str: list of int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should looks like dict of (str, list of int)
, here and everywhere (for all dict of ..
)
|
||
Model persistency is achieved through its `load`/`save` methods. | ||
""" | ||
"""The constructor estimates the author-topic model parameters based on a training corpus.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to move example here (instead of __init__
)
gensim/models/atmodel.py
Outdated
... } | ||
>>> corpus = mmcorpus.MmCorpus(datapath('testcorpus.mm')) | ||
>>> model = AuthorTopicModel(corpus, author2doc=author2doc, id2word=dictionary, num_topics=4, passes=100) # train model | ||
>>> model.update(corpus, author2doc) # update the author-topic model with additional documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but how to apply model? This is main usecase.
gensim/models/atmodel.py
Outdated
@@ -306,6 +330,7 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, author2doc=None, d | |||
self.update(corpus, author2doc, doc2author, chunks_as_numpy=use_numpy) | |||
|
|||
def __str__(self): | |||
"""Return a string representation of AuthorTopicModel class.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, where is Returns
section?
gensim/models/atmodel.py
Outdated
Parameters | ||
---------- | ||
expElogthetad: numpy.ndarray | ||
Value of variational distribution .. math:: q(theta|gamma). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is incorrect using of math
(don't rendered)
gensim/models/atmodel.py
Outdated
|
||
Returns | ||
------- | ||
list of (topic_id, topic_probability) as a 2-tuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list of (int, float)
gensim.models.AuthorTopicModel
ping @souravsingh, how is going? |
gensim/models/atmodel.py
Outdated
>>> model = AuthorTopicModel(corpus, author2doc=author2doc, id2word=dictionary, num_topics=4, passes=100) # train model | ||
>>> model.update(corpus, author2doc) # update the author-topic model with additional documents | ||
>>> author_vecs = [model.get_author_topics(author) for author in model.id2author.values()] | ||
>>> print(author_vecs) #Prints top authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't top authors, this is all authors.
gensim/models/atmodel.py
Outdated
---------- | ||
chunk : int | ||
The chunk numer of the sparse document vector on which inference needs to be done. | ||
author2doc : dict of (strm list of ints) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ints
-> int
strm
-> str,
here and everywhere
gensim/models/atmodel.py
Outdated
|
||
Returns | ||
------- | ||
list of (int, float) as a 2-tuple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list of (int, float)
gensim/models/atmodel.py
Outdated
@@ -19,14 +19,31 @@ | |||
The model is closely related to Latent Dirichlet Allocation. The AuthorTopicModel class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be references to classes like
:class:`~gensim.models...`
gensim/models/atmodel.py
Outdated
|
||
A tutorial can be found at | ||
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks/atmodel_tutorial.ipynb. | ||
The model was introduced by Rosen-Zvi and co-authors in 2004 and is described in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it's a good idea to add basic description here
- what's an use cases for this model
- how this works (in general)
Thank you @souravsingh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes.
chunksize : int, optional | ||
Controls the size of the mini-batches. | ||
passes : int, optional | ||
Number of times the model makes a pass over the entire training data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
training data
=> training corpus
(consistency helps with clarity)
Threshold value of gamma(topic difference between consecutive two topics) | ||
until which the iterations continue. | ||
serialized : bool, optional | ||
Indicates whether the input corpora to the model are simple lists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are "input corpora"? Where do they come from?
What does "or saved to the hard-drive" mean? If false, "input corpora" are saved to a hard-drive? Why are we saving input?
The documentation of these two parameters (serialized
and serialization_path
) is confusing. Who should use this and why?
This is important because the docstring example at the top actually uses these parameters (that's how I found them, I wanted to know what it is).
@@ -306,14 +309,21 @@ def __init__(self, corpus=None, num_topics=100, id2word=None, author2doc=None, d | |||
self.update(corpus, author2doc, doc2author, chunks_as_numpy=use_numpy) | |||
|
|||
def __str__(self): | |||
"""Get a string representation of object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this docstring completely (no added value, just adds fluff).
|
||
Raises | ||
------ | ||
AssertionError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AssertionError
is for programmer mistakes. Bad user input is either ValueError
or TypeError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question for the author of this code, not for me.
The PR aims to fix the docstrings for Author-topic model.
Fixes the documentation of Author-Topic model.