Refactor documentation for gensim.similarities.docsim. #1910

Merged
merged 18 commits into from
Feb 23, 2018
Changes from 3 commits
26 changes: 22 additions & 4 deletions gensim/corpora/textcorpus.py
@@ -216,13 +216,31 @@ def __init__(self, input=None, dictionary=None, metadata=False, character_filter

Examples
--------
>>> #TODO Example with inheritance
>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim import corpora
>>> from gensim.test.utils import datapath
>>> from gensim import utils
>>>
>>> corpus = TextCorpus(datapath('head500.noblanks.cor.bz2'))
>>> for bow in corpus:
...     pass
>>> class CorpusMiislita(corpora.TextCorpus):
>>>     stoplist = set('for a of the and to in on'.split())
>>>
>>>     def get_texts(self):
>>>         for doc in self.getstream():
>>>             yield [word for word in utils.to_unicode(doc).lower().split()

Review comment (Contributor):

This example has some formatting issues.

>>>                    if word not in CorpusMiislita.stoplist]
>>>
>>>     def __len__(self):
>>>         if 'length' not in self.__dict__:

Review comment (Contributor):

There is no need to log anything here; this should be a small, simple example.

>>>             logger.info("caching corpus size (calculating number of documents)")
>>>             self.length = sum(1 for _ in self.get_texts())
>>>         return self.length
>>>
>>> corpus = CorpusMiislita(datapath('head500.noblanks.cor.bz2'))
>>> corpus.get_texts()
<generator object get_texts at 0x7fa932f397d0>

Review comment (Contributor):

This output is not useful. Can you show a concrete line of the dataset instead, e.g. next(iter(corpus.get_texts()))?

>>> corpus.__len__()

Review comment (Contributor):

Please use len(corpus) instead of this; calling a "magic" method directly is a bad pattern (it is justified only in specific cases).

250


"""
self.input = input
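
For illustration only, here is a minimal sketch of what the review comments ask for: no logging, a concrete first document instead of a generator repr, and len(corpus) instead of a direct __len__() call. TinyCorpus and its in-memory input lines are hypothetical stand-ins, not gensim code, so the snippet runs without gensim or its test data; it is not the example that was eventually merged.

```python
class TinyCorpus:
    """Hypothetical stand-in for a TextCorpus subclass (not part of gensim)."""

    stoplist = frozenset('for a of the and to in on'.split())

    def __init__(self, lines):
        # Assumption: documents are plain strings held in memory.
        self.lines = list(lines)
        self.length = None  # cached document count, filled in lazily

    def get_texts(self):
        # Yield one list of lowercased, stopword-filtered tokens per document.
        for line in self.lines:
            yield [word for word in line.lower().split()
                   if word not in self.stoplist]

    def __len__(self):
        # Count documents once and cache the result; no logging needed.
        if self.length is None:
            self.length = sum(1 for _ in self.get_texts())
        return self.length


corpus = TinyCorpus(["The quick brown fox", "A survey of graph minors"])

# Show a concrete document rather than the generator object's repr.
first_doc = next(iter(corpus.get_texts()))
print(first_doc)    # ['quick', 'brown', 'fox']

# Use the built-in len(), which dispatches to __len__ internally.
print(len(corpus))  # 2
```

Calling len(corpus) keeps the example idiomatic, and printing the first document gives readers output that is stable across runs, unlike a generator's memory address.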