Fix regression bug in tutorial #2399

Merged
merged 1 commit into from Mar 20, 2019
30 changes: 20 additions & 10 deletions docs/src/tut1.rst
@@ -50,17 +50,21 @@ as well as words that only appear once in the corpus:
>>>
>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>>          for document in documents]
>>> texts = [
>>>     [word for word in document.lower().split() if word not in stoplist]
Collaborator


Do we want to demonstrate the best practice for string tokenization here?

Owner Author


The best practice is to use a dedicated library (e.g. NLTK)… do we want to complicate the tutorial?

Maybe we should.

But if we don't, this should be something really stupid (like it is now), so people don't accidentally copy-paste thinking it's a good idea.

Collaborator


I was thinking more along the lines of https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize

WDYT?

Owner Author


OK, that's an option (but utils.simple_preprocess better),
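For illustration only, not part of this PR: the two helpers discussed above could be used roughly like this (both live in gensim.utils; the output differs slightly from a plain str.split(), and the stoplist filtering would still be applied afterwards):

    from gensim import utils

    # utils.tokenize yields tokens lazily; lowercase=True folds case
    texts = [list(utils.tokenize(document, lowercase=True)) for document in documents]

    # utils.simple_preprocess returns a list of lowercased tokens and also
    # drops very short and very long tokens
    texts = [utils.simple_preprocess(document) for document in documents]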

>>>     for document in documents
>>> ]
>>>
>>> # remove words that appear only once
>>> frequency = defaultdict(int)
Collaborator

@mpenkov mpenkov Mar 3, 2019


I know this isn't part of the PR, but it may be better to simplify this section as:

# untested code but "should" work
import collections
frequency = collections.Counter()
for text in texts:
    frequency.update(text)

Owner Author


Fine with me.
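For reference, and likewise outside the scope of this PR, the counting step suggested above can be collapsed into a single Counter call; an untested sketch:

    import collections

    # count every token across all documents in one pass
    frequency = collections.Counter(token for text in texts for token in text)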

>>> for text in texts:
>>>     for token in text:
>>>         frequency[token] += 1
>>>
>>> texts = [[token for token in text if frequency[token] > 1]
>>>          for text in texts]
>>> texts = [
>>>     [token for token in text if frequency[token] > 1]
>>>     for text in texts
>>> ]
>>>
>>> pprint(texts)
[['human', 'interface', 'computer'],
@@ -95,10 +99,13 @@ a question-answer pair, in the style of:
It is advantageous to represent the questions only by their (integer) ids. The mapping
between the questions and ids is called a dictionary:

>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference
>>> print(dictionary)
Dictionary(12 unique tokens)
.. sourcecode:: pycon

    >>> from gensim import corpora
    >>> dictionary = corpora.Dictionary(texts)
    >>> dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference
    >>> print(dictionary)
    Dictionary(12 unique tokens)
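(Side note, not shown in this hunk: the resulting word-to-id mapping can be inspected through the dictionary's token2id attribute, which prints as a plain dict of word → integer id; the exact ids depend on processing order.)

.. sourcecode:: pycon

    >>> print(dictionary.token2id)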

Here we assigned a unique integer id to all words appearing in the corpus with the
:class:`gensim.corpora.dictionary.Dictionary` class. This sweeps across the texts, collecting word counts
@@ -207,8 +214,11 @@ Similarly, to construct the dictionary without loading all texts into memory:
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
Collaborator

@mpenkov mpenkov Mar 3, 2019


Can we use a real corpus here? Also, we may want to show the best practice for string tokenization here.

That way, we can doctest this file.
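A rough sketch of what that suggestion could look like, reusing gensim's own tokenizer for the streaming construction below (mycorpus.txt remains a placeholder path, as in the tutorial):

    from gensim import corpora, utils

    # build the dictionary one line (= one document) at a time,
    # without loading the whole file into memory
    dictionary = corpora.Dictionary(
        utils.simple_preprocess(line) for line in open('mycorpus.txt')
    )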

>>> # remove stop words and words that appear only once
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
>>>             if stopword in dictionary.token2id]
>>> stop_ids = [
>>>     dictionary.token2id[stopword]
>>>     for stopword in stoplist
>>>     if stopword in dictionary.token2id
>>> ]
>>> once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once
>>> dictionary.compactify() # remove gaps in id sequence after words that were removed
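As an aside, unrelated to the fix itself: the "appears only once" part of this filtering can also be expressed with Dictionary.filter_extremes (stop words would still need filter_tokens); a hedged sketch:

    # keep only tokens that appear in at least 2 documents; no_above=1.0
    # disables the upper document-frequency cutoff, and the resulting gaps
    # in word ids are compacted automatically
    dictionary.filter_extremes(no_below=2, no_above=1.0)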