Fix regression bug in tutorial #2399
Changes from all commits
@@ -50,17 +50,21 @@ as well as words that only appear once in the corpus:
>>>
>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
- >>> texts = [[word for word in document.lower().split() if word not in stoplist]
- >>>          for document in documents]
+ >>> texts = [
+ >>>     [word for word in document.lower().split() if word not in stoplist]
+ >>>     for document in documents
+ >>> ]
>>>
>>> # remove words that appear only once
>>> frequency = defaultdict(int)

Review comment: I know this isn't part of the PR, but it may be better to simplify this section as:

# untested code but "should" work
import collections
frequency = collections.Counter()
for text in texts:
    frequency.update(text)

Reply: Fine with me.

>>> for text in texts:
>>>     for token in text:
>>>         frequency[token] += 1
>>>
- >>> texts = [[token for token in text if frequency[token] > 1]
- >>>          for text in texts]
+ >>> texts = [
+ >>>     [token for token in text if frequency[token] > 1]
+ >>>     for text in texts
+ >>> ]
>>>
>>> pprint(texts)
[['human', 'interface', 'computer'],
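
Following up on the collections.Counter suggestion in the comment above: a minimal, untested sketch showing that the Counter variant gives the same counts as the defaultdict loop in the tutorial. The toy texts data here is made up for illustration only.

import collections

# toy stand-in for the tutorial's `texts` variable (illustrative only)
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system']]

# current tutorial approach: defaultdict plus nested loops
frequency = collections.defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# suggested approach: Counter.update() accepts any iterable of tokens
counter = collections.Counter()
for text in texts:
    counter.update(text)

assert dict(frequency) == dict(counter)  # 'computer' counted twice, the rest once
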
@@ -95,10 +99,13 @@ a question-answer pair, in the style of:
It is advantageous to represent the questions only by their (integer) ids. The mapping
between the questions and ids is called a dictionary:

- >>> dictionary = corpora.Dictionary(texts)
- >>> dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference
- >>> print(dictionary)
- Dictionary(12 unique tokens)
+ .. sourcecode:: pycon
+
+     >>> from gensim import corpora
+     >>> dictionary = corpora.Dictionary(texts)
+     >>> dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference
+     >>> print(dictionary)
+     Dictionary(12 unique tokens)

Here we assigned a unique integer id to all words appearing in the corpus with the
:class:`gensim.corpora.dictionary.Dictionary` class. This sweeps across the texts, collecting word counts
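
Not part of this PR, but for completeness: the dictionary persisted with dictionary.save() above can be loaded back later. A minimal sketch in the same pycon style, assuming the /tmp/deerwester.dict file from the snippet above exists:

>>> from gensim import corpora
>>>
>>> # reload the dictionary that was stored with dictionary.save() above
>>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
>>> print(dictionary.token2id)  # mapping from token string to its integer id
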
@@ -207,8 +214,11 @@ Similarly, to construct the dictionary without loading all texts into memory:
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

Review comment: Can we use a real corpus here? Also, we may want to show the best practice for string tokenization here. That way, we can doctest this file.

>>> # remove stop words and words that appear only once
- >>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
- >>>             if stopword in dictionary.token2id]
+ >>> stop_ids = [
+ >>>     dictionary.token2id[stopword]
+ >>>     for stopword in stoplist
+ >>>     if stopword in dictionary.token2id
+ >>> ]
>>> once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once
>>> dictionary.compactify() # remove gaps in id sequence after words that were removed
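
Side note on this hunk (not something the PR changes): Dictionary also ships a filter_extremes() helper that covers a similar clean-up, dropping tokens by document frequency instead of via hand-built id lists. A hedged sketch, with illustrative parameter values:

>>> # drop tokens that appear in fewer than 2 documents or in more than
>>> # 50% of all documents; roughly the same effect as the manual
>>> # stop_ids/once_ids bookkeeping above (values are illustrative)
>>> dictionary.filter_extremes(no_below=2, no_above=0.5)
>>> print(dictionary)
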

Review comment: Do we want to demonstrate the best practice for string tokenization here?

Reply: The best practice is to use a dedicated library (e.g. NLTK)… do we want to complicate the tutorial? Maybe we should. But if we don't, this should be something really stupid (like it is now), so people don't accidentally copy-paste thinking it's a good idea.

Reply: I was thinking more along the lines of https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize
WDYT?

Reply: OK, that's an option (but utils.simple_preprocess is better).
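
For reference, a rough sketch of how that tutorial line could look with the utilities discussed above. Untested here, and mycorpus.txt is the same placeholder file used in the diff:

>>> from gensim import corpora, utils
>>>
>>> # variant 1: utils.tokenize yields unicode tokens lazily
>>> dictionary = corpora.Dictionary(
>>>     utils.tokenize(line, lowercase=True) for line in open('mycorpus.txt')
>>> )
>>>
>>> # variant 2: utils.simple_preprocess lowercases, tokenizes and drops
>>> # very short or very long tokens in a single call
>>> dictionary = corpora.Dictionary(
>>>     utils.simple_preprocess(line) for line in open('mycorpus.txt')
>>> )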