
Fix regression bug in tutorial #2399

Merged
merged 1 commit into from
Mar 20, 2019
Conversation


@piskvorky piskvorky commented Mar 2, 2019

There is a minor bug in the Corpora and vector spaces tutorial: the import of corpora was removed in #2192 (I assume by accident).

Identified by a confused user on our mailing list.

This PR reinstates the import, and fixes the code style while I was at it.

@piskvorky piskvorky requested a review from mpenkov March 2, 2019 19:33
@piskvorky piskvorky added the 3.7.2 label Mar 2, 2019
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>> for document in documents]
>>> texts = [
>>> [word for word in document.lower().split() if word not in stoplist]
Collaborator
Do we want to demonstrate the best practice for string tokenization here?

Owner Author

The best practice is to use a dedicated library (e.g. NLTK)… do we want to complicate the tutorial?

Maybe we should.

But if we don't, this should be something really stupid (like it is now), so people don't accidentally copy-paste it thinking it's a good idea.

Collaborator

I was thinking more along the lines of https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize

WDYT?

Owner Author

OK, that's an option (but utils.simple_preprocess is better).
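For reference, the behaviour being discussed can be sketched without a gensim dependency. `simple_preprocess_sketch` below is a rough stdlib approximation of what `gensim.utils.simple_preprocess` does (lowercase, keep alphabetic tokens, drop very short or very long ones); the function name and the exact filtering rules here are illustrative, not gensim's actual implementation:

```python
import re

def simple_preprocess_sketch(doc, min_len=2, max_len=15):
    # lowercase, split on non-alphabetic characters, then drop
    # tokens outside the [min_len, max_len] length range
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

documents = ["Human machine interface for lab abc computer applications"]
texts = [simple_preprocess_sketch(d) for d in documents]
print(texts)
# [['human', 'machine', 'interface', 'for', 'lab', 'abc', 'computer', 'applications']]
```

Unlike a bare `document.lower().split()`, this also strips punctuation and digits, which is closer to what the dedicated helpers in gensim or NLTK provide.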

>>> texts = [
>>> [word for word in document.lower().split() if word not in stoplist]
>>> for document in documents
>>> ]
>>>
>>> # remove words that appear only once
>>> frequency = defaultdict(int)
Collaborator

@mpenkov mpenkov commented Mar 3, 2019

I know this isn't part of the PR, but it may be better to simplify this section as:

# untested code but "should" work
import collections
frequency = collections.Counter()
for text in texts:
    frequency.update(text)

Owner Author

Fine with me.
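The Counter-based suggestion above can be filled out into a runnable snippet, including the tutorial's follow-up step of dropping words that appear only once (the toy `texts` corpus here is a made-up stand-in for the tutorial's documents):

```python
import collections

# toy tokenized corpus standing in for the tutorial's `texts`
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system"],
    ["eps", "user", "interface", "system"],
]

# count every token across all documents
frequency = collections.Counter()
for text in texts:
    frequency.update(text)

# remove words that appear only once
texts = [[token for token in text if frequency[token] > 1] for text in texts]
print(texts)
# [['interface', 'computer'], ['user', 'computer', 'system'], ['user', 'interface', 'system']]
```

Compared to the `defaultdict(int)` version with a nested counting loop, `Counter.update` does the per-document tallying in one call.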

@@ -207,8 +214,11 @@ Similarly, to construct the dictionary without loading all texts into memory:
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
Collaborator

@mpenkov mpenkov commented Mar 3, 2019

Can we use a real corpus here? Also, we may want to show the best practice for string tokenization here.

That way, we can doctest this file.
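The streaming idea behind that one-liner can be sketched with the stdlib alone: assign integer ids to tokens one line at a time, so only the current line is ever in memory. This is only an approximation of what `corpora.Dictionary` does when fed a generator, using an in-memory `StringIO` stand-in for `mycorpus.txt`:

```python
import io

# in-memory stand-in for 'mycorpus.txt'; with a real file you would
# iterate over open('mycorpus.txt') instead
mycorpus = io.StringIO("Human computer interaction\nComputer system survey\n")

# stream line by line: only one line of the corpus is in memory at a time,
# which is the point of passing a generator to corpora.Dictionary
token2id = {}
for line in mycorpus:
    for token in line.lower().split():
        token2id.setdefault(token, len(token2id))

print(token2id)
# {'human': 0, 'computer': 1, 'interaction': 2, 'system': 3, 'survey': 4}
```

`corpora.Dictionary` additionally tracks document and token frequencies, but the memory behaviour is the same: the whole corpus never needs to fit in RAM.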

Collaborator

@mpenkov mpenkov left a comment

Looks good, left some minor comments about potential improvements. If they're out of scope for the current PR, it may be good to capture them in a separate issue so that they get done eventually. Let me know what you think.

@mpenkov
Collaborator

mpenkov commented Mar 20, 2019

There's a ton of improvements we could make to the tutorials, but they're out of scope for this PR. I opened #2424 instead.

@mpenkov mpenkov merged commit 3fa7cd5 into develop Mar 20, 2019
@mpenkov mpenkov deleted the piskvorky-patch-1 branch March 20, 2019 05:59