[WIP] Phrases: make any2utf8 optional #1413
Closed
41 commits
b56b801
[TDD] test for Phrases ave load
089bad3
any2utf8 before save only
c0a1a79
any2utf8 on entire sentence instead of each words separately
180d278
pep8 fixes
2cbb840
rsolved python3 byte error
f87a6a1
resolving python3 error
4445ac2
resolving python3 error
89fe6ad
resolving python3 bytestring error
82dbbf9
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
03acd0b
add assert for length after any2utf8
d8dc744
pep8 fixes
ad69bd5
delimiter not punctuation now
c7a885d
pep8 fixes
prakhar2b 34ded9b
phrases optimization benchmark
aebe1c4
[TDD] remove valueError test for min_count
02aa404
[TDD] test for recode_to_utf8 False
13f26ae
any2utf8 optional in phrases
22c9fbe
refactor codes for delimiter unicode conversion when no any2utf8
4c6d8bb
pep8 fixes
aa5e5c3
updated benchmark for recode_to_utf8=False
9264329
recode_to_utf8 for both bytestring and unicode input
b81817c
added recode_to_utf8 to docstring
db376eb
detect encoding of corpus using next and iter
f103f8f
[TDD] test that phrases works for both bytestring and unicode input f…
580504a
[TDD] test that phraser works for both bytestring and unicode input f…
6088bf7
add support for both bytestring and unicode input for recode_to_utf8=…
a3fd479
corrected docstring for recode_to_utf8
5b70ec9
pep8 fixes
a8a0004
detect encoding of input stream using next and iter
3bd4c03
check for empty sentences before checking for encoding
d1771df
removed check and test for bad parameter
05a24d1
docstring and comments modified
a28ef32
put is_nput_bytes and encoding check in learn_vocab instead of init
86fde36
updated docstring for recode_to_utf8
16c4696
[TDD] failing test for empty list or generator as input
dfcde96
raises valueError for empty list or generator as input
c0d17c4
empty sentence not a special case, no exception or warning now
3c2e1cd
specific exception and debug log added for empty list/generator input
7792e09
converted the streamed iterable to an in-memory list for benchmark
0e4d862
modified debug message for empty list input
302f7f3
no implicit conversion for infer input if recode_to_utf8=False
A few issues here:
1. You are trying to split a bytestring (the result of the `any2utf8` call) by `'_'`. This will not work in Python 3+, because string literals are unicode by default. You've faced similar problems previously, so I think it would be helpful to understand character encodings at a conceptual level, and the differences between string handling in Python 2 and 3.
2. Simply `sentences = utils.any2utf8(u'_'.join(sentence)).split(b'_')` would be enough; there is no need for the extra `[w for w in ...]`.
3. We're not accounting for the possibility that a word in the sentence contains `'_'` here. It would be wrong to make implicit assumptions like these about user input, unless there was an explicit constraint in the API. Escaping could be an option, although I'm not sure it is feasible, performance-wise.
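To make the first and third points concrete, here is a minimal Python 3 sketch. It does not import gensim: `to_utf8` is a hypothetical stand-in written for this illustration that only approximates what `utils.any2utf8` does (return UTF-8 bytes), and `recode_sentence` is likewise an assumed helper name, not code from this PR.

```python
def to_utf8(text, encoding="utf8", errors="strict"):
    """Hypothetical stand-in for utils.any2utf8: always return UTF-8 bytes."""
    if isinstance(text, str):
        return text.encode("utf8")
    # Bytes input: decode from the given encoding, then re-encode as UTF-8.
    return text.decode(encoding, errors=errors).encode("utf8")


def recode_sentence(sentence, delimiter=u"_"):
    """Join the tokens, recode once, then split with a *bytes* delimiter."""
    joined = to_utf8(delimiter.join(sentence))
    return joined.split(delimiter.encode("utf8"))


sentence = [u"graph", u"minors", u"survey"]
joined = to_utf8(u"_".join(sentence))

# Point 1: on Python 3, splitting bytes by a str delimiter raises TypeError.
try:
    joined.split("_")
    split_failed = False
except TypeError:
    split_failed = True

# Splitting with a bytes delimiter works as intended.
words = recode_sentence(sentence)

# Point 3: a token that already contains the delimiter is silently broken
# apart, so the output has more tokens than the input.
broken = recode_sentence([u"new_york", u"city"])
```

The last call shows why the join-and-split shortcut is only safe if the API guarantees that tokens never contain the delimiter; comparing the token count before and after recoding is one cheap way to detect the collision.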