[WIP] Phrases: make any2utf8 optional #1413

Closed
wants to merge 41 commits into from
Changes from 7 commits
41 commits
b56b801
[TDD] test for Phrases ave load
Jun 14, 2017
089bad3
any2utf8 before save only
Jun 14, 2017
c0a1a79
any2utf8 on entire sentence instead of each words separately
Jun 21, 2017
180d278
pep8 fixes
Jun 21, 2017
2cbb840
resolved python3 byte error
Jun 21, 2017
f87a6a1
resolving python3 error
Jun 21, 2017
4445ac2
resolving python3 error
Jun 21, 2017
89fe6ad
resolving python3 bytestring error
Jun 23, 2017
82dbbf9
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Jun 23, 2017
03acd0b
add assert for length after any2utf8
Jun 23, 2017
d8dc744
pep8 fixes
Jun 23, 2017
ad69bd5
delimiter not punctuation now
Jun 26, 2017
c7a885d
pep8 fixes
prakhar2b Jun 26, 2017
34ded9b
phrases optimization benchmark
Jun 27, 2017
aebe1c4
[TDD] remove valueError test for min_count
Jun 28, 2017
02aa404
[TDD] test for recode_to_utf8 False
Jun 28, 2017
13f26ae
any2utf8 optional in phrases
Jun 28, 2017
22c9fbe
refactor codes for delimiter unicode conversion when no any2utf8
Jun 28, 2017
4c6d8bb
pep8 fixes
Jun 28, 2017
aa5e5c3
updated benchmark for recode_to_utf8=False
Jun 29, 2017
9264329
recode_to_utf8 for both bytestring and unicode input
Jun 29, 2017
b81817c
added recode_to_utf8 to docstring
Jun 29, 2017
db376eb
detect encoding of corpus using next and iter
Jun 29, 2017
f103f8f
[TDD] test that phrases works for both bytestring and unicode input f…
Jun 29, 2017
580504a
[TDD] test that phraser works for both bytestring and unicode input f…
Jun 29, 2017
6088bf7
add support for both bytestring and unicode input for recode_to_utf8=…
Jun 29, 2017
a3fd479
corrected docstring for recode_to_utf8
Jun 29, 2017
5b70ec9
pep8 fixes
Jun 29, 2017
a8a0004
detect encoding of input stream using next and iter
Jun 29, 2017
3bd4c03
check for empty sentences before checking for encoding
Jun 29, 2017
d1771df
removed check and test for bad parameter
Jun 30, 2017
05a24d1
docstring and comments modified
Jun 30, 2017
a28ef32
put is_nput_bytes and encoding check in learn_vocab instead of init
Jun 30, 2017
86fde36
updated docstring for recode_to_utf8
Jun 30, 2017
16c4696
[TDD] failing test for empty list or generator as input
Jun 30, 2017
dfcde96
raises valueError for empty list or generator as input
Jun 30, 2017
c0d17c4
empty sentence not a special case, no exception or warning now
Jul 6, 2017
3c2e1cd
specific exception and debug log added for empty list/generator input
Jul 6, 2017
7792e09
converted the streamed iterable to an in-memory list for benchmark
Jul 6, 2017
0e4d862
modified debug message for empty list input
Jul 11, 2017
302f7f3
no implicit conversion for infer input if recode_to_utf8=False
Jul 11, 2017
6 changes: 4 additions & 2 deletions gensim/models/phrases.py
@@ -169,7 +169,9 @@ def learn_vocab(sentences, max_vocab_size, delimiter=b'_', progress_per=10000):
             if sentence_no % progress_per == 0:
                 logger.info("PROGRESS: at sentence #%i, processed %i words and %i word types" %
                             (sentence_no, total_words, len(vocab)))
-            sentence = [utils.any2utf8(w) for w in sentence]
+
+            sentence = [w for w in (utils.any2utf8(u'_'.join(sentence)).split('_'))]
Review comment (Contributor):

A few issues here:

  1. You are splitting a bytestring (the result of the any2utf8 call) by '_'. This will not work in Python 3, where string literals are unicode by default and bytes.split() with a str separator raises a TypeError; the separator has to be b'_'. You've faced similar problems previously, so I think it would help to understand character encodings at a conceptual level, and the differences between string handling in Python 2 and 3.

  2. Simply sentence = utils.any2utf8(u'_'.join(sentence)).split(b'_') would be enough. split() already returns a list, so there is no need for the extra [w for w in ...].

  3. We're not accounting for the possibility that a word in the sentence contains '_' here. For example, ['new_york', 'city'] would come back as three tokens instead of two. It would be wrong to make implicit assumptions like these about user input, unless there was an explicit constraint in the API. Escaping could be an option, although I'm not sure it is feasible performance-wise.
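
To make points 1 and 3 concrete, here is a minimal sketch that can be run outside gensim. The helper to_utf8 is a hypothetical stand-in for utils.any2utf8 (assumed here only to return UTF-8 bytes), and encode_per_word / encode_joined are illustrative names, not functions from the PR.

```python
def to_utf8(text):
    """Hypothetical stand-in for gensim's utils.any2utf8: return UTF-8 bytes."""
    if isinstance(text, bytes):
        return text
    return text.encode("utf8")


def encode_per_word(sentence):
    """Original approach: one encode call per word."""
    return [to_utf8(w) for w in sentence]


def encode_joined(sentence, sep=u"_"):
    """Approach from the diff, but splitting by a bytes separator."""
    return to_utf8(sep.join(sentence)).split(to_utf8(sep))


# Point 1: on Python 3, splitting bytes by a str separator raises TypeError.
joined = to_utf8(u"_".join([u"graph", u"minors"]))
try:
    joined.split("_")
    split_with_str_failed = False
except TypeError:
    split_with_str_failed = True

# Point 3: a word that already contains the separator is broken apart.
sentence = [u"new_york", u"city"]
per_word = encode_per_word(sentence)
via_join = encode_joined(sentence)

print(split_with_str_failed)  # True on Python 3
print(per_word)               # [b'new_york', b'city'] on Python 3
print(via_join)               # [b'new', b'york', b'city'] on Python 3
```

The join-and-split variant only matches the per-word result when no word contains the separator, which is the implicit assumption point 3 objects to.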


             for bigram in zip(sentence, sentence[1:]):
                 vocab[bigram[0]] += 1
                 vocab[delimiter.join(bigram)] += 1
@@ -227,7 +229,7 @@ def export_phrases(self, sentences, out_delimiter=b' ', as_tuples=False):
             then you can debug the threshold with generated tsv
         """
         for sentence in sentences:
-            s = [utils.any2utf8(w) for w in sentence]
+            s = [w for w in (utils.any2utf8(u'_'.join(sentence)).split('_'))]
             last_bigram = False
             vocab = self.vocab
             threshold = self.threshold
1 change: 1 addition & 0 deletions gensim/test/test_phrases.py
@@ -165,6 +165,7 @@ def testPruning(self):
"""Test that max_vocab_size parameter is respected."""
bigram = Phrases(sentences, max_vocab_size=5)
self.assertTrue(len(bigram.vocab) <= 5)

#endclass TestPhrasesModel

