[WIP] Phrases: make any2utf8 optional #1413
Closed
41 commits (changes shown from 36)
- `b56b801` [TDD] test for Phrases ave load
- `089bad3` any2utf8 before save only
- `c0a1a79` any2utf8 on entire sentence instead of each words separately
- `180d278` pep8 fixes
- `2cbb840` rsolved python3 byte error
- `f87a6a1` resolving python3 error
- `4445ac2` resolving python3 error
- `89fe6ad` resolving python3 bytestring error
- `82dbbf9` Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
- `03acd0b` add assert for length after any2utf8
- `d8dc744` pep8 fixes
- `ad69bd5` delimiter not punctuation now
- `c7a885d` pep8 fixes
- `34ded9b` phrases optimization benchmark
- `aebe1c4` [TDD] remove valueError test for min_count
- `02aa404` [TDD] test for recode_to_utf8 False
- `13f26ae` any2utf8 optional in phrases
- `22c9fbe` refactor codes for delimiter unicode conversion when no any2utf8
- `4c6d8bb` pep8 fixes
- `aa5e5c3` updated benchmark for recode_to_utf8=False
- `9264329` recode_to_utf8 for both bytestring and unicode input
- `b81817c` added recode_to_utf8 to docstring
- `db376eb` detect encoding of corpus using next and iter
- `f103f8f` [TDD] test that phrases works for both bytestring and unicode input f…
- `580504a` [TDD] test that phraser works for both bytestring and unicode input f…
- `6088bf7` add support for both bytestring and unicode input for recode_to_utf8=…
- `a3fd479` corrected docstring for recode_to_utf8
- `5b70ec9` pep8 fixes
- `a8a0004` detect encoding of input stream using next and iter
- `3bd4c03` check for empty sentences before checking for encoding
- `d1771df` removed check and test for bad parameter
- `05a24d1` docstring and comments modified
- `a28ef32` put is_nput_bytes and encoding check in learn_vocab instead of init
- `86fde36` updated docstring for recode_to_utf8
- `16c4696` [TDD] failing test for empty list or generator as input
- `dfcde96` raises valueError for empty list or generator as input
- `c0d17c4` empty sentence not a special case, no exception or warning now
- `3c2e1cd` specific exception and debug log added for empty list/generator input
- `7792e09` converted the streamed iterable to an in-memory list for benchmark
- `0e4d862` modified debug message for empty list input
- `302f7f3` no implicit conversion for infer input if recode_to_utf8=False
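The thread running through these commits is making the UTF-8 recoding step in `Phrases` optional via a `recode_to_utf8` flag. The following is a hypothetical, simplified sketch of how such a flag could gate per-token recoding during the vocab scan; it is not gensim's actual implementation, and `learn_vocab` here is a stand-alone illustration:

```python
from collections import defaultdict


def learn_vocab(sentences, recode_to_utf8=True, delimiter=b"_"):
    """Count unigrams and bigrams, optionally recoding tokens to UTF-8 bytes.

    Hypothetical sketch loosely modeled on the recode_to_utf8 flag this PR
    adds; not gensim's actual code.
    """
    vocab = defaultdict(int)
    for sentence in sentences:
        if recode_to_utf8:
            # Normalize every token to a UTF-8 bytestring up front.
            sentence = [w.encode("utf8") if isinstance(w, str) else w
                        for w in sentence]
        # When not recoding, match the delimiter's type to the tokens' type
        # so that join() works for both bytes and unicode input.
        sep = delimiter
        if sentence and isinstance(sentence[0], str):
            sep = delimiter.decode("utf8")
        for first, second in zip(sentence, sentence[1:]):
            vocab[first] += 1
            vocab[sep.join((first, second))] += 1
        if sentence:
            vocab[sentence[-1]] += 1
    return vocab


print(learn_vocab([["new", "york", "city"]])[b"new_york"])                        # 1 (bytes keys)
print(learn_vocab([["new", "york", "city"]], recode_to_utf8=False)["new_york"])   # 1 (str keys)
```

The point of the flag is visible in the key types: with recoding on, all vocab keys are bytestrings regardless of input; with it off, the caller's token type is trusted and the per-token `encode` call is skipped entirely.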
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Performance improvement in Phrases module\n",
    "\n",
    "#### Author - Prakhar Pratyush (@prakhar2b)\n",
    "[Google summer of code '17 live blog](https://rare-technologies.com/google-summer-of-code-2017-live-blog-performance-improvement-in-gensim-and-fasttext/)\n",
    "\n",
    "| Optimization | Python 2.7 | Python 3.6 | PR |\n",
    "| ------------- |:-------------:|:------------:|:---:|\n",
    "| original | ~36-38 sec | ~32-34 sec | |\n",
    "| recode_to_utf8=False | ~19-21 sec | ~20-22 sec | [#1413](https://github.com/RaRe-Technologies/gensim/pull/1413) |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python 3.6.1 :: Anaconda 4.4.0 (64-bit)\r\n"
     ]
    }
   ],
   "source": [
    "! python --version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2017-06-29 13:19:26,967 : INFO : 'pattern' package not found; tag filters are not available for English\n"
     ]
    }
   ],
   "source": [
    "import logging\n",
    "logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n",
    "\n",
    "import profile\n",
    "%load_ext autoreload\n",
    "\n",
    "import gensim\n",
    "from gensim.models.word2vec import Text8Corpus\n",
    "%autoreload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#! git clone https://github.com/prakhar2b/gensim.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/prakhar\r\n"
     ]
    }
   ],
   "source": [
    "!pwd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#! wget http://mattmahoney.net/dc/text8.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#! unzip text8.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/prakhar/text8\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "text8_file = os.path.abspath('text8')\n",
    "print(text8_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/home/prakhar/gensim\n"
     ]
    }
   ],
   "source": [
    "%cd gensim"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Already on 'develop'\r\n",
      "Your branch is up-to-date with 'origin/develop'.\r\n"
     ]
    }
   ],
   "source": [
    "!git checkout develop\n",
    "%autoreload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python setup.py install\n",
    "%autoreload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2017-06-29 13:13:36,521 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:13:36,524 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:14:09,283 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:14:09,284 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:14:09,385 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:14:09,387 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:14:41,861 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:14:41,863 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:14:41,974 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:14:41,976 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:15:15,134 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:15:15,135 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:15:15,238 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:15:15,240 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:15:54,512 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:15:54,513 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:15:54,612 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:15:54,615 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:16:30,985 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:16:30,986 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loop, best of 3: 33.2 s per loop\n"
     ]
    }
   ],
   "source": [
    "# currently on develop --- original code\n",
    "from gensim.models import Phrases\n",
    "bigram = Phrases(Text8Corpus(text8_file))\n",
    "%timeit bigram = Phrases(Text8Corpus(text8_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Switched to branch 'any2utf8'\r\n",
      "Your branch is up-to-date with 'origin/any2utf8'.\r\n"
     ]
    }
   ],
   "source": [
    "%autoreload\n",
    "! git checkout any2utf8\n",
    "%autoreload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python setup.py install\n",
    "%autoreload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2017-06-29 13:19:38,063 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:19:38,074 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:20:08,504 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:20:08,505 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:20:08,508 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:20:08,512 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:20:40,463 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:20:40,464 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:20:40,546 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:20:40,549 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:21:16,204 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:21:16,205 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:21:16,305 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:21:16,308 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:21:53,123 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:21:53,124 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:21:53,229 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:21:53,232 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:22:27,118 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:22:27,119 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loop, best of 3: 33.9 s per loop\n"
     ]
    }
   ],
   "source": [
    "# currently on any2utf8\n",
    "from gensim.models import Phrases\n",
    "bigram = Phrases(Text8Corpus(text8_file))\n",
    "%timeit bigram = Phrases(Text8Corpus(text8_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2017-06-29 13:25:04,268 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:25:04,275 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:25:26,068 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:25:26,070 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:25:26,187 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:25:26,189 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:25:47,507 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:25:47,508 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:25:47,621 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:25:47,625 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:26:09,264 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:26:09,266 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:26:09,386 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:26:09,389 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:26:30,828 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:26:30,829 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n",
      "2017-06-29 13:26:30,947 : INFO : collecting all words and their counts\n",
      "2017-06-29 13:26:30,950 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types\n",
      "2017-06-29 13:26:52,900 : INFO : collected 4400410 word types from a corpus of 17003506 words (unigram + bigrams) and 1701 sentences\n",
      "2017-06-29 13:26:52,901 : INFO : using 4400410 counts as vocab in Phrases<0 vocab, min_count=5, threshold=10.0, max_vocab_size=40000000>\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 loop, best of 3: 21.4 s per loop\n"
     ]
    }
   ],
   "source": [
    "from gensim.models import Phrases\n",
    "bigram = Phrases(Text8Corpus(text8_file), recode_to_utf8=False)\n",
    "%timeit bigram = Phrases(Text8Corpus(text8_file), recode_to_utf8=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
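The ~33 s vs ~21 s gap in the benchmark above comes from skipping the per-token UTF-8 recode when `recode_to_utf8=False`. A small, self-contained illustration of what that per-token encode costs (the token list and repetition count are made up, not taken from text8):

```python
import timeit

# A made-up token stream, roughly the shape of tokenized text input.
tokens = ["anarchism", "originated", "as", "a", "term", "of", "abuse"] * 10_000


def with_recode():
    # What any2utf8-style normalization does per token: str -> UTF-8 bytes.
    return [t.encode("utf8") for t in tokens]


def without_recode():
    # recode_to_utf8=False skips the per-token encode entirely;
    # the copy here stands in for simply passing tokens through.
    return list(tokens)


recode_time = timeit.timeit(with_recode, number=3)
skip_time = timeit.timeit(without_recode, number=3)
print(f"with recode: {recode_time:.3f}s, without: {skip_time:.3f}s")
```

The absolute numbers depend on the machine, but the encode path does strictly more work per token, which is consistent with the direction of the speedup reported in the table at the top of the notebook.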
Better convert the streamed iterable to an in-memory list (using `list()`), it's small enough. That way we don't have to iterate over the file from disk every time. This will make the benchmark conclusions stronger (less noise and delays from other, unrelated parts of the code, IO overhead, etc.).