Make any2utf8 optional in Phrases #1454
Closed
The `any2utf8` conversion of input sentences into bytestrings turns out to be a bottleneck when training `Phrases` in pure Python.

The `any2utf8` function does a `unicode=>utf8` conversion on each word if the word is unicode, or a full `utf8=>unicode=>utf8` conversion if it's a bytestring (code).

This PR makes that conversion optional, by adding a new `recode_to_utf8` parameter. When True (default), this is the old behaviour, no change. When False, use the raw sentences as supplied by the user, perform no conversions at all (we expect the words are already bytestrings).

This PR is untested and needs reviewing / polishing. I wrote it to demonstrate what I mean by a cleaner solution, as the solution in PR #1413 was too convoluted.
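
For reviewers, here is a rough, self-contained sketch of the behaviour described above. It is not the code in this PR or in gensim: `to_utf8` and `prepare_sentence` are hypothetical names used only for illustration, and the sketch is written for Python 3, where "unicode" means `str` and "bytestring" means `bytes`.

```python
def to_utf8(text, encoding="utf8", errors="strict"):
    """Illustrative stand-in for the per-word conversion (hypothetical helper)."""
    if isinstance(text, str):
        # unicode => utf8
        return text.encode("utf8")
    # bytestring: full round trip (bytes => unicode => utf8) to validate the input
    return bytes(text).decode(encoding, errors).encode("utf8")


def prepare_sentence(sentence, recode_to_utf8=True):
    """Return the words of one sentence, optionally recoded to utf8 bytestrings.

    `recode_to_utf8` mirrors the parameter proposed in this PR; the function
    itself is only an assumption about how the flag might be applied.
    """
    if recode_to_utf8:
        # Old behaviour: convert every word, which costs time per token.
        return [to_utf8(word) for word in sentence]
    # New option: trust the caller and use the sentence as supplied.
    return sentence


raw = [b"new", b"york"]
converted = prepare_sentence(["new", b"york"])                # words are recoded
untouched = prepare_sentence(raw, recode_to_utf8=False)       # no conversion at all
```

With `recode_to_utf8=False` the per-word work disappears entirely, which is where the speed-up would come from, at the cost of no longer validating the caller's input.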