Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make any2utf8 optional in Phrases #1454

Closed
wants to merge 1 commit into from
Closed

Conversation

piskvorky
Copy link
Owner

@piskvorky piskvorky commented Jun 28, 2017

The any2utf8 conversion of input sentences into bytestrings turns out to be a bottleneck when training Phrases in pure Python.

The any2utf8 function does a unicode=>utf8 conversion on each word if the word is unicode, or a full utf8=>unicode=>utf8 conversion if it's a bytestring (code).

This PR makes that conversion optional, by adding a new recode_to_utf8 parameter. When True (default), this is the old behaviour, no change. When False, use the raw sentences as supplied by the user, perform no conversions at all (we expect the words are already bytestrings).

This PR is untested and needs reviewing / polishing. I wrote it to demonstrate what I mean by a cleaner solution, as the solution in PR #1413 was too convoluted.

@prakhar2b
Copy link
Contributor

prakhar2b commented Jun 28, 2017

@piskvorky I already updated the PR #1413 with these changes as asked by @menshikh-iv
I have also added a test for this case , and it is working as expected now after lots of debugging

Also, I needed to mention, this would be my biggest takeaway from this gsoc - leading by example. Thanks 😄

@piskvorky
Copy link
Owner Author

piskvorky commented Jun 29, 2017

@prakhar2b Yeah, there was a lot of confusion and wasted effort here, unfortunately. Sorry about that.

Let's work toward making the rest of the project more targeted, clearer and impactful ⚒️.

@gojomo
Copy link
Collaborator

gojomo commented Jun 29, 2017

As a matter of API/state complexity, it might make more sense to just say the class itself does no re-encoding, but will be most memory efficient if providing compact bytestring tokens - that is, make the default/contract free of this issue. Then, provide a utility function/generator for doing the encoding-juggling, if desired, and mention it with an example in the docstring.

This isn't quite as good for backward compatibility – people can't ignore the change, call as before, and get the same results. But perhaps there will be such big changes here, given other optimizations and possible new options (like approximate counting or the CommonGrams variant in another issue/PR) that they should expect to need to update their code. Perhaps the class should even be renamed to indicate a clear break with the past and new functionality. (I've always found the literal name Phrases a little odd, since it's a bunch more policy & state than just a list of phrases, and the bigrams it creates are actually something with a tighter meaning than the generic 'phrase' – not just grammatical units but specifically those that occur so frequently they should be analyzed together.) Maybe the common API is in a PhraseModel and then implementors/subclasses are named by their strategy/options.

@piskvorky
Copy link
Owner Author

piskvorky commented Jun 29, 2017

The new implementation will use hashing so none of this should matter. I agree we should call it something else (Collocations?), and then phase the old Phrases out, to avoid confusion.

For now, I think keeping backward compatibility in Phrases is safer. I'm not opposed to just dropping the recoding altogether, but it seems too drastic for such a tiny change right now.

@menshikh-iv
Copy link
Contributor

Continued in #1413

@menshikh-iv menshikh-iv closed this Jul 5, 2017
@menshikh-iv menshikh-iv deleted the any2utf8_phrases branch October 6, 2017 10:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants