Make any2utf8 optional in Phrases #1454

piskvorky · 2017-06-28T06:29:54Z

The any2utf8 conversion of input sentences into bytestrings turns out to be a bottleneck when training Phrases in pure Python.

The any2utf8 function does a unicode=>utf8 conversion on each word if the word is unicode, or a full utf8=>unicode=>utf8 conversion if it's a bytestring (code).

This PR makes that conversion optional, by adding a new recode_to_utf8 parameter. When True (default), this is the old behaviour, no change. When False, use the raw sentences as supplied by the user, perform no conversions at all (we expect the words are already bytestrings).

This PR is untested and needs reviewing / polishing. I wrote it to demonstrate what I mean by a cleaner solution, as the solution in PR #1413 was too convoluted.

prakhar2b · 2017-06-28T06:57:13Z

@piskvorky I already updated the PR #1413 with these changes as asked by @menshikh-iv
I have also added a test for this case , and it is working as expected now after lots of debugging

Also, I needed to mention, this would be my biggest takeaway from this gsoc - leading by example. Thanks 😄

piskvorky · 2017-06-29T08:23:04Z

@prakhar2b Yeah, there was a lot of confusion and wasted effort here, unfortunately. Sorry about that.

Let's work toward making the rest of the project more targeted, clearer and impactful ⚒️.

gojomo · 2017-06-29T10:40:38Z

As a matter of API/state complexity, it might make more sense to just say the class itself does no re-encoding, but will be most memory efficient if providing compact bytestring tokens - that is, make the default/contract free of this issue. Then, provide a utility function/generator for doing the encoding-juggling, if desired, and mention it with an example in the docstring.

This isn't quite as good for backward compatibility – people can't ignore the change, call as before, and get the same results. But perhaps there will be such big changes here, given other optimizations and possible new options (like approximate counting or the CommonGrams variant in another issue/PR) that they should expect to need to update their code. Perhaps the class should even be renamed to indicate a clear break with the past and new functionality. (I've always found the literal name Phrases a little odd, since it's a bunch more policy & state than just a list of phrases, and the bigrams it creates are actually something with a tighter meaning than the generic 'phrase' – not just grammatical units but specifically those that occur so frequently they should be analyzed together.) Maybe the common API is in a PhraseModel and then implementors/subclasses are named by their strategy/options.

piskvorky · 2017-06-29T12:58:17Z

The new implementation will use hashing so none of this should matter. I agree we should call it something else (Collocations?), and then phase the old Phrases out, to avoid confusion.

For now, I think keeping backward compatibility in Phrases is safer. I'm not opposed to just dropping the recoding altogether, but it seems too drastic for such a tiny change right now.

menshikh-iv · 2017-07-05T06:57:58Z

Continued in #1413

make any2utf8 optional in Phrases

96d32d4

prakhar2b mentioned this pull request Jun 29, 2017

[WIP] Phrases: make any2utf8 optional #1413

Closed

menshikh-iv closed this Jul 5, 2017

menshikh-iv deleted the any2utf8_phrases branch October 6, 2017 10:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make any2utf8 optional in Phrases #1454

Make any2utf8 optional in Phrases #1454

piskvorky commented Jun 28, 2017 •

edited

Loading

prakhar2b commented Jun 28, 2017 •

edited

Loading

piskvorky commented Jun 29, 2017 •

edited

Loading

gojomo commented Jun 29, 2017

piskvorky commented Jun 29, 2017 •

edited

Loading

menshikh-iv commented Jul 5, 2017

Make any2utf8 optional in Phrases #1454

Make any2utf8 optional in Phrases #1454

Conversation

piskvorky commented Jun 28, 2017 • edited Loading

prakhar2b commented Jun 28, 2017 • edited Loading

piskvorky commented Jun 29, 2017 • edited Loading

gojomo commented Jun 29, 2017

piskvorky commented Jun 29, 2017 • edited Loading

menshikh-iv commented Jul 5, 2017

piskvorky commented Jun 28, 2017 •

edited

Loading

prakhar2b commented Jun 28, 2017 •

edited

Loading

piskvorky commented Jun 29, 2017 •

edited

Loading

piskvorky commented Jun 29, 2017 •

edited

Loading