Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Phraser memory usage (drop frequencies) #2208

Merged
merged 12 commits into from
Jan 11, 2019

Conversation

jenishah
Copy link
Contributor

@jenishah jenishah commented Oct 3, 2018

Fix #2189 , removed the frequency from the phrasregrams dict and keeping it as a dict of float, that keeps the scores only.

@menshikh-iv
Copy link
Contributor

Hello, thanks for PR @jenishah, can you add test to check than this change are compatible with old phrases? I mean you need to train & save phrases with old gensim code (3.6.0 without your change) and load with your new code and check than this works as expected.

@menshikh-iv menshikh-iv changed the title drop frequency from phrasegrams Reduce phrases memory usage (drop frequency from phrasegrams) Oct 4, 2018
@jenishah
Copy link
Contributor Author

jenishah commented Oct 6, 2018

Hello, thanks for PR @jenishah, can you add test to check than this change are compatible with old phrases? I mean you need to train & save phrases with old gensim code (3.6.0 without your change) and load with your new code and check than this works as expected.

I performed this test in my system, and it works fine. Is that enough, or am I missing something?

@menshikh-iv
Copy link
Contributor

@jenishah you should add it as a test here ;) Just add the old model file to gensim/test/test_data + add the test where you load this model and use it (apply to some text)

@menshikh-iv
Copy link
Contributor

Ping @jenishah, are you planning to finish PR?

@jenishah
Copy link
Contributor Author

Ping @jenishah, are you planning to finish PR?

Yes, I will do it this weekend.

class TestPhraserModelCompatibilty(unittest.TestCase):

def TestCompatibilty(self):
with temporary_file("phraser_model_3dot6") as fpath:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a temporary file. You should use datapath for retrieve path to model.

prev_ver = bigram_loaded(test_sentences)
phrase_model = Phrases(test_sentences, threshold=1)
bigram = Phraser(phrase_model)
curr_ver = bigram(test_sentences)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but Phraser isn't callable, [test_sentences] should be used here (same for 655). How this pass tests?

@@ -805,7 +805,7 @@ def __init__(self, phrases_model):
for bigram, score in phrases_model.export_phrases(corpus, self.delimiter, as_tuples=True):
if bigram in self.phrasegrams:
logger.info('Phraser repeat %s', bigram)
self.phrasegrams[bigram] = (phrases_model.vocab[self.delimiter.join(bigram)], score)
self.phrasegrams[bigram] = (None, score)
Copy link
Owner

@piskvorky piskvorky Oct 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1: this doesn't solve the problem, tuples use a lot of memory (as does the extra None pointer). 80 extra bytes per entry, on my machine.

If you want to keep backward compatibility, check the dict format in load() and convert from tuple to a simple float if necessary.

@@ -14,7 +14,6 @@
import six

import numpy as np

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the blank please. We separate third party modules from internal modules by a blank line (another visual block).

The blank line between six and numpy shouldn't be there though :)

@@ -848,7 +848,10 @@ def score_item(self, worda, wordb, components, scorer):

"""
try:
return self.phrasegrams[tuple(components)][1]
if list(self.phrasegrams.values())[0].__class__ is tuple:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this check is very tricky to just check that value is instance of tuple. Python has isinstance builtin function for that.
  2. creation of list of values is not easy operation. It would be better to avoid this.
Suggested change
if list(self.phrasegrams.values())[0].__class__ is tuple:
score = self.phrasegrams[tuple(components)]
return score[-1] if isinstance(score, tuple) else score

Copy link
Contributor Author

@jenishah jenishah Nov 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But isinstance has to be applied to dictionary value, which requires using list

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jenishah yes, and as you can see in suggested change, we are getting it and saving in score variable. We don't need a concrete value of dict, so we can just get that value which will be returned later.

@@ -848,7 +848,11 @@ def score_item(self, worda, wordb, components, scorer):

"""
try:
return self.phrasegrams[tuple(components)][1]
score = self.phrasegrams[tuple(components)]
if isinstance(score, tuple):
Copy link
Owner

@piskvorky piskvorky Nov 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to be checking the object version at run-time.

We want to check during load and convert to new object format if necessary right during loading.

Please have a look at LdaModel.load for an example of "conditional" loading for backward compatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any better approach other than to look at the type of values() in phrasegrams dict of the object?

Copy link
Owner

@piskvorky piskvorky Nov 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think that's fine: look at the type when loading the object, and if it's the "old" type (tuple), convert to the new type (floats) right away. Then at runtime, we continue working with the new type only.

@@ -208,6 +208,10 @@ def load(cls, *args, **kwargs):
"""
model = super(PhrasesTransformation, cls).load(*args, **kwargs)
# update older models
# if value in phrasegrams dict is a tuple, load only the scores.
if len(model.__dict__['phrasegrams']):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if model.phrasegrams: more Pythonic.

@@ -208,6 +208,10 @@ def load(cls, *args, **kwargs):
"""
model = super(PhrasesTransformation, cls).load(*args, **kwargs)
# update older models
# if value in phrasegrams dict is a tuple, load only the scores.
if len(model.__dict__['phrasegrams']):
if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More readable:

if model.phrasegrams:
    first_value = list(model.phrasegrams.values())[0]
    if isinstance(first_value, tuple):
        …

# if value in phrasegrams dict is a tuple, load only the scores.
if len(model.__dict__['phrasegrams']):
if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items())
Copy link
Owner

@piskvorky piskvorky Nov 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to read. Can you rephrase this with normal dict access syntax, and using better variable names? (not k and v and v[1]).

Also, changing the dict at the same time you're iterating over it sounds tricky. It may be safer to iterate over each value, and if it's a tuple, change it right away (in-place assignment, without a full dict copy).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be any case where some values are tuple and some are not?
If not, we can check only one instance and based on that modify the entire dictionary.

Copy link
Owner

@piskvorky piskvorky Nov 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so -- all values should always be in the same format (all old, all new).

My suggestion was more around being careful with modify-while-iterating and creating-a-copy-of-large-dict. So whether you check each value or check just once makes little difference.

In fact, per-value checks may be a bit cleaner, because there's no need for the special case of "dictionary is empty", or materializing all the values into a list only to pick the first one.

@@ -210,10 +210,12 @@ def load(cls, *args, **kwargs):
# update older models
# if value in phrasegrams dict is a tuple, load only the scores.
try:
if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items())
for components, scores in model.__dict__['phrasegrams'].items():
Copy link
Owner

@piskvorky piskvorky Dec 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this strange __dict__ access? Why not use the pattern I showed in my last review?

And what is the try for, what KeyError are we guarding against? Please add code comments.

@@ -210,10 +210,12 @@ def load(cls, *args, **kwargs):
# update older models
# if value in phrasegrams dict is a tuple, load only the scores.
try:
if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items())
for components, scores in model.__dict__['phrasegrams'].items():
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really work? It mutates the collection it's iterating over, which is usually a bad idea.

As mentioned previously, I'd make a copy of the original keys (not items) and iterate over that, while mutating the original (large) dict.

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking up, good progress!

pass
if model.phrasegrams:
components = model.phrasegrams.keys()
for component in components:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two lines are better merged into one (so the temporary variable is released as soon as it's not needed).

for component in components:
score = model.phrasegrams[component]
if isinstance(score, tuple):
model.phrasegrams[component] = score[1]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deserves a comment: what is score[1]?

Or even better, unroll the tuple into properly named variables (x, y, z = score) and then assign that.

@menshikh-iv menshikh-iv changed the title Reduce phrases memory usage (drop frequency from phrasegrams) Reduce Phraser memory usage (drop frequency from phrasegrams) Jan 10, 2019
@menshikh-iv menshikh-iv changed the title Reduce Phraser memory usage (drop frequency from phrasegrams) Reduce Phraser memory usage (drop frequencies) Jan 10, 2019
@menshikh-iv menshikh-iv merged commit c5a8f73 into piskvorky:develop Jan 11, 2019
@menshikh-iv
Copy link
Contributor

thanks for work @jenishah 💪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants