-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce Phraser
memory usage (drop frequencies)
#2208
Conversation
Hello, thanks for PR @jenishah, can you add test to check than this change are compatible with old phrases? I mean you need to train & save phrases with old gensim code (3.6.0 without your change) and load with your new code and check than this works as expected. |
phrasegrams
)
I performed this test in my system, and it works fine. Is that enough, or am I missing something? |
@jenishah you should add it as a test here ;) Just add the old model file to |
Ping @jenishah, are you planning to finish PR? |
Yes, I will do it this weekend. |
gensim/test/test_phrases.py
Outdated
class TestPhraserModelCompatibilty(unittest.TestCase): | ||
|
||
def TestCompatibilty(self): | ||
with temporary_file("phraser_model_3dot6") as fpath: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a temporary file. You should use datapath
for retrieve path to model.
gensim/test/test_phrases.py
Outdated
prev_ver = bigram_loaded(test_sentences) | ||
phrase_model = Phrases(test_sentences, threshold=1) | ||
bigram = Phraser(phrase_model) | ||
curr_ver = bigram(test_sentences) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but Phraser
isn't callable, [test_sentences]
should be used here (same for 655). How this pass tests?
2f0fcc8
to
4a96eab
Compare
gensim/models/phrases.py
Outdated
@@ -805,7 +805,7 @@ def __init__(self, phrases_model): | |||
for bigram, score in phrases_model.export_phrases(corpus, self.delimiter, as_tuples=True): | |||
if bigram in self.phrasegrams: | |||
logger.info('Phraser repeat %s', bigram) | |||
self.phrasegrams[bigram] = (phrases_model.vocab[self.delimiter.join(bigram)], score) | |||
self.phrasegrams[bigram] = (None, score) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1: this doesn't solve the problem, tuples use a lot of memory (as does the extra None pointer). 80 extra bytes per entry, on my machine.
If you want to keep backward compatibility, check the dict format in load()
and convert from tuple to a simple float if necessary.
gensim/test/test_phrases.py
Outdated
@@ -14,7 +14,6 @@ | |||
import six | |||
|
|||
import numpy as np | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep the blank please. We separate third party modules from internal modules by a blank line (another visual block).
The blank line between six
and numpy
shouldn't be there though :)
2f0fcc8
to
bba2e46
Compare
gensim/models/phrases.py
Outdated
@@ -848,7 +848,10 @@ def score_item(self, worda, wordb, components, scorer): | |||
|
|||
""" | |||
try: | |||
return self.phrasegrams[tuple(components)][1] | |||
if list(self.phrasegrams.values())[0].__class__ is tuple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- this check is very tricky to just check that value is instance of tuple. Python has isinstance builtin function for that.
- creation of list of values is not easy operation. It would be better to avoid this.
if list(self.phrasegrams.values())[0].__class__ is tuple: | |
score = self.phrasegrams[tuple(components)] | |
return score[-1] if isinstance(score, tuple) else score |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But isinstance has to be applied to dictionary value, which requires using list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jenishah yes, and as you can see in suggested change, we are getting it and saving in score
variable. We don't need a concrete value of dict, so we can just get that value which will be returned later.
gensim/models/phrases.py
Outdated
@@ -848,7 +848,11 @@ def score_item(self, worda, wordb, components, scorer): | |||
|
|||
""" | |||
try: | |||
return self.phrasegrams[tuple(components)][1] | |||
score = self.phrasegrams[tuple(components)] | |||
if isinstance(score, tuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to be checking the object version at run-time.
We want to check during load
and convert to new object format if necessary right during loading.
Please have a look at LdaModel.load for an example of "conditional" loading for backward compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any better approach other than to look at the type of values() in phrasegrams
dict of the object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think that's fine: look at the type when loading the object, and if it's the "old" type (tuple), convert to the new type (floats) right away. Then at runtime, we continue working with the new type only.
gensim/models/phrases.py
Outdated
@@ -208,6 +208,10 @@ def load(cls, *args, **kwargs): | |||
""" | |||
model = super(PhrasesTransformation, cls).load(*args, **kwargs) | |||
# update older models | |||
# if value in phrasegrams dict is a tuple, load only the scores. | |||
if len(model.__dict__['phrasegrams']): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if model.phrasegrams:
more Pythonic.
gensim/models/phrases.py
Outdated
@@ -208,6 +208,10 @@ def load(cls, *args, **kwargs): | |||
""" | |||
model = super(PhrasesTransformation, cls).load(*args, **kwargs) | |||
# update older models | |||
# if value in phrasegrams dict is a tuple, load only the scores. | |||
if len(model.__dict__['phrasegrams']): | |||
if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More readable:
if model.phrasegrams:
first_value = list(model.phrasegrams.values())[0]
if isinstance(first_value, tuple):
…
gensim/models/phrases.py
Outdated
# if value in phrasegrams dict is a tuple, load only the scores. | ||
if len(model.__dict__['phrasegrams']): | ||
if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple): | ||
model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is hard to read. Can you rephrase this with normal dict access syntax, and using better variable names? (not k
and v
and v[1]
).
Also, changing the dict at the same time you're iterating over it sounds tricky. It may be safer to iterate over each value, and if it's a tuple, change it right away (in-place assignment, without a full dict copy).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be any case where some values are tuple and some are not?
If not, we can check only one instance and based on that modify the entire dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so -- all values should always be in the same format (all old, all new).
My suggestion was more around being careful with modify-while-iterating and creating-a-copy-of-large-dict. So whether you check each value or check just once makes little difference.
In fact, per-value checks may be a bit cleaner, because there's no need for the special case of "dictionary is empty", or materializing all the values into a list only to pick the first one.
gensim/models/phrases.py
Outdated
@@ -210,10 +210,12 @@ def load(cls, *args, **kwargs): | |||
# update older models | |||
# if value in phrasegrams dict is a tuple, load only the scores. | |||
try: | |||
if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple): | |||
model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items()) | |||
for components, scores in model.__dict__['phrasegrams'].items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this strange __dict__
access? Why not use the pattern I showed in my last review?
And what is the try
for, what KeyError
are we guarding against? Please add code comments.
gensim/models/phrases.py
Outdated
@@ -210,10 +210,12 @@ def load(cls, *args, **kwargs): | |||
# update older models | |||
# if value in phrasegrams dict is a tuple, load only the scores. | |||
try: | |||
if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple): | |||
model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items()) | |||
for components, scores in model.__dict__['phrasegrams'].items(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really work? It mutates the collection it's iterating over, which is usually a bad idea.
As mentioned previously, I'd make a copy of the original keys (not items) and iterate over that, while mutating the original (large) dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking up, good progress!
gensim/models/phrases.py
Outdated
pass | ||
if model.phrasegrams: | ||
components = model.phrasegrams.keys() | ||
for component in components: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two lines are better merged into one (so the temporary variable is released as soon as it's not needed).
gensim/models/phrases.py
Outdated
for component in components: | ||
score = model.phrasegrams[component] | ||
if isinstance(score, tuple): | ||
model.phrasegrams[component] = score[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deserves a comment: what is score[1]
?
Or even better, unroll the tuple into properly named variables (x, y, z = score
) and then assign that.
phrasegrams
)Phraser
memory usage (drop frequency from phrasegrams
)
Phraser
memory usage (drop frequency from phrasegrams
)Phraser
memory usage (drop frequencies)
thanks for work @jenishah 💪 |
Fix #2189 , removed the frequency from the
phrasregrams
dict and keeping it as a dict of float, that keeps the scores only.