Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Phraser memory usage (drop frequencies) #2208

Merged
merged 12 commits into from
Jan 11, 2019
12 changes: 10 additions & 2 deletions gensim/models/phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,6 +208,10 @@ def load(cls, *args, **kwargs):
"""
model = super(PhrasesTransformation, cls).load(*args, **kwargs)
# update older models
# if value in phrasegrams dict is a tuple, load only the scores.
if len(model.__dict__['phrasegrams']):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if model.phrasegrams: more Pythonic.

if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More readable:

if model.phrasegrams:
    first_value = list(model.phrasegrams.values())[0]
    if isinstance(first_value, tuple):
        …

model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items())
Copy link
Owner

@piskvorky piskvorky Nov 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to read. Can you rephrase this with normal dict access syntax, and using better variable names? (not k and v and v[1]).

Also, changing the dict at the same time you're iterating over it sounds tricky. It may be safer to iterate over each value, and if it's a tuple, change it right away (in-place assignment, without a full dict copy).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there be any case where some values are tuple and some are not?
If not, we can check only one instance and based on that modify the entire dictionary.

Copy link
Owner

@piskvorky piskvorky Nov 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so -- all values should always be in the same format (all old, all new).

My suggestion was more around being careful with modify-while-iterating and creating-a-copy-of-large-dict. So whether you check each value or check just once makes little difference.

In fact, per-value checks may be a bit cleaner, because there's no need for the special case of "dictionary is empty", or materializing all the values into a list only to pick the first one.

# if no scoring parameter, use default scoring
if not hasattr(model, 'scoring'):
logger.info('older version of %s loaded without scoring function', cls.__name__)
Expand Down Expand Up @@ -805,7 +809,7 @@ def __init__(self, phrases_model):
for bigram, score in phrases_model.export_phrases(corpus, self.delimiter, as_tuples=True):
if bigram in self.phrasegrams:
logger.info('Phraser repeat %s', bigram)
self.phrasegrams[bigram] = (phrases_model.vocab[self.delimiter.join(bigram)], score)
self.phrasegrams[bigram] = score
count += 1
if not count % 50000:
logger.info('Phraser added %i phrasegrams', count)
Expand Down Expand Up @@ -848,7 +852,11 @@ def score_item(self, worda, wordb, components, scorer):

"""
try:
return self.phrasegrams[tuple(components)][1]
score = self.phrasegrams[tuple(components)]
if isinstance(score, tuple):
Copy link
Owner

@piskvorky piskvorky Nov 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to be checking the object version at run-time.

We want to check during load and convert to new object format if necessary right during loading.

Please have a look at LdaModel.load for an example of "conditional" loading for backward compatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any better approach other than to look at the type of values() in phrasegrams dict of the object?

Copy link
Owner

@piskvorky piskvorky Nov 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think that's fine: look at the type when loading the object, and if it's the "old" type (tuple), convert to the new type (floats) right away. Then at runtime, we continue working with the new type only.

return score[1]
else:
return score
except KeyError:
return -1

Expand Down
Binary file added gensim/test/test_data/phraser_model_3dot6
Binary file not shown.
11 changes: 10 additions & 1 deletion gensim/test/test_phrases.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,6 @@
import unittest

import six

import numpy as np

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the blank please. We separate third party modules from internal modules by a blank line (another visual block).

The blank line between six and numpy shouldn't be there though :)

from gensim.utils import to_unicode
Expand Down Expand Up @@ -646,6 +645,16 @@ def testEncoding(self):
self.assertTrue(isinstance(transformed, six.text_type))


class TestPhraserModelCompatibilty(unittest.TestCase):

def testCompatibilty(self):
bigram_loaded = Phraser.load(datapath("phraser_model_3dot6"))
test_sentences = [u'trees', u'graph', u'minors']
prev_ver = bigram_loaded[test_sentences]
expected_res = ['trees_graph', 'minors']
self.assertEqual(prev_ver, expected_res)


if __name__ == '__main__':
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
unittest.main()