Reduce `Phraser` memory usage (drop frequencies) #2208

jenishah · 2018-10-03T10:21:51Z

Fixes #2189. Removes the frequency from the `phrasegrams` dict, which now maps each bigram to a float score only.

menshikh-iv · 2018-10-04T02:05:39Z

Hello, thanks for the PR @jenishah. Can you add a test to check that this change is compatible with old phrases? I mean you need to train & save phrases with the old gensim code (3.6.0, without your change), load them with your new code, and check that this works as expected.

jenishah · 2018-10-06T12:25:47Z

> Hello, thanks for the PR @jenishah. Can you add a test to check that this change is compatible with old phrases? I mean you need to train & save phrases with the old gensim code (3.6.0, without your change), load them with your new code, and check that this works as expected.

I performed this test on my system, and it works fine. Is that enough, or am I missing something?

menshikh-iv · 2018-10-06T14:07:31Z

@jenishah you should add it as a test here ;) Just add the old model file to `gensim/test/test_data` and add a test where you load this model and use it (apply it to some text).

menshikh-iv · 2018-10-12T04:10:39Z

Ping @jenishah, are you planning to finish the PR?

jenishah · 2018-10-12T05:22:47Z

> Ping @jenishah, are you planning to finish the PR?

Yes, I will do it this weekend.

menshikh-iv · 2018-10-15T04:47:28Z

gensim/test/test_phrases.py

+class TestPhraserModelCompatibilty(unittest.TestCase):
+
+    def TestCompatibilty(self):
+        with temporary_file("phraser_model_3dot6") as fpath:


This isn't a temporary file. You should use `datapath` to retrieve the path to the model.

menshikh-iv · 2018-10-15T04:51:49Z

gensim/test/test_phrases.py

+            prev_ver = bigram_loaded(test_sentences)
+            phrase_model = Phrases(test_sentences, threshold=1)
+            bigram = Phraser(phrase_model)
+            curr_ver = bigram(test_sentences)


But `Phraser` isn't callable; `[test_sentences]` should be used here (same for line 655). How does this pass the tests?

piskvorky · 2018-10-21T12:40:43Z

gensim/models/phrases.py

@@ -805,7 +805,7 @@ def __init__(self, phrases_model):
        for bigram, score in phrases_model.export_phrases(corpus, self.delimiter, as_tuples=True):
            if bigram in self.phrasegrams:
                logger.info('Phraser repeat %s', bigram)
-            self.phrasegrams[bigram] = (phrases_model.vocab[self.delimiter.join(bigram)], score)
+            self.phrasegrams[bigram] = (None, score)


-1: this doesn't solve the problem. Tuples use a lot of memory (as does the extra None pointer): 80 extra bytes per entry on my machine.

If you want to keep backward compatibility, check the dict format in load() and convert from tuple to a simple float if necessary.
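To make the overhead concrete, here is a minimal sketch that measures what the `(None, score)` layout costs on top of a bare float, using `sys.getsizeof`. The exact byte counts depend on the Python version and platform, so none are asserted here; `tuple_entry_overhead` is a hypothetical helper, not part of gensim.

```python
import sys


def tuple_entry_overhead(score=0.75):
    """Return the extra bytes a (None, score) value costs over a bare float.

    Both layouts hold one float object; the tuple layout additionally
    pays for the tuple container. None is a shared singleton, so the object
    itself adds nothing, but its pointer slot in the tuple is still counted.
    """
    as_tuple = (None, score)
    return sys.getsizeof(as_tuple)


if __name__ == "__main__":
    print("float object:", sys.getsizeof(0.75), "bytes")
    print("extra per (None, score) entry:", tuple_entry_overhead(), "bytes")
```

Multiplying the second number by the number of entries in `phrasegrams` gives a rough estimate of what dropping the tuple saves.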

piskvorky · 2018-10-21T12:41:26Z

gensim/test/test_phrases.py

@@ -14,7 +14,6 @@
 import six

 import numpy as np
-


Keep the blank line, please. We separate third-party modules from internal modules with a blank line (another visual block).

The blank line between six and numpy shouldn't be there though :)

horpto · 2018-11-08T23:55:09Z

gensim/models/phrases.py

@@ -848,7 +848,10 @@ def score_item(self, worda, wordb, components, scorer):

        """
        try:
-            return self.phrasegrams[tuple(components)][1]
+            if list(self.phrasegrams.values())[0].__class__ is tuple:


This is a very roundabout way to check that a value is an instance of tuple. Python has the built-in `isinstance` function for that.

Creating a list of all the values is not a cheap operation. It would be better to avoid it.

Suggested change

-            if list(self.phrasegrams.values())[0].__class__ is tuple:
+            score = self.phrasegrams[tuple(components)]
+            return score[-1] if isinstance(score, tuple) else score

jenishah · 2018-11-16

But `isinstance` has to be applied to a dictionary value, which requires using `list`.

horpto · 2018-11-16

@jenishah yes, and as you can see in the suggested change, we fetch that value and save it in the `score` variable. We don't need any particular value from the dict, so we can simply check the value that is going to be returned anyway.
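On the "requires using list" point, a minimal sketch showing two ways to inspect a value without building a list of all values: check the value that is being looked up anyway, or peek at one value lazily with `next(iter(...))`. The function names and dict contents are illustrative assumptions, not gensim code.

```python
def score_for(phrasegrams, components):
    """Look up a bigram score, accepting old (frequency, score) tuples or bare floats."""
    score = phrasegrams[tuple(components)]
    return score[-1] if isinstance(score, tuple) else score


def first_value_is_tuple(phrasegrams):
    """Peek at a single value lazily; no list of all values is created."""
    first_value = next(iter(phrasegrams.values()), None)
    return isinstance(first_value, tuple)


# Illustrative data: the same bigram in the old and the new layout.
old_format = {("new", "york"): (12, 8.5)}
new_format = {("new", "york"): 8.5}
```

The `None` default passed to `next` also covers the empty-dict case, so no separate length check is needed.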

piskvorky · 2018-11-22T06:22:49Z

gensim/models/phrases.py

@@ -848,7 +848,11 @@ def score_item(self, worda, wordb, components, scorer):

        """
        try:
-            return self.phrasegrams[tuple(components)][1]
+            score = self.phrasegrams[tuple(components)]
+            if isinstance(score, tuple):


We don't want to be checking the object version at run-time.

We want to check during load and convert to new object format if necessary right during loading.

Please have a look at LdaModel.load for an example of "conditional" loading for backward compatibility.

jenishah · 2018-11-22

Is there any better approach than looking at the type of the values in the object's `phrasegrams` dict?

piskvorky · 2018-11-22

No, I think that's fine: look at the type when loading the object, and if it's the "old" type (tuple), convert to the new type (floats) right away. Then at runtime, we continue working with the new type only.
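The load-time approach can be sketched as follows. This is not gensim's code: `TinyPhraser` is a hypothetical stand-in that pickles only its dict to stay self-contained. The point is that the format check happens once in `load()`, so lookups afterwards deal with floats only.

```python
import os
import pickle
import tempfile


class TinyPhraser:
    """Hypothetical stand-in for a model that holds a `phrasegrams` dict."""

    def __init__(self, phrasegrams):
        self.phrasegrams = phrasegrams

    def save(self, path):
        # Only the plain dict is pickled, to keep this sketch self-contained.
        with open(path, "wb") as fout:
            pickle.dump(self.phrasegrams, fout)

    @classmethod
    def load(cls, path):
        with open(path, "rb") as fin:
            phrasegrams = pickle.load(fin)
        # Upgrade old models once, right after loading.
        for bigram in list(phrasegrams):
            value = phrasegrams[bigram]
            if isinstance(value, tuple):
                _frequency, score = value  # old layout: (frequency, score)
                phrasegrams[bigram] = score
        return cls(phrasegrams)

    def score(self, components):
        # No format checks at lookup time: values are always floats here.
        return self.phrasegrams[tuple(components)]


def roundtrip_old_model():
    """Save a model in the old tuple layout and load it back upgraded."""
    old_model = TinyPhraser({("new", "york"): (12, 8.5)})
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "old_phraser.pkl")
        old_model.save(path)
        return TinyPhraser.load(path)
```

A backward-compatibility test then only needs to load an old file and assert that every value is a float.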

piskvorky · 2018-11-26T10:57:29Z

gensim/models/phrases.py

@@ -208,6 +208,10 @@ def load(cls, *args, **kwargs):
        """
        model = super(PhrasesTransformation, cls).load(*args, **kwargs)
        # update older models
+        # if value in phrasegrams dict is a tuple, load only the scores.
+        if len(model.__dict__['phrasegrams']):


`if model.phrasegrams:` is more Pythonic.

piskvorky · 2018-11-26T10:58:41Z

gensim/models/phrases.py

@@ -208,6 +208,10 @@ def load(cls, *args, **kwargs):
        """
        model = super(PhrasesTransformation, cls).load(*args, **kwargs)
        # update older models
+        # if value in phrasegrams dict is a tuple, load only the scores.
+        if len(model.__dict__['phrasegrams']):
+            if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):


More readable:

    if model.phrasegrams:
        first_value = list(model.phrasegrams.values())[0]
        if isinstance(first_value, tuple):
            …

piskvorky · 2018-11-26T11:00:30Z

gensim/models/phrases.py

+        # if value in phrasegrams dict is a tuple, load only the scores.
+        if len(model.__dict__['phrasegrams']):
+            if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
+                model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items())


This is hard to read. Can you rephrase this with normal dict access syntax, and using better variable names? (not k and v and v[1]).

Also, changing the dict at the same time you're iterating over it sounds tricky. It may be safer to iterate over each value, and if it's a tuple, change it right away (in-place assignment, without a full dict copy).

jenishah · 2018-11-27

Would there be any case where some values are tuples and some are not?
If not, we can check only one instance and, based on that, modify the entire dictionary.

piskvorky · 2018-11-27

I don't think so -- all values should always be in the same format (all old, all new).

My suggestion was more around being careful with modify-while-iterating and creating-a-copy-of-large-dict. So whether you check each value or check just once makes little difference.

In fact, per-value checks may be a bit cleaner, because there's no need for the special case of "dictionary is empty", or materializing all the values into a list only to pick the first one.

piskvorky · 2018-12-03T06:54:07Z

gensim/models/phrases.py

@@ -210,10 +210,12 @@ def load(cls, *args, **kwargs):
        # update older models
        # if value in phrasegrams dict is a tuple, load only the scores.
        try:
-            if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
-                model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items())
+            for components, scores in model.__dict__['phrasegrams'].items():


Why this strange __dict__ access? Why not use the pattern I showed in my last review?

And what is the try for, what KeyError are we guarding against? Please add code comments.

piskvorky · 2018-12-03T06:55:31Z

gensim/models/phrases.py

@@ -210,10 +210,12 @@ def load(cls, *args, **kwargs):
        # update older models
        # if value in phrasegrams dict is a tuple, load only the scores.
        try:
-            if isinstance(list(model.__dict__['phrasegrams'].values())[0], tuple):
-                model.__dict__['phrasegrams'].update((k, v[1]) for k, v in model.__dict__['phrasegrams'].items())
+            for components, scores in model.__dict__['phrasegrams'].items():


Does this really work? It mutates the collection it's iterating over, which is usually a bad idea.

As mentioned previously, I'd make a copy of the original keys (not items) and iterate over that, while mutating the original (large) dict.
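A minimal sketch of the keys-snapshot pattern, with hypothetical function names. In CPython, replacing the value of an existing key does not resize the dict and so does not raise by itself, but iterating over a snapshot of the keys makes the safety explicit and avoids copying the values. The second function shows the failure mode that the snapshot guards against.

```python
def upgrade_in_place(phrasegrams):
    """Convert old (frequency, score) values to bare scores without copying the dict.

    Only the keys are snapshotted; the (large) dict itself is mutated in place.
    """
    for bigram in list(phrasegrams):  # snapshot of keys, not items
        value = phrasegrams[bigram]
        if isinstance(value, tuple):
            _frequency, score = value  # old layout: (frequency, score)
            phrasegrams[bigram] = score
    return phrasegrams


def resizing_while_iterating_fails():
    """Return True if adding a key during iteration raises RuntimeError."""
    data = {"a": 1}
    try:
        for key in data:
            data[key + "x"] = 0  # adds a key, changing the dict size mid-iteration
    except RuntimeError:
        return True
    return False
```

Per-value checks also mean an empty dict and a dict that is already in the new format need no special handling.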

piskvorky

This is looking up, good progress!

piskvorky · 2018-12-04T08:17:04Z

gensim/models/phrases.py

-            pass
+        if model.phrasegrams:
+            components = model.phrasegrams.keys()
+            for component in components:


These two lines are better merged into one (so the temporary variable is released as soon as it's not needed).

piskvorky · 2018-12-04T08:18:02Z

gensim/models/phrases.py

+            for component in components:
+                score = model.phrasegrams[component]
+                if isinstance(score, tuple):
+                    model.phrasegrams[component] = score[1]


Deserves a comment: what is score[1]?

Or even better, unroll the tuple into properly named variables (x, y, z = score) and then assign that.

menshikh-iv · 2019-01-11T02:53:58Z

Thanks for your work, @jenishah 💪

menshikh-iv changed the title ~~drop frequency from phrasegrams~~ Reduce phrases memory usage (drop frequency from phrasegrams) Oct 4, 2018

menshikh-iv mentioned this pull request Oct 4, 2018

Rename vocab attribute to bigram_counts #2195

Closed

menshikh-iv suggested changes Oct 15, 2018

fix phraser memory

242c80e

jenishah force-pushed the jshah_ph_mem branch from 2f0fcc8 to 4a96eab Compare October 20, 2018 13:26

piskvorky requested changes Oct 21, 2018

reduce phraser memory

bba2e46

jenishah force-pushed the jshah_ph_mem branch from 2f0fcc8 to bba2e46 Compare October 26, 2018 04:17

horpto reviewed Nov 8, 2018

using isinstance

9f9b05f

piskvorky requested changes Nov 22, 2018

jenishah added 3 commits November 26, 2018 10:55

update model when loaded

c391fe5

update model when loaded

d154e3a

update model when loaded

40b6672

piskvorky requested changes Nov 26, 2018

updated changes

40dcbde

piskvorky requested changes Dec 3, 2018

updated changes

21c3911

piskvorky requested changes Dec 4, 2018

jenishah and others added 2 commits December 4, 2018 15:18

update changes

80e9222

Merge remote-tracking branch 'upstream/develop' into jshah_ph_mem

9943909

menshikh-iv changed the title ~~Reduce phrases memory usage (drop frequency from phrasegrams)~~ Reduce Phraser memory usage (drop frequency from phrasegrams) Jan 10, 2019

menshikh-iv changed the title ~~Reduce Phraser memory usage (drop frequency from phrasegrams)~~ Reduce Phraser memory usage (drop frequencies) Jan 10, 2019

menshikh-iv added 2 commits January 10, 2019 17:18

fix loading

021226a

make test better

9de2495

menshikh-iv merged commit c5a8f73 into piskvorky:develop Jan 11, 2019
