1533 fix and 1464 1423 comments #1573

michaelwsherman · 2017-09-06T17:37:14Z

Fix for #1533 , Phrases.getitem now support custom scoring. Pluggable scoring via a function parameter is Phrases is now supported.

Made fixes in comments in PRs #1464 and #1423, except did not explicitly cast floats in the pre-defined scoring methods. Rather, float casting is now done before calling the scoring method.

michaelwsherman · 2017-09-06T18:08:40Z

One unfortunate omission here is a lack of a test of pickling a Phrases and a Phraser object, to make sure pluggable scoring works properly after unpickling. I don't know enough about pickling to know what the risks here are, or if it is even worth testing this.

I'm worried that if a Phraser (or Phrases) object from an older version of gensim is loaded, it won't work now since older Phraser objects won't have an assigned scoring function and utils.SaveLoad is just pickling. I think the backwards compatibility could easily be handled (just a matter of checking for the presence of the scoring function and loading the right one if it isn't present), but I'm not sure of the best place to handle it from an architecture standpoint.

piskvorky · 2017-09-06T18:41:39Z

@michaelwsherman thanks, much appreciated!

The best place to handle backward compatibility is inside load. That is, after loading an object, check whether the required attributes exist. If not, this implies an older version, so set the attributes to the defaults that correspond to the "old" behaviour (=backward compatibility).

You can see that approach in action in word2vec here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1408

michaelwsherman · 2017-09-06T20:41:18Z

I'm failing a test in test_sklearn_api.py related to pickling and unpickling (see the end of this comment). It's not surprising that pickling doesn't work, since scorer is now a function and functions don't pickle with default pickling.

The fix for this likely involves some custom pickling. My (brief) research has led me to the __getstate__() and __setstate__() methods (as described in the python docs). It looks like this needs to be done to support pluggable scoring through utils.SaveLoad as well. This work is in addition to the compatibility checks that need to be handled inside of load.

I can maybe embark on these fixes, but I'd like to know I'm on the right track before I go down this rabbit hole. It's also likely going to be a bit (a few weeks) for this fix, as I'll probably need a couple of days to figure out how this is done (mainly for the pickling part).

======================================================================
ERROR: testPersistence (main.TestPhrasesTransformer)

Traceback (most recent call last):
File "test_sklearn_api.py", line 946, in testPersistence
model_dump = pickle.dumps(self.model)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 1380, in dumps
Pickler(file, protocol).dump(obj)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 224, in dump
self.save(obj)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 425, in save_reduce
save(state)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 669, in _batch_setitems
save(v)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 425, in save_reduce
save(state)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 669, in _batch_setitems
save(v)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 754, in save_global
(obj, module, name))
PicklingError: Can't pickle <function original_scorer at 0x0000000006C71208>: it
's not found as gensim.models.phrases.original_scorer

piskvorky · 2017-09-07T08:36:37Z

@michaelwsherman that doesn't sound right. Functions are picklable no problem, they just need to be named functions (e.g. gensim.models.phrases.original_scorer is fine; lambda x: x + 2 is not fine), since it's the function name (identifier) that actually gets pickled.

The traceback suggests a named function, so I suspect the problem lies elsewhere.

michaelwsherman · 2017-09-07T13:15:14Z

@piskvorky Sorry, you're right, and thank you for your quick response. This is a good learning experience for me.

I think it is because the methods are static methods in the models.phrases.Phrases class. I don't think there is any reason they have to be static methods in the class, so I'm making them regular methods in the base of models.Phrases. This may be even better, since any passed scoring method would not be in the Phrases class. All the tests pass if I do this.

I'm going to make this change, add support for the custom scoring to the scikit interface, and add tests for using a custom scorer (including persistence) to the scikit interface, add the save/load support for backwards compatibility, and throw together some tests for that as well. These changes may take a little bit.

I'm not going to add support for backwards compatibility through the scikit interface, as I expect that persistence via pickle across versions is not supported. Tell me if this is incorrect, and I'll start looking into the possibility of it.

piskvorky

A few questions + code style comments.

piskvorky · 2017-09-07T17:39:09Z

gensim/models/phrases.py

@@ -177,9 +177,9 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
        # to still run the check of scoring function parameters in the next code block
        if type(scoring) is str:


if isinstance(scoring, basestring) more standard and future-proof.

piskvorky · 2017-09-07T17:40:05Z

gensim/models/phrases.py

 from math import log
+from inspect import getargspec


Import not used?

Used in line 188 (in the commit your comments are on) to check for the proper parameters in the pluggable scoring function.

Thanks, I see it now. What is that check for though? Python is duck-typed by convention, so "type checks" are best postponed until truly needed (something breaks).

What is the rationale for this pre-emptive type check?

Mostly to save the stress that would result from improperly specifying a scoring function when initializing the phrases object. I know Python will do the type checking when the scoring function is called, but that won't happen until export_phrases or getitem is called. The "normal" workflow for the Phrases object is to just specify sentences on load, or to use add_vocab. Only after that does the scoring function get called.

I could easily see a user specifying a bad scoring method and then making the vocab dictionary from their large corpus. Only after significant time extracting vocab from a corpus do they then discover that something is wrong with how they specified scoring. At this point you could manually specify a correct scoring function, but that requires you to set it directly. Users also wouldn't have an easy bailout in the form of use one of the scorer string settings, since those are only checked when the Phrases object is created--the user would have to figure out how to specify those built in scorers which would mean opening up the code. This seems a bit user unfriendly, I feel it is friendlier to just do the type checking on initialization even if it is less Pythonic.

This could be fixed with a set_scorer method that takes the string or function input, but that seems a bit more awkward than just doing this type check.

There's also an issue with wanting to raise an informative exception when the scoring function is called in getitem or export_phrases and the types don't match, but that means adding a try/except into the main scoring loop and that seems awkward as well. I think its better to just do that try/except once when the object is initialized.

But I defer to your judgement on this--what do you think is best?

Thanks, I see your argument (that checking early a little more convenient).

I'm not sure if it's worth it, but don't care much either way. I'll defer to @menshikh-iv :)

piskvorky · 2017-09-07T17:40:28Z

gensim/models/phrases.py

-    # len_vocab and min_count set so functools.partial works
-    @staticmethod
-    def original_scorer(worda_count, wordb_count, bigram_count, len_vocab=0.0, min_count=0.0):
+def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
        return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


Bad indent.

piskvorky · 2017-09-07T17:40:34Z

gensim/models/phrases.py

-    def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=0.0):
+
+# normalized PMI, requires corpus size
+def npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
        pa = worda_count / corpus_word_count


Bad indent.

Sorry about these, very sloppy on my part.

piskvorky · 2017-09-07T17:41:03Z

gensim/models/word2vec.py

            self.input_files = [self.source]  # force code compatibility with list of files
        elif os.path.isdir(self.source):
            self.source = os.path.join(self.source, '')  # ensures os-specific slash at end of path
-            logging.debug('reading directory ' + self.source)
+            logger.warning('reading directory %s', self.source)


Why is this a warning?

I made a mistake, changing to info.

piskvorky · 2017-09-07T17:42:08Z

gensim/models/word2vec.py

@@ -1563,7 +1563,7 @@ def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        """
        `source` should be a path to a directory (as a string) where all files can be opened by the
        LineSentence class. Each file will be read up to
-        `limit` lines (or no clipped if limit is None, the default).
+        `limit` lines (or not clipped if limit is None, the default).


The docs are not clear -- does the "will process all files in a directory" work recursively?

It does not. Maybe wishlist? I've clarified the docs.

piskvorky · 2017-09-07T17:43:26Z

gensim/models/word2vec.py

@@ -1577,23 +1577,23 @@ def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        self.limit = limit

        if os.path.isfile(self.source):
-            logging.warning('single file read, better to use models.word2vec.LineSentence')
+            logger.warning('single file read, better to use models.word2vec.LineSentence')


If the class API contract supports it, this is no warning (maybe debug, at most).

If it's outside the API contract, this is an error and we should raise an exception, not log a warning.

Clarified this message a bit, made it debug.

piskvorky · 2017-09-08T07:47:40Z

gensim/models/phrases.py

@@ -177,6 +176,13 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
        # set scoring based on string
        # intentially override the value of the scoring parameter rather than set self.scoring here,
        # to still run the check of scoring function parameters in the next code block
+
+        # for python 2 and 3 compatibility. basestring is used to check if scoring is a string


Almost there :) We use six in gensim for py2/py3 compatibility, so isinstance on six.string_types is probably what we want.

michaelwsherman · 2017-10-03T22:50:20Z

This should be ready to go now, the requested changes have been made (or discussed) and everything is up to date with develop.

menshikh-iv

@michaelwsherman sorry for waiting, looks good for me:+1:
Let's wait for merge #1568, after - need to resolve conflicts here and merge.

menshikh-iv · 2017-10-24T12:21:28Z

Nice work @michaelwsherman 🔥

@piskvorky

…iskvorky#1573) * initial commit of fixes in comments of piskvorky#1423 * removed unnecessary space in logger * added support for custom Phrases scorers * fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533 * travisCI style fixes * fixed __next__() to next() for python 3 compatibilyt * misc fixes * spacing fixes for style * custom scorer support in sklearn api * Phrases scikit interface tests for pluggable scoring * missing line breaks * style, clarity, and robustness fixes requested by @piskvorky * check in Phrases init to make sure scorer is pickleable * backwards scoring compatibility when loading a Phrases class * removal of pickle testing objects in Phrases init * switched to six for python 2/3 compatibility * fix docstring

Michael Sherman added 4 commits September 5, 2017 13:27

initial commit of fixes in comments of piskvorky#1423

21c4401

removed unnecessary space in logger

0590c2f

added support for custom Phrases scorers

34dc58f

fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533

32b66bd

travisCI style fixes

9b3f801

This was referenced Sep 6, 2017

models.Phrases multiple scoring methods (#1363) #1464

Merged

Add word2vec.PathLineSentences for reading a directory as a corpus (#1364) #1423

Merged

Phrases __getitem__() method does not respect chosen scoring function #1533

Closed

fixed __next__() to next() for python 3 compatibilyt

2698aa7

Michael Sherman added 4 commits September 7, 2017 10:53

misc fixes

accea8c

spacing fixes for style

8854097

custom scorer support in sklearn api

bbaf3f7

Phrases scikit interface tests for pluggable scoring

4e555c4

piskvorky requested changes Sep 7, 2017

View reviewed changes

Michael Sherman added 5 commits September 7, 2017 14:23

missing line breaks

b16554f

style, clarity, and robustness fixes requested by @piskvorky

a94a3fd

check in Phrases init to make sure scorer is pickleable

f9cc04f

backwards scoring compatibility when loading a Phrases class

5bbe144

removal of pickle testing objects in Phrases init

1481342

piskvorky requested changes Sep 8, 2017

View reviewed changes

Michael Sherman added 2 commits September 11, 2017 11:32

switched to six for python 2/3 compatibility

fb7fbb1

merged changes from upstream/develop

d7bdcc0

menshikh-iv reviewed Oct 18, 2017

View reviewed changes

menshikh-iv mentioned this pull request Oct 19, 2017

Scoring function in Phrases model is hardcoded #1635

Closed

menshikh-iv added 2 commits October 24, 2017 14:57

Merge branch 'develop' into 1533_fix_and_1464_1423_comments

336f4f6

fix docstring

e866d3f

menshikh-iv merged commit a5872fa into piskvorky:develop Oct 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1533 fix and 1464 1423 comments #1573

1533 fix and 1464 1423 comments #1573

michaelwsherman commented Sep 6, 2017

michaelwsherman commented Sep 6, 2017

piskvorky commented Sep 6, 2017 •

edited

Loading

michaelwsherman commented Sep 6, 2017

======================================================================
ERROR: testPersistence (main.TestPhrasesTransformer)

piskvorky commented Sep 7, 2017 •

edited

Loading

michaelwsherman commented Sep 7, 2017

piskvorky left a comment

piskvorky Sep 7, 2017

piskvorky Sep 7, 2017

michaelwsherman Sep 7, 2017

piskvorky Sep 8, 2017 •

edited

Loading

michaelwsherman Sep 11, 2017

piskvorky Sep 11, 2017

piskvorky Sep 7, 2017

piskvorky Sep 7, 2017

michaelwsherman Sep 7, 2017

piskvorky Sep 7, 2017

michaelwsherman Sep 7, 2017

piskvorky Sep 7, 2017

michaelwsherman Sep 7, 2017

piskvorky Sep 7, 2017

michaelwsherman Sep 7, 2017

piskvorky Sep 8, 2017

michaelwsherman commented Oct 3, 2017

menshikh-iv left a comment

menshikh-iv commented Oct 24, 2017

		@@ -177,9 +177,9 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
		# to still run the check of scoring function parameters in the next code block
		if type(scoring) is str:

1533 fix and 1464 1423 comments #1573

1533 fix and 1464 1423 comments #1573

Conversation

michaelwsherman commented Sep 6, 2017

michaelwsherman commented Sep 6, 2017

piskvorky commented Sep 6, 2017 • edited Loading

michaelwsherman commented Sep 6, 2017

====================================================================== ERROR: testPersistence (main.TestPhrasesTransformer)

piskvorky commented Sep 7, 2017 • edited Loading

michaelwsherman commented Sep 7, 2017

piskvorky left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

piskvorky Sep 8, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelwsherman commented Oct 3, 2017

menshikh-iv left a comment

Choose a reason for hiding this comment

menshikh-iv commented Oct 24, 2017

piskvorky commented Sep 6, 2017 •

edited

Loading

======================================================================
ERROR: testPersistence (main.TestPhrasesTransformer)

piskvorky commented Sep 7, 2017 •

edited

Loading

piskvorky Sep 8, 2017 •

edited

Loading