Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1533 fix and 1464 1423 comments #1573

Merged

Conversation

michaelwsherman
Copy link
Contributor

Fix for #1533 , Phrases.getitem now support custom scoring. Pluggable scoring via a function parameter is Phrases is now supported.

Made fixes in comments in PRs #1464 and #1423, except did not explicitly cast floats in the pre-defined scoring methods. Rather, float casting is now done before calling the scoring method.

@michaelwsherman
Copy link
Contributor Author

One unfortunate omission here is a lack of a test of pickling a Phrases and a Phraser object, to make sure pluggable scoring works properly after unpickling. I don't know enough about pickling to know what the risks here are, or if it is even worth testing this.

I'm worried that if a Phraser (or Phrases) object from an older version of gensim is loaded, it won't work now since older Phraser objects won't have an assigned scoring function and utils.SaveLoad is just pickling. I think the backwards compatibility could easily be handled (just a matter of checking for the presence of the scoring function and loading the right one if it isn't present), but I'm not sure of the best place to handle it from an architecture standpoint.

@piskvorky
Copy link
Owner

piskvorky commented Sep 6, 2017

@michaelwsherman thanks, much appreciated!

The best place to handle backward compatibility is inside load. That is, after loading an object, check whether the required attributes exist. If not, this implies an older version, so set the attributes to the defaults that correspond to the "old" behaviour (=backward compatibility).

You can see that approach in action in word2vec here: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/word2vec.py#L1408

@michaelwsherman
Copy link
Contributor Author

I'm failing a test in test_sklearn_api.py related to pickling and unpickling (see the end of this comment). It's not surprising that pickling doesn't work, since scorer is now a function and functions don't pickle with default pickling.

The fix for this likely involves some custom pickling. My (brief) research has led me to the __getstate__() and __setstate__() methods (as described in the python docs). It looks like this needs to be done to support pluggable scoring through utils.SaveLoad as well. This work is in addition to the compatibility checks that need to be handled inside of load.

I can maybe embark on these fixes, but I'd like to know I'm on the right track before I go down this rabbit hole. It's also likely going to be a bit (a few weeks) for this fix, as I'll probably need a couple of days to figure out how this is done (mainly for the pickling part).

======================================================================
ERROR: testPersistence (main.TestPhrasesTransformer)

Traceback (most recent call last):
File "test_sklearn_api.py", line 946, in testPersistence
model_dump = pickle.dumps(self.model)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 1380, in dumps
Pickler(file, protocol).dump(obj)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 224, in dump
self.save(obj)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 425, in save_reduce
save(state)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 669, in _batch_setitems
save(v)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 425, in save_reduce
save(state)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 669, in _batch_setitems
save(v)
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Users\msherman49\AppData\Local\Continuum\Anaconda3\envs\ml\lib\pickle
.py", line 754, in save_global
(obj, module, name))
PicklingError: Can't pickle <function original_scorer at 0x0000000006C71208>: it
's not found as gensim.models.phrases.original_scorer


@piskvorky
Copy link
Owner

piskvorky commented Sep 7, 2017

@michaelwsherman that doesn't sound right. Functions are picklable no problem, they just need to be named functions (e.g. gensim.models.phrases.original_scorer is fine; lambda x: x + 2 is not fine), since it's the function name (identifier) that actually gets pickled.

The traceback suggests a named function, so I suspect the problem lies elsewhere.

@michaelwsherman
Copy link
Contributor Author

@piskvorky Sorry, you're right, and thank you for your quick response. This is a good learning experience for me.

I think it is because the methods are static methods in the models.phrases.Phrases class. I don't think there is any reason they have to be static methods in the class, so I'm making them regular methods in the base of models.Phrases. This may be even better, since any passed scoring method would not be in the Phrases class. All the tests pass if I do this.

I'm going to make this change, add support for the custom scoring to the scikit interface, and add tests for using a custom scorer (including persistence) to the scikit interface, add the save/load support for backwards compatibility, and throw together some tests for that as well. These changes may take a little bit.

I'm not going to add support for backwards compatibility through the scikit interface, as I expect that persistence via pickle across versions is not supported. Tell me if this is incorrect, and I'll start looking into the possibility of it.

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions + code style comments.

@@ -177,9 +177,9 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
# to still run the check of scoring function parameters in the next code block
if type(scoring) is str:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if isinstance(scoring, basestring) more standard and future-proof.

from math import log
from inspect import getargspec
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import not used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used in line 188 (in the commit your comments are on) to check for the proper parameters in the pluggable scoring function.

Copy link
Owner

@piskvorky piskvorky Sep 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I see it now. What is that check for though? Python is duck-typed by convention, so "type checks" are best postponed until truly needed (something breaks).

What is the rationale for this pre-emptive type check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly to save the stress that would result from improperly specifying a scoring function when initializing the phrases object. I know Python will do the type checking when the scoring function is called, but that won't happen until export_phrases or getitem is called. The "normal" workflow for the Phrases object is to just specify sentences on load, or to use add_vocab. Only after that does the scoring function get called.

I could easily see a user specifying a bad scoring method and then making the vocab dictionary from their large corpus. Only after significant time extracting vocab from a corpus do they then discover that something is wrong with how they specified scoring. At this point you could manually specify a correct scoring function, but that requires you to set it directly. Users also wouldn't have an easy bailout in the form of use one of the scorer string settings, since those are only checked when the Phrases object is created--the user would have to figure out how to specify those built in scorers which would mean opening up the code. This seems a bit user unfriendly, I feel it is friendlier to just do the type checking on initialization even if it is less Pythonic.

This could be fixed with a set_scorer method that takes the string or function input, but that seems a bit more awkward than just doing this type check.

There's also an issue with wanting to raise an informative exception when the scoring function is called in getitem or export_phrases and the types don't match, but that means adding a try/except into the main scoring loop and that seems awkward as well. I think its better to just do that try/except once when the object is initialized.

But I defer to your judgement on this--what do you think is best?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I see your argument (that checking early a little more convenient).

I'm not sure if it's worth it, but don't care much either way. I'll defer to @menshikh-iv :)

# len_vocab and min_count set so functools.partial works
@staticmethod
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab=0.0, min_count=0.0):
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
return (bigram_count - min_count) / worda_count / wordb_count * len_vocab
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad indent.

def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=0.0):

# normalized PMI, requires corpus size
def npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
pa = worda_count / corpus_word_count
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad indent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about these, very sloppy on my part.

self.input_files = [self.source] # force code compatibility with list of files
elif os.path.isdir(self.source):
self.source = os.path.join(self.source, '') # ensures os-specific slash at end of path
logging.debug('reading directory ' + self.source)
logger.warning('reading directory %s', self.source)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a warning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a mistake, changing to info.

@@ -1563,7 +1563,7 @@ def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
"""
`source` should be a path to a directory (as a string) where all files can be opened by the
LineSentence class. Each file will be read up to
`limit` lines (or no clipped if limit is None, the default).
`limit` lines (or not clipped if limit is None, the default).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs are not clear -- does the "will process all files in a directory" work recursively?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not. Maybe wishlist? I've clarified the docs.

@@ -1577,23 +1577,23 @@ def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
self.limit = limit

if os.path.isfile(self.source):
logging.warning('single file read, better to use models.word2vec.LineSentence')
logger.warning('single file read, better to use models.word2vec.LineSentence')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the class API contract supports it, this is no warning (maybe debug, at most).

If it's outside the API contract, this is an error and we should raise an exception, not log a warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified this message a bit, made it debug.

@@ -177,6 +176,13 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
# set scoring based on string
# intentially override the value of the scoring parameter rather than set self.scoring here,
# to still run the check of scoring function parameters in the next code block

# for python 2 and 3 compatibility. basestring is used to check if scoring is a string
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there :) We use six in gensim for py2/py3 compatibility, so isinstance on six.string_types is probably what we want.

@michaelwsherman
Copy link
Contributor Author

This should be ready to go now, the requested changes have been made (or discussed) and everything is up to date with develop.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelwsherman sorry for waiting, looks good for me:+1:
Let's wait for merge #1568, after - need to resolve conflicts here and merge.

@menshikh-iv
Copy link
Contributor

Nice work @michaelwsherman 🔥

@menshikh-iv menshikh-iv merged commit a5872fa into piskvorky:develop Oct 24, 2017
horpto pushed a commit to horpto/gensim that referenced this pull request Oct 28, 2017
…iskvorky#1573)

* initial commit of fixes in comments of piskvorky#1423

* removed unnecessary space in logger

* added support for custom Phrases scorers

* fixed Phrases.__getitem__ to support pluggable scoring piskvorky#1533

* travisCI style fixes

* fixed __next__() to next() for python 3 compatibilyt

* misc fixes

* spacing fixes for style

* custom scorer support in sklearn api

* Phrases scikit interface tests for pluggable scoring

* missing line breaks

* style, clarity, and robustness fixes requested by @piskvorky

* check in Phrases init to make sure scorer is pickleable

* backwards scoring compatibility when loading a Phrases class

* removal of pickle testing objects in Phrases init

* switched to six for python 2/3 compatibility

* fix docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants