-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
1533 fix and 1464 1423 comments #1573
Merged
menshikh-iv
merged 19 commits into
piskvorky:develop
from
bloomberg:1533_fix_and_1464_1423_comments
Oct 24, 2017
Merged
Changes from all commits
Commits
Show all changes
19 commits
Select commit
Hold shift + click to select a range
21c4401
initial commit of fixes in comments of #1423
0590c2f
removed unnecessary space in logger
34dc58f
added support for custom Phrases scorers
32b66bd
fixed Phrases.__getitem__ to support pluggable scoring #1533
9b3f801
travisCI style fixes
2698aa7
fixed __next__() to next() for python 3 compatibilyt
accea8c
misc fixes
8854097
spacing fixes for style
bbaf3f7
custom scorer support in sklearn api
4e555c4
Phrases scikit interface tests for pluggable scoring
b16554f
missing line breaks
a94a3fd
style, clarity, and robustness fixes requested by @piskvorky
f9cc04f
check in Phrases init to make sure scorer is pickleable
5bbe144
backwards scoring compatibility when loading a Phrases class
1481342
removal of pickle testing objects in Phrases init
fb7fbb1
switched to six for python 2/3 compatibility
d7bdcc0
merged changes from upstream/develop
336f4f6
Merge branch 'develop' into 1533_fix_and_1464_1423_comments
menshikh-iv e866d3f
fix docstring
menshikh-iv File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used in line 188 (in the commit your comments are on) to check for the proper parameters in the pluggable scoring function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I see it now. What is that check for though? Python is duck-typed by convention, so "type checks" are best postponed until truly needed (something breaks).
What is the rationale for this pre-emptive type check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly to save the stress that would result from improperly specifying a scoring function when initializing the phrases object. I know Python will do the type checking when the scoring function is called, but that won't happen until export_phrases or getitem is called. The "normal" workflow for the Phrases object is to just specify sentences on load, or to use add_vocab. Only after that does the scoring function get called.
I could easily see a user specifying a bad scoring method and then making the vocab dictionary from their large corpus. Only after significant time extracting vocab from a corpus do they then discover that something is wrong with how they specified scoring. At this point you could manually specify a correct scoring function, but that requires you to set it directly. Users also wouldn't have an easy bailout in the form of use one of the scorer string settings, since those are only checked when the Phrases object is created--the user would have to figure out how to specify those built in scorers which would mean opening up the code. This seems a bit user unfriendly, I feel it is friendlier to just do the type checking on initialization even if it is less Pythonic.
This could be fixed with a set_scorer method that takes the string or function input, but that seems a bit more awkward than just doing this type check.
There's also an issue with wanting to raise an informative exception when the scoring function is called in getitem or export_phrases and the types don't match, but that means adding a try/except into the main scoring loop and that seems awkward as well. I think its better to just do that try/except once when the object is initialized.
But I defer to your judgement on this--what do you think is best?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I see your argument (that checking early a little more convenient).
I'm not sure if it's worth it, but don't care much either way. I'll defer to @menshikh-iv :)