-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-10299][ML] word2vec should allow users to specify the window size #8513
[SPARK-10299][ML] word2vec should allow users to specify the window size #8513
Conversation
Test build #41762 has finished for PR 8513 at commit
|
@@ -49,6 +49,17 @@ private[feature] trait Word2VecBase extends Params | |||
def getVectorSize: Int = $(vectorSize) | |||
|
|||
/** | |||
* The window size (context words from [-window, window]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: end line with "."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but we should also have the defaults documented in the scaladocs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe make a cleanup JIRA to do this for all the params?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor doc comments which could be addressed in separate PR |
Test build #41847 has finished for PR 8513 at commit
|
cc @Ishiihara who I think was maybe the original author of the fixed window size. |
ping @mengxr who has some recent commits in this file. |
@holdenk LGTM. The reason to make the window size constant is that the window size does not affect the result too much given a large corpus. |
@Ishiihara do you think it is worth merging in then or not so much? The documentation I've seen for different word2vec implementations seem to indicate that changing the window size can make a difference (and there is the user request to make it configurable). |
ping @mengxr or @jkbradley if this looks ok to you it would be nice to get merged in |
Although the window size doesn't matter a lot, yeah, it seems desirable to make it configurable. |
@srowen Would you be comfortable merging given the existing review by the original author? Or should I get another set of eyes to take a look? |
I'm OK with this but I'm only uncertain about merging for 1.6.0. Eh, 1.7.0? 2.0.0? it just matters because of the version label in |
Feel free to do whatever. |
@srowen @marmbrus @rxin since 1.6.0-RC2 will still be cut as there seem to be a few critical bugs, e.g. https://issues.apache.org/jira/browse/SPARK-12155 and https://issues.apache.org/jira/browse/SPARK-12165, and this is a very minor change, can we put in in 1.6.0? If not, I think just target 1.7.0 for now . LGTM too. |
I'm OK with that; it's quite safe and minor. I'd understand if someone objected since it's not a fix. Let me pause for that. |
@@ -49,6 +49,17 @@ private[feature] trait Word2VecBase extends Params | |||
def getVectorSize: Int = $(vectorSize) | |||
|
|||
/** | |||
* The window size (context words from [-window, window]). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State default
Just had minor comments, but I feel like the SQLContext issue should probably be fixed before merging. I'm OK with putting it in 1.6 |
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why removing this final line? i think this would fail style checker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this show up this way in the github diff viewer, there is a newline after the windowsize test (I'll remerge in master and see if fixes the diff view)
Test build #2182 has finished for PR 8513 at commit
|
Test build #47372 has finished for PR 8513 at commit
|
@jkbradley addressed the issues (also cleaned up the rest of the tests in the same file) |
Currently word2vec has the window hard coded at 5, some users may want different sizes (for example if using on n-gram input or similar). User request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size . Author: Holden Karau <holden@us.ibm.com> Author: Holden Karau <holden@pigscanfly.ca> Closes #8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size. (cherry picked from commit 22b9a87) Signed-off-by: Sean Owen <sowen@cloudera.com>
Merged to master/1.6 |
Currently word2vec has the window hard coded at 5, some users may want different sizes (for example if using on n-gram input or similar). User request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .