models.Phrases only supports word2vec paper's method of scoring potential n-grams #1363
Comments
Pluggable scoring sounds interesting and useful. The docs and refactoring of Phrases/etc would have to be handled carefully to avoid user confusion or a drag on performance for people using the classic scoring. I believe there may be a Google Summer-of-Code project to make improvements to gensim's Phrases support – which may be focused on optimizations but might be able to consider other features/refactoring. A well-designed, conscientious refactoring might also help resolve the open question on #1263/#1258 – whether that functionality should be parameterized in the main class, or in a clearly distinct class. (Something like: swappable creators of candidate-phrases, and swappable scorers.) @tmylk should comment, but I'd think a PR could be useful even if it might take a while to integrate/reconcile into the main package. (It could be a useful model to others even before any full merge.)
Sounds good to me. One thing to look out for: objects must be serializable, so the "injected function as a parameter" must be named (pickle cannot handle lambdas).
Maybe create PMI & NPMI as methods in Phrases and pass a string with a name to the constructor (and match it to the method in …
Right now the code is what @menshikh-iv suggested trying--the scoring methods are specified with a text parameter that needs to be either 'mikolov' or 'npmi'. If this is a good starting point I'll make the pull request--give me a week or two to figure things out, since this is the first time I'll be contributing to open source. @piskvorky, you raise a good point. Is there an easy way to test whether a function is serializable (other than trying to pickle and unpickle it)? If it's an easy change, I can make my existing code into pluggable functions and add the check as part of the constructor.
-1 on passing strings instead of functions. Just pass the function as a parameter, directly -- more flexible. @michaelwsherman no such check is needed. Python is duck-typed, so it's up to the user to pass parameters of the correct "type". And I can imagine use-cases where no serialization is needed (user doesn't need to save/load), so a forceful check would be harmful.
@michaelwsherman I believe functions are serializable if-and-only-if they're defined globally (at the 'top level'). In that case pickle manages to write them as their global name; then, if during unpickling the same function is already importable under the same global name, everything works as expected.
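A minimal illustration of that pickling behaviour (the function name here is made up, not part of gensim):

```python
import pickle

def my_scorer(worda_count, wordb_count, bigram_count):
    # Defined at module top level, so pickle records it by its importable global name.
    return bigram_count

pickle.dumps(my_scorer)  # works: serialized as a reference to the global name

try:
    pickle.dumps(lambda a, b, c: c)
except Exception as err:
    # Fails: a lambda has no importable global name for pickle to reference.
    print("cannot pickle lambda:", err)
```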
Based on this mailing list discussion, it would be desirable for models.Phrases to support alternative methods of scoring potential n-grams, instead of forcing the use of the metric from the original word2vec paper (Section 4).
I have some changes I've made to models.Phrases that support an optional scoring function as an instance variable, which is then used in place of the default scoring. This is a relatively easy change to make, since it just replaces the score calculation in models.Phrases.export_phrases with a call to the scoring function.
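A rough sketch of what that could look like (names and signatures are illustrative, not the actual gensim code): the constructor stores a scoring callable, defaulting to the word2vec-paper formula, and the phrase-export loop calls it in place of the hard-coded calculation.

```python
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    # Classic scoring from the word2vec paper (Section 4), roughly:
    # (count(a b) - min_count) / (count(a) * count(b)), scaled by vocab size.
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


class Phrases:
    def __init__(self, scoring=original_scorer, min_count=5, threshold=10.0):
        self.min_count = min_count
        self.threshold = threshold
        self.vocab = {}  # token / bigram -> raw count
        self.scoring = scoring  # pluggable scorer, defaults to the classic one

    def _score_candidate(self, worda_count, wordb_count, bigram_count):
        # export_phrases would call this instead of the inline formula,
        # keeping candidates whose score exceeds the threshold.
        return self.scoring(worda_count, wordb_count, bigram_count,
                            len(self.vocab), self.min_count)
```

A caller would then pass the scorer function directly as a parameter, rather than a string naming it.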
I've also implemented normalized pointwise mutual information as an alternative scoring method in models.Phrases.export_phrases. It does, however, require an additional instance variable to track the corpus size, which is incremented as part of add_vocab.
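For reference, a sketch of NPMI scoring as usually defined (the signature is illustrative; the point is that it needs the total corpus word count, the extra statistic mentioned above):

```python
from math import log

def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count):
    # NPMI(a, b) = ln(P(a, b) / (P(a) * P(b))) / -ln(P(a, b)), bounded in [-1, 1].
    # Probabilities are estimated from raw counts over the corpus size,
    # which is the extra instance variable incremented in add_vocab.
    pa = worda_count / corpus_word_count
    pb = wordb_count / corpus_word_count
    pab = bigram_count / corpus_word_count
    return log(pab / (pa * pb)) / -log(pab)
```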
If these changes are desirable, let me know and I'll submit a pull request. Otherwise, thanks for reading and close away :).