models.Phrases only supports word2vec paper's method of scoring potential n-grams #1363
Comments
Pluggable scoring sounds interesting and useful. The docs and refactoring of Phrases/etc would have to be handled carefully to avoid user confusion or a drag on performance for people using the classic scoring. I believe there may be a Google Summer-of-Code project to make improvements to gensim's Phrases support – which may be focused on optimizations but might be able to consider other features/refactoring. A well-designed, conscientious refactoring might also help resolve the open question on #1263/#1258 – whether that functionality should be parameterized in the main class, or in a clearly distinct class. (Something like: swappable creators of candidate-phrases, and swappable scorers.) @tmylk should comment, but I'd think a PR could be useful even if it might take a while to integrate/reconcile into the main package. (It could be a useful model to others even before any full merge.)
Sounds good to me. One thing to look out for: objects must be serializable, so the "injected function as a parameter" must be named (pickle cannot handle lambdas).
Maybe create PMI & NPMI as methods in Phrases and pass a string with a name to the constructor (and match it to the method in …
Right now the code is what @menshikh-iv suggested trying--the scoring methods are specified with a text parameter that needs to be either 'mikolov' or 'npmi'. If this is a good starting point I'll make the pull request--give me a week or two to figure things out, since this is the first time I'll be contributing to open source. @piskvorky, you raise a good point. Is there an easy way to test whether a function is serializable (other than trying to pickle and unpickle it)? If it's an easy change, I can make my existing code into pluggable functions and add the check as part of the constructor.
-1 on passing strings instead of functions. Just pass the function as a parameter, directly -- more flexible. @michaelwsherman no such check is needed. Python is duck-typed, so it's up to the user to pass parameters of the correct "type". And I can imagine use-cases where no serialization is needed (user doesn't need to save/load), so a forceful check would be harmful.
@michaelwsherman I believe functions are serializable if-and-only-if they're defined globally (at the 'top level'). In that case pickle manages to write them as their global name; then, if during unpickling the same function is already importable under the same global name, everything works as expected.
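A minimal illustration of that pickling behaviour (the function name here is made up, not part of gensim):

```python
import pickle

def my_scorer(worda_count, wordb_count, bigram_count):
    # Defined at module top level, so pickle records it by its importable global name.
    return bigram_count

pickle.dumps(my_scorer)  # works: serialized as a reference to the global name

try:
    pickle.dumps(lambda a, b, c: c)
except Exception as err:
    # Fails: a lambda has no importable global name for pickle to reference.
    print("cannot pickle lambda:", err)
```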
Based on this mailing list discussion, it would be desirable for models.Phrases to support alternative methods of scoring potential n-grams, instead of forcing the use of the metric from the original word2vec paper (Section 4).
I have some changes I've made to models.Phrases that support an optional scoring function as an instance variable, which is then used in place of the default scoring. This is a relatively easy change to make, since it just replaces the score calculation in models.Phrases.export_phrases with a call to the scoring function.
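A rough sketch of what that could look like (names and signatures are illustrative, not the actual gensim code): the constructor stores a scoring callable, defaulting to the word2vec-paper formula, and the phrase-export loop calls it in place of the hard-coded calculation.

```python
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    # Classic scoring from the word2vec paper (Section 4), roughly:
    # (count(a b) - min_count) / (count(a) * count(b)), scaled by vocab size.
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


class Phrases:
    def __init__(self, scoring=original_scorer, min_count=5, threshold=10.0):
        self.min_count = min_count
        self.threshold = threshold
        self.vocab = {}  # token / bigram -> raw count
        self.scoring = scoring  # pluggable scorer, defaults to the classic one

    def _score_candidate(self, worda_count, wordb_count, bigram_count):
        # export_phrases would call this instead of the inline formula,
        # keeping candidates whose score exceeds the threshold.
        return self.scoring(worda_count, wordb_count, bigram_count,
                            len(self.vocab), self.min_count)
```

A caller would then pass the scorer function directly as a parameter, rather than a string naming it.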
I've also implemented normalized pointwise mutual information as an alternative scoring method in models.Phrases.export_phrases. It does, however, require an additional instance variable to track the corpus size, which is incremented as part of add_vocab.
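For reference, a sketch of NPMI scoring as usually defined (the signature is illustrative; the point is that it needs the total corpus word count, the extra statistic mentioned above):

```python
from math import log

def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count):
    # NPMI(a, b) = ln(P(a, b) / (P(a) * P(b))) / -ln(P(a, b)), bounded in [-1, 1].
    # Probabilities are estimated from raw counts over the corpus size,
    # which is the extra instance variable incremented in add_vocab.
    pa = worda_count / corpus_word_count
    pb = wordb_count / corpus_word_count
    pab = bigram_count / corpus_word_count
    return log(pab / (pa * pb)) / -log(pab)
```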
If these changes are desirable, let me know and I'll submit a pull request. Otherwise, thanks for reading and close away :).