Better support for evaluating threshold settings in models.phrases.Phrases #1465
Comments
Sounds useful!
@michaelwsherman implemented in #2979: The "old" functionality of finding phrases in a corpus was renamed to
Another request that comes up from time to time: add some hand-selected bigrams (or longer) that a user has independently determined they want as phrases.

It might be interesting to offer some tuning tools/methods that report: "if you want X, Y, Z to be phrasified, you'd have to set the parameters to N, M, etc, but then the top-20 most-marginal phrases would be {P1, P2, ...}" (so they'd see the side effects of those settings). Alternatively, it might be possible (and now easier/cleaner with the #2976/etc refactorings) to add some exception set of 'forced' user choices that always combine regardless of their score, or conversely some exception set of 'suppressed' phrases that never combine after the user notices they're unwanted, which might meet some user needs. (Though perhaps such exception lists are a fool's errand given the inherent roughness of this technique, which in my experience often improves the raw texts passed into IR/classification steps but is rarely conformant enough to human-level perceived phrases that you'd want to show the combinations to average users.)

(For people who don't need any bulk statistical phrase discovery, but just a preprocessing step that applies their hand-chosen phrases, some users don't realize that's pretty easy in Python; adding some code that only does that might be a nice preprocessing utility as well -- see for example my demo code in an SO answer: https://stackoverflow.com/questions/58839049/python-connect-composed-keywords-in-texts/58864397#58864397)
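For the preprocessing-only case mentioned at the end of that comment, here is a minimal sketch of what applying a hand-chosen phrase list could look like. It is a hypothetical userland helper in the spirit of the linked SO answer, not an existing gensim API; the function name, the tuple-based phrase set, and the "_" delimiter are assumptions.

```python
# Hypothetical helper (not a gensim API): greedily join hand-chosen multi-word
# phrases in a tokenized sentence, longest match first.
def join_known_phrases(tokens, phrases, delimiter="_"):
    """tokens: list of str; phrases: set of token tuples, e.g. {("new", "york")}."""
    max_len = max((len(p) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        matched = 0
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrases:
                matched = n
                break
        if matched:
            out.append(delimiter.join(tokens[i:i + matched]))
            i += matched
        else:
            out.append(tokens[i])
            i += 1
    return out

# join_known_phrases(["i", "love", "new", "york"], {("new", "york")})
# -> ["i", "love", "new_york"]
```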
Wouldn't this work?

    phrases = Phrases(…)
    frozen_phrases = phrases.freeze()
    frozen_phrases.phrasegrams['my_phrase'] = float('inf')

Likewise for a blacklist – remove the offending key. If that's all that is needed, then I'd leave it in userland / FAQ, no need for any special API.
Sure! That'd be great in a help page/recipe, or as convenience methods to 'force' or 'delete' specific phrases in a frozen model. (Though it strikes me that people may want to make the decision once that phrase X should either 'always' or 'never' be created, without having to potentially re-patch the model after any incremental training that might have undone a previous forced-score or manual deletion.)
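One way to address that re-patching concern without any new API, sketched under the assumption of the gensim 4.x Phrases/FrozenPhrases interface shown above: keep the 'forced' and 'suppressed' lists as plain data and re-apply them every time the model is re-frozen. The helper name, the corpus variables, and the example phrase keys are made up for illustration.

```python
from gensim.models.phrases import Phrases

# Hypothetical userland helper (not a gensim API): re-apply manual overrides
# to a freshly frozen model, so incremental retraining can't silently undo them.
def apply_overrides(frozen, forced=(), suppressed=()):
    for phrase in forced:
        frozen.phrasegrams[phrase] = float("inf")   # always combine
    for phrase in suppressed:
        frozen.phrasegrams.pop(phrase, None)        # never combine
    return frozen

# usage sketch (corpus variables and phrase keys are placeholders):
# phrases = Phrases(corpus)
# phrases.add_vocab(more_corpus)                    # incremental training
# frozen = apply_overrides(phrases.freeze(),
#                          forced=["new_york"], suppressed=["of_the"])
```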
One of the challenges when learning bigrams from a new corpus is determining the right threshold for your scoring to accept or reject a potential bigram. Standard approaches involve taking a list of gold-labeled bigrams created by humans, ranking all bigrams in your corpus by their score, and determining a score threshold based on a comparison to the gold labels. (For one example, see this paper.)
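A minimal sketch of that evaluation loop, assuming you already have a {bigram: score} mapping (however obtained) and a hand-labeled gold set of true bigrams; the function and variable names are illustrative, not an existing gensim API.

```python
# Sweep candidate thresholds and report precision/recall against gold labels.
def precision_recall_at_thresholds(scored_bigrams, gold_bigrams, thresholds):
    gold = set(gold_bigrams)
    results = []
    for t in thresholds:
        accepted = {b for b, s in scored_bigrams.items() if s >= t}
        true_pos = len(accepted & gold)
        precision = true_pos / len(accepted) if accepted else 0.0
        recall = true_pos / len(gold) if gold else 0.0
        results.append((t, precision, recall))
    return results
```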
Right now, doing this with the built-in functionality in models.Phrases requires running export_phrases on your whole corpus. For an especially large corpus, this could mean a lot of wasted time waiting for export_phrases to run. It also means getting some strange results: if you set your threshold very low, a strong bigram in your corpus may not be output by export_phrases if its first word is especially rare and the strong bigram is preceded by a limited set of words (you'd get a bunch of lower-scoring bigrams with the first word of the strong bigram as the second word of some weaker bigrams).
There should be a method that only traverses the vocab dictionary and returns something that shows the scores for the bigrams in the corpus. This would be faster than export_phrases and would ensure that all bigrams (that exceed some threshold) have their score output. I have code that does something like this, and I'm happy to contribute it. (Although it might make sense to wait until #1464 and maybe #1446 are finalized.)
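For concreteness, here is a rough sketch of what such a vocab-only scoring pass could look like. This is not existing gensim functionality; it assumes a gensim 4.x-style Phrases model where vocab maps plain-string unigrams/bigrams to counts, the delimiter is the model's delimiter attribute, and original_scorer is the default scoring function. Details such as the key format and attribute names may differ across gensim versions.

```python
from gensim.models.phrases import original_scorer

# Hypothetical sketch: score every bigram straight from the trained model's
# vocab counts, without re-scanning the corpus with export_phrases.
def score_all_bigrams(phrases_model, min_score=float("-inf")):
    vocab = phrases_model.vocab
    delim = phrases_model.delimiter
    scored = {}
    for key, bigram_count in vocab.items():
        if delim not in key:
            continue                                  # unigram entry, skip
        word_a, word_b = key.split(delim, 1)
        count_a, count_b = vocab.get(word_a, 0), vocab.get(word_b, 0)
        if not count_a or not count_b:
            continue
        score = original_scorer(
            worda_count=count_a,
            wordb_count=count_b,
            bigram_count=bigram_count,
            len_vocab=len(vocab),
            min_count=phrases_model.min_count,
            corpus_word_count=phrases_model.corpus_word_count,
        )
        if score >= min_score:
            scored[key] = score
    # highest-scoring bigrams first
    return dict(sorted(scored.items(), key=lambda kv: kv[1], reverse=True))
```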