Adding common-words to Phrases #1258
Sounds useful, so I would welcome it as an option. If adding it to the existing class could be a small change, and it leaves the behavior of 'classic' mode unchanged when not activated, that seems an OK way to add it. But I know there may be some other efforts in progress to optimize (or Cythonize) this class. Looking at the description at the ElasticSearch link, I wonder:
It is needed functionality, and a pure Python implementation is a good place to start. Please make it optional though.
Hello, thanks for the comments. I will begin by providing a pure Python stand-alone implementation. @gojomo, +1 to using common_words as a name.
Per the example on the elasticsearch page (about "the quick brown is a fox"), I would expect the input:
["we", "provide", "car", "with", "driver"]
…to yield the output…
["we", "provide", "car", "car_with", "with", "with_driver", "driver"]
That might be ideal for search-indexing, and for some gensim users, but would be unexpected (with unclear implications) for something like Word2Vec neighboring-word context windows.
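To make that expected behaviour concrete, here is a rough standalone Python sketch of the elasticsearch-style expansion described above (the helper name common_grams and its signature are assumptions for illustration, not gensim or elasticsearch code):

```python
def common_grams(tokens, common_words, delimiter="_"):
    """Emit every unigram, plus a joined bigram wherever a common word
    touches its neighbour, mirroring elasticsearch's common_grams filter."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok in common_words or nxt in common_words):
            out.append(tok + delimiter + nxt)
    return out

print(common_grams(["we", "provide", "car", "with", "driver"], {"with"}))
# ['we', 'provide', 'car', 'car_with', 'with', 'with_driver', 'driver']
```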
@gojomo - to be clear, I only drew inspiration from the common-words filter; what I have is an adaptation (yielding only car_with_driver).
Despite the difficulties for later order-sensitive windows, the elasticsearch approach seems potentially more valuable to me, in that all possibilities are generated, and then only some might survive a later frequency- or salience-check. Always combining the common word with both sides seems likely to create overlong phrases. For example, what would (or should) happen in longer common-uncommon-common-uncommon-etc. patterns? For example, assuming each of the
...become just the 3 tokens...
?
What about bigrams where one of the words is a stopword but the pair is actually a phrase? This happens in legal documents, e.g. "Side A" or "Exhibit A". Would the CommonGrams Phraser help with that case?
I have a proof of concept of the Phrases class managing stop words, but before opening a pull request, I would be glad to know whether there is interest and how best to integrate it.
That is, currently, if you are trying to reveal ngrams like "car with driver" and "car without driver", you have two options: either remove stop words before processing, in which case you will only find "car driver", or keep them, in which case you will find neither form (partly because they are three words long, but also because the high frequency of "with" prevents them from being scored correctly).
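As a minimal sketch of the first workaround (toy corpus; the min_count/threshold values are purely illustrative), both expressions collapse to the same bigram once stop words are stripped:

```python
from gensim.models.phrases import Phrases

# Toy corpus; parameter values below are illustrative, not recommendations.
sentences = [["we", "provide", "car", "with", "driver"],
             ["you", "can", "rent", "a", "car", "without", "driver"],
             ["car", "with", "driver", "available"]]
stopwords = {"with", "without", "a", "can"}

# Workaround 1: strip stop words before training Phrases.
# "car with driver" and "car without driver" become indistinguishable:
# both can only surface as the bigram "car_driver".
cleaned = [[w for w in s if w not in stopwords] for s in sentences]
phrases = Phrases(cleaned, min_count=1, threshold=0.1)
print(phrases[cleaned[0]])  # something like ['we', 'provide', 'car_driver']
```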
Taking inspiration from elasticsearch and its common grams filter, I have an implementation which can handle stop words when finding those expressions. It does this by registering "car_with_driver" in the vocab instead of "car_with", and taking that entry into account when tokenizing phrases. I've made a gist of a draft implementation (not all functions are implemented yet).
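The gist is not reproduced here; as a rough standalone sketch of the idea only (the function name, the phrase_vocab lookup, and the greedy single-pass strategy are assumptions for illustration, not the gist's actual API), the tokenizing pass could bridge a run of common words between two content words and emit the whole span as one token:

```python
def bridge_common_words(tokens, common_words, phrase_vocab, delimiter="_"):
    """Greedy one-pass sketch: when content-word -> common-word(s) -> content-word
    forms a span registered in phrase_vocab (e.g. 'car_with_driver'),
    emit it as a single token; otherwise emit tokens unchanged."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] not in common_words:
            # Look ahead over any run of common words to the next content word.
            j = i + 1
            while j < len(tokens) and tokens[j] in common_words:
                j += 1
            if i + 1 < j < len(tokens):
                candidate = delimiter.join(tokens[i:j + 1])
                if candidate in phrase_vocab:
                    out.append(candidate)
                    i = j + 1
                    continue
        out.append(tokens[i])
        i += 1
    return out

# phrase_vocab would be learned from corpus counts; hard-coded here for the demo.
vocab = {"car_with_driver", "car_without_driver"}
print(bridge_common_words(["we", "provide", "car", "with", "driver"],
                          {"with", "without"}, vocab))
# ['we', 'provide', 'car_with_driver']
```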
If there is interest, I will open a PR.