Make Phrases model stopword-aware to prevent non-adjacent pairings #1506
Comments
I disagree that this should be part of Phrases. But I do think there is value in gensim handling stop word removal (and other common text preprocessing tasks) with a bit more wrapping. There's an issue of feature creep within Phrases here--if Phrases does stopword removal, shouldn't Phrases also support the removal of symbols, the conversion of all text to lowercase, removal of possessives, synonym substitution, etc.? All are very standard text preprocessing rules (and I'd argue that symbol removal and case normalization are done more often than stop word removal before running NLP algorithms, so those would be more natural starting points). A better approach might be a separate TextPreprocess class that takes raw text and outputs lists of tokens via a generator, making it naturally compatible with other gensim classes and methods (similar to the Phraser class). This would let you just call your preprocessor and feed the output into whatever gensim model you want.
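That interface might look roughly like the following sketch. `TextPreprocess` and everything about its API are hypothetical here--nothing below exists in gensim; it only illustrates the "ordered rules, generator output" idea:

```python
# Hypothetical TextPreprocess class: applies an ordered list of
# token-level rules and yields token lists, so its output could be fed
# straight into gensim models that accept iterables of token lists.

class TextPreprocess:
    def __init__(self, rules):
        # each rule is a callable: list[str] -> list[str]
        self.rules = rules

    def transform(self, docs):
        """Yield a preprocessed token list for each raw document."""
        for doc in docs:
            tokens = doc.split()
            for rule in self.rules:
                tokens = rule(tokens)
            yield tokens

# Example rules: strip punctuation, lowercase, drop emptied tokens.
strip_symbols = lambda toks: [t.strip(".,!?") for t in toks]
lowercase = lambda toks: [t.lower() for t in toks]
drop_empty = lambda toks: [t for t in toks if t]

pre = TextPreprocess([strip_symbols, lowercase, drop_empty])
corpus = list(pre.transform(["New York is a state.", "Hello, world!"]))
```

Swapping preprocessing schemes would then just mean constructing a different `TextPreprocess` with a different rule list.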
This design allows saving preprocessor rules/schemes, which lets you easily test the efficacy of different preprocessing rules on your downstream task (preprocessing is VERY important for some of the algorithms in gensim, even if it isn't discussed much in the literature). Running a different preprocessing routine would be as simple as swapping one instance of the TextPreprocess class for another when you build your models. You could also easily apply separate preprocessing rules before and after you find (and replace) ngrams. There's also maybe a larger discussion here about support in gensim for text processing pipelines. Would this even be in scope for gensim, which is really focused on modeling? Also, there is already a set of preprocessing methods in gensim.
@michaelwsherman I agree on all counts, and I like the idea of packaging up preprocessing in a single class like the TextPreprocess you describe. However, for this particular ticket, I'm not proposing that Phrases perform stopword removal itself. Compare the output with no prior preprocessing against the output with prior preprocessing: if I first remove stopwords, I end up with incorrect results. Words that were not actually sequential in my text are seen as sequential because the stopword deletion alters the original word offsets in my input documents. What I'm proposing is not to remove stopwords from the underlying text when training a Phrases model, but to make the model aware of stopwords so it never pairs tokens across one.
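The false-adjacency effect can be shown in plain Python, without gensim at all, since `Phrases` considers each token together with its +1 neighbour:

```python
# Why removing stopwords *before* phrase detection creates false
# adjacencies: deleting tokens shifts the offsets, so words that were
# never neighbours get paired.

STOPWORDS = {"is", "a"}
tokens = "new york is a state".split()

def bigram_candidates(toks):
    """Pair each token with its immediate successor."""
    return list(zip(toks, toks[1:]))

# Original stream: "york" and "state" are never paired.
original = bigram_candidates(tokens)

# After stopword removal they look adjacent and can be merged into
# a bogus bigram, which a second Phrases pass could then extend to
# the bogus trigram "new_york_state".
filtered = [t for t in tokens if t not in STOPWORDS]
after_removal = bigram_candidates(filtered)
# after_removal == [('new', 'york'), ('york', 'state')]
```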
@macks22 I apologize! Upon re-reading your original comment I realize I read it incorrectly. I see your point, and I agree with you. This is a good idea.
Hey, I was thinking about this a bit and I've run into a problem. I'm not sure there are many (if any) stopwords that phrases should be terminated on universally. Most stopwords appear inside of phrases, especially in titles, domain-specific terms, and common expressions. E.g., "Who's the Boss?", "ready to wear", "lay a foundation" (a meaningful legal phrase that means something very specific), "road to nowhere", "horse and cart", "in the black", etc. I can come up with more. Faced with this, I would just strip out stop words (which means phrases would still be formed, but without their associated stop words) and assume that things like your "common_the" example would be filtered out by having a low score on whatever concordance metric you choose to use.
Description
When using the `gensim.models.Phrases` model, there is an issue if you want to do stopword filtering. In particular, given a standard list of unigram stopwords, one must filter the stopwords before passing the token stream into the `Phrases` model. However, if you do this, the `Phrases` model may build ngrams that pair words that weren't actually adjacent. For instance, if building a trigram `Phrases` model (two models layered) on the sentence "new york is a state" and using a stopword list including the words "is" and "a", the sentence would be reduced to "new york state" and the trigram "new_york_state" may be extracted.

A simple fix for this is to allow an optional stopword list to be passed into the `Phrases` model constructor, replace stopword tokens with `None` before zipping up the token list with its +1 offsets, then discard token pairs involving a `None`.
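A minimal sketch of that fix in plain Python (note `Phrases` does not accept a stopword argument today; the function name and signature below are illustrative only):

```python
# Proposed fix, sketched: keep stopwords in the stream as positional
# placeholders (None) so offsets are preserved, then drop any candidate
# pair that touches a placeholder.

STOPWORDS = {"is", "a"}

def stopword_aware_bigrams(tokens, stopwords):
    """Pair only tokens that are truly adjacent in the original text."""
    masked = [None if t in stopwords else t for t in tokens]
    return [(a, b) for a, b in zip(masked, masked[1:])
            if a is not None and b is not None]

pairs = stopword_aware_bigrams("new york is a state".split(), STOPWORDS)
# pairs == [('new', 'york')] -- "york"/"state" are never paired, so the
# bogus trigram "new_york_state" can't be formed by a second pass.
```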
Versions
This feature should be available to all versions.