
Make Phrases model stopword-aware to prevent non-adjacent pairings #1506

Closed
macks22 opened this issue Jul 25, 2017 · 5 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@macks22
Contributor

macks22 commented Jul 25, 2017

Description

When using the gensim.models.Phrases model, stopword filtering causes a subtle problem. Given a standard list of unigram stopwords, one must filter them out before passing the token stream into the Phrases model. But once that is done, the Phrases model may build ngrams that pair words that were never actually adjacent. For instance, building a trigram Phrases model (two models layered) on the sentence "new york is a state" with a stopword list containing "is" and "a" reduces the sentence to "new york state", from which the spurious trigram "new_york_state" may be extracted.

A simple fix is to accept an optional stopword list in the Phrases constructor, replace stopword tokens with None before zipping the token list with its +1 offset, and then discard any token pair involving a None.
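The None-replacement step could be sketched roughly like this (a minimal illustration of the idea, not gensim's actual implementation; the function name is made up):

```python
def candidate_bigrams(tokens, stopwords):
    """Yield (w1, w2) pairs of genuinely adjacent non-stopword tokens.

    Stopword positions are kept as None placeholders, so two words
    separated by a stopword are never paired with each other.
    """
    marked = [None if t in stopwords else t for t in tokens]
    for a, b in zip(marked, marked[1:]):
        if a is not None and b is not None:
            yield (a, b)

print(list(candidate_bigrams("new york is a state".split(), {"is", "a"})))
# [('new', 'york')] -- "york" and "state" are never paired
```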

Versions

This feature should be available to all versions.

@michaelwsherman
Contributor

I disagree that this should be part of Phrases. But I do think there is value in gensim handling stop word removal (and other common text preprocessing tasks) with a bit more wrapping.

There's a risk of feature creep within Phrases here: if Phrases does stopword removal, shouldn't it also support removal of symbols, conversion of all text to lowercase, removal of possessives, synonym substitution, etc.? All are very standard text preprocessing steps (and I'd argue that symbol removal and case normalization are done more often than stopword removal before running NLP algorithms, so those would be more natural starting points).

A better approach might be a separate TextPreprocess class that takes raw text and outputs lists of tokens via a generator, making it naturally compatible with other gensim classes and methods (similar to the Phraser class). This would let you just call your preprocessor and feed the output into whatever gensim model you want:

preprocessor = TextPreprocess(parameters)
bigram = Phrases(preprocessor[sentences])
phraser = Phraser(bigram)
nice_text = phraser[preprocessor[sentences]]
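A minimal sketch of what such a class might look like (the TextPreprocess name and its rules are hypothetical here, not an existing gensim API):

```python
class TextPreprocess:
    """Hypothetical streaming preprocessor: applies token-level rules lazily."""

    def __init__(self, lowercase=True, stopwords=frozenset()):
        self.lowercase = lowercase
        self.stopwords = set(stopwords)

    def __getitem__(self, sentences):
        # A generator keeps this compatible with gensim's streaming style,
        # mirroring how Phraser's __getitem__ works on token streams.
        for sent in sentences:
            tokens = [t.lower() for t in sent] if self.lowercase else list(sent)
            yield [t for t in tokens if t not in self.stopwords]

preprocessor = TextPreprocess(stopwords={"the"})
print(list(preprocessor[[["The", "big", "Dog"]]]))
# [['big', 'dog']]
```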

This design allows saving preprocessor rules/schemes, which lets you easily test the efficacy of different preprocessing rules on your downstream task (preprocessing is VERY important for some of the algorithms in gensim, even if preprocessing isn't discussed much in the literature). Running a different preprocessing routine would be as simple as swapping one instance of the TextPreprocess class for another when you build your models. You could also easily perform separate preprocessing rules before and after you find (and replace) ngrams.

There's also maybe a larger discussion here about support in gensim for text processing pipelines. Would this even be in scope for gensim, which is really focused on modeling? Also, there is already a set of preprocessing methods in parsing.preprocessing (which are barely used); maybe these should just be "upgraded" to work on lists of lists of tokens rather than raw text?

@macks22
Contributor Author

macks22 commented Sep 7, 2017

@michaelwsherman I agree on all counts, and I like the idea of packaging up preprocessing in a single class like the TextPreprocess class you discuss. That is a nice alternative to what I've started to contribute to the textcorpus module, which currently combines the responsibilities of loading text files with the responsibilities of preprocessing. It would be good to split those responsibilities apart into something like a TextCorpusLoader and a TextPreprocessor.

However, for this particular ticket, I'm not proposing that Phrases remove stopwords from the underlying corpus. I'm simply proposing to add an optional set of words which it will not form phrases from. The motivation is simple:

no prior preprocessing

In [22]: sentences = ["common the phrase".split() for _ in range(1000)]

In [23]: first_sentence = sentences[0]

In [24]: phrases = gensim.models.Phrases(sentences, scoring='npmi', threshold=0)

In [25]: phrases[first_sentence]
Out[25]: [u'common_the', u'phrase']

with prior preprocessing

In [26]: preprocessed = [[token for token in sent if token not in {'the'}] for sent in sentences]

In [27]: phrases = gensim.models.Phrases(preprocessed, scoring='npmi', threshold=0)

In [28]: phrases[preprocessed[0]]
Out[28]: [u'common_phrase']

As you can see, if I first remove stopwords, I end up with incorrect results. Words that were not actually sequential in my text are seen as sequential because the stopword deletion alters the original word offsets in my input documents. What I'm proposing is not to remove stopwords from the underlying text when training a Phrases model; I'm simply proposing an option to instruct Phrases not to form words from some set into phrases. This way, you could do something like:

sentences = ["common the phrase".split() for _ in range(1000)]
phrases = gensim.models.Phrases(sentences, scoring='npmi', threshold=0, ignore_words={'the'})
phrases[sentences[0]]  # [u'common', u'phrase']

@michaelwsherman
Contributor

@macks22 I apologize! Upon re-reading your original comment I realize I read it incorrectly.

I see your point, and I agree with you. This is a good idea.

@menshikh-iv menshikh-iv added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 2, 2017
@michaelwsherman
Contributor

Hey, I was thinking about this a bit and I see a problem.

I'm not sure there are many (if any) stopwords that phrases should be terminated on universally. Most stopwords appear inside phrases, especially in titles, domain-specific terms, and common expressions. E.g., "Who's the Boss?", "ready to wear", "lay a foundation" (a legal phrase that means something very specific), "road to nowhere", "horse and cart", "in the black", etc. I can come up with more.

Faced with this, I would just strip out stopwords (which means phrases would still be formed, but without their associated stopwords) and assume that things like your "common_the" example would be filtered out by scoring low on whichever association metric you choose to use.
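For reference, the NPMI score behind scoring='npmi' is commonly defined as ln(p(a,b) / (p(a) p(b))) / −ln p(a,b), ranging from −1 to 1, with higher values meaning stronger association. A quick sketch with made-up counts (not gensim's internal implementation) shows how incidental pairings tend to score low:

```python
import math

def npmi(count_ab, count_a, count_b, corpus_size):
    """Normalized pointwise mutual information in [-1, 1]."""
    p_a = count_a / corpus_size
    p_b = count_b / corpus_size
    p_ab = count_ab / corpus_size
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# A pair that always co-occurs scores at the maximum:
print(round(npmi(50, 50, 50, 10000), 3))   # 1.0
# A pair that rarely co-occurs relative to its parts scores low:
print(round(npmi(2, 500, 400, 10000), 3))  # -0.27
```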

@piskvorky
Owner

piskvorky commented Oct 10, 2020

Closing because this was implemented a long time ago, and recently reimplemented & further improved in #2976 and #2979.
