
Make Phrases model stopword-aware to prevent non-adjacent pairings #1506

Closed
macks22 opened this issue Jul 25, 2017 · 5 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@macks22
Contributor

macks22 commented Jul 25, 2017

Description

When using the gensim.models.Phrases model, stopword filtering causes a subtle problem. Given a standard list of unigram stopwords, one must filter them out before passing the token stream into the Phrases model. But once that is done, the Phrases model may build ngrams that pair words that were never actually adjacent. For instance, building a trigram Phrases model (two models layered) on the sentence "new york is a state" with a stopword list containing "is" and "a" reduces the sentence to "new york state", from which the spurious trigram "new_york_state" may be extracted.

A simple fix is to accept an optional stopword list in the Phrases constructor, replace stopword tokens with None before zipping the token list with its +1 offset, and then discard any token pair involving a None.
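The None-replacement step could be sketched roughly like this (a minimal illustration of the idea, not gensim's actual implementation; the function name is made up):

```python
def candidate_bigrams(tokens, stopwords):
    """Yield (w1, w2) pairs of genuinely adjacent non-stopword tokens.

    Stopword positions are kept as None placeholders, so two words
    separated by a stopword are never paired with each other.
    """
    marked = [None if t in stopwords else t for t in tokens]
    for a, b in zip(marked, marked[1:]):
        if a is not None and b is not None:
            yield (a, b)

print(list(candidate_bigrams("new york is a state".split(), {"is", "a"})))
# [('new', 'york')] -- "york" and "state" are never paired
```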

Versions

This feature should be available to all versions.

@michaelwsherman
Contributor

I disagree that this should be part of Phrases. But I do think there is value in gensim handling stop word removal (and other common text preprocessing tasks) with a bit more wrapping.

There's a risk of feature creep within Phrases here: if Phrases does stopword removal, shouldn't it also support removal of symbols, conversion of all text to lowercase, removal of possessives, synonym substitution, etc.? All are very standard text preprocessing steps (and I'd argue that symbol removal and case normalization are done more often than stopword removal before running NLP algorithms, so those would be more natural starting points).

A better approach might be a separate TextPreprocess class that takes raw text and outputs lists of tokens via a generator, making it naturally compatible with other gensim classes and methods (similar to the Phraser class). This would let you just call your preprocessor and feed the output into whatever gensim model you want:

preprocessor = TextPreprocess(parameters)
bigram = Phrases(preprocessor[sentences])
phraser = Phraser(bigram)
nice_text = phraser[preprocessor[sentences]]
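A minimal sketch of what such a class might look like (the TextPreprocess name and its rules are hypothetical here, not an existing gensim API):

```python
class TextPreprocess:
    """Hypothetical streaming preprocessor: applies token-level rules lazily."""

    def __init__(self, lowercase=True, stopwords=frozenset()):
        self.lowercase = lowercase
        self.stopwords = set(stopwords)

    def __getitem__(self, sentences):
        # A generator keeps this compatible with gensim's streaming style,
        # mirroring how Phraser's __getitem__ works on token streams.
        for sent in sentences:
            tokens = [t.lower() for t in sent] if self.lowercase else list(sent)
            yield [t for t in tokens if t not in self.stopwords]

preprocessor = TextPreprocess(stopwords={"the"})
print(list(preprocessor[[["The", "big", "Dog"]]]))
# [['big', 'dog']]
```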

This design allows saving preprocessor rules/schemes, which lets you easily test the efficacy of different preprocessing rules on your downstream task (preprocessing is VERY important for some of the algorithms in gensim, even if preprocessing isn't discussed much in the literature). Running a different preprocessing routine would be as simple as swapping one instance of the TextPreprocess class for another when you build your models. You could also easily perform separate preprocessing rules before and after you find (and replace) ngrams.

There's also maybe a larger discussion here about support in gensim for text processing pipelines. Would this even be in scope for gensim, which is really focused on modeling? Also, there is already a set of preprocessing methods in parsing.preprocessing (which are barely used); maybe these should just be "upgraded" to work on lists of lists of tokens rather than raw text?

@macks22
Contributor Author

macks22 commented Sep 7, 2017

@michaelwsherman I agree on all counts, and I like the idea of packaging up preprocessing in a single class like the TextPreprocess class you discuss. That is a nice alternative to what I've started to contribute to the textcorpus module, which currently combines the responsibilities of loading text files with the responsibilities of preprocessing. It would be good to split those responsibilities apart into something like a TextCorpusLoader and a TextPreprocessor.

However, for this particular ticket, I'm not proposing that Phrases remove stopwords from the underlying corpus. I'm simply proposing to add an optional set of words which it will not form phrases from. The motivation is simple:

no prior preprocessing

In [22]: sentences = ["common the phrase".split() for _ in range(1000)]

In [23]: first_sentence = sentences[0]

In [24]: phrases = gensim.models.Phrases(sentences, scoring='npmi', threshold=0)

In [25]: phrases[first_sentence]
Out[25]: [u'common_the', u'phrase']

with prior preprocessing

In [26]: preprocessed = [[token for token in sent if token not in {'the'}] for sent in sentences]

In [27]: phrases = gensim.models.Phrases(preprocessed, scoring='npmi', threshold=0)

In [28]: phrases[preprocessed[0]]
Out[28]: [u'common_phrase']

As you can see, if I first remove stopwords, I end up with incorrect results. Words that were not actually sequential in my text are seen as sequential because the stopword deletion alters the original word offsets in my input documents. What I'm proposing is not to remove stopwords from the underlying text when training a Phrases model; I'm simply proposing an option to instruct Phrases not to form words from some set into phrases. This way, you could do something like:

sentences = ["common the phrase".split() for _ in range(1000)]
phrases = gensim.models.Phrases(sentences, scoring='npmi', threshold=0, ignore_words={'the'})
phrases[sentences[0]]  # [u'common', u'phrase']

@michaelwsherman
Contributor

@macks22 I apologize! Upon re-reading your original comment I realize I read it incorrectly.

I see your point, and I agree with you. This is a good idea.

@menshikh-iv menshikh-iv added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 2, 2017
@michaelwsherman
Contributor

Hey, I was thinking about this a bit and I see a problem.

I'm not sure there are many (if any) stopwords that phrases should be terminated on universally. Most stopwords appear inside phrases, especially in titles, domain-specific terms, and common expressions. E.g., "Who's the Boss?", "ready to wear", "lay a foundation" (a legal phrase that means something very specific), "road to nowhere", "horse and cart", "in the black", etc. I can come up with more.

Faced with this, I would just strip out stopwords (which means phrases would still be formed, but without their associated stopwords) and assume that things like your "common_the" example would be filtered out by scoring low on whichever association metric you choose to use.
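For reference, the NPMI score behind scoring='npmi' is commonly defined as ln(p(a,b) / (p(a) p(b))) / −ln p(a,b), ranging from −1 to 1, with higher values meaning stronger association. A quick sketch with made-up counts (not gensim's internal implementation) shows how incidental pairings tend to score low:

```python
import math

def npmi(count_ab, count_a, count_b, corpus_size):
    """Normalized pointwise mutual information in [-1, 1]."""
    p_a = count_a / corpus_size
    p_b = count_b / corpus_size
    p_ab = count_ab / corpus_size
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# A pair that always co-occurs scores at the maximum:
print(round(npmi(50, 50, 50, 10000), 3))   # 1.0
# A pair that rarely co-occurs relative to its parts scores low:
print(round(npmi(2, 500, 400, 10000), 3))  # -0.27
```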

@piskvorky
Owner

piskvorky commented Oct 10, 2020

Closing because this was implemented a long time ago, and recently reimplemented & further improved in #2976 and #2979.
