Adding common-words to Phrases #1258
Sounds useful, so I would welcome it as an option. If adding it to the existing class could be a small change, and it leaves the behavior of 'classic' mode unchanged when not activated, that seems an OK way to add it. But I know there may be some other efforts in progress to optimize (or Cythonize) this class. Looking at the description at the ElasticSearch link, I wonder:
It is needed functionality, and a pure Python implementation is a good place to start. Please make it optional though.
Hello, thanks for the comments. I will begin by providing a pure Python stand-alone implementation. @gojomo, +1 to using common_words as a name.
Per the example on the elasticsearch page (about "the quick brown is a fox"), I would expect the input:
["we", "provide", "car", "with", "driver"]
…to yield the output…
["we", "provide", "car", "car_with", "with", "with_driver", "driver"]
That might be ideal for search-indexing, and for some gensim users, but would be unexpected (with unclear implications) for something like Word2Vec neighboring-word context windows.
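To make that expected behaviour concrete, here is a rough standalone Python sketch of the elasticsearch-style expansion described above (the helper name common_grams and its signature are assumptions for illustration, not gensim or elasticsearch code):

```python
def common_grams(tokens, common_words, delimiter="_"):
    """Emit every unigram, plus a joined bigram wherever a common word
    touches its neighbour, mirroring elasticsearch's common_grams filter."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok in common_words or nxt in common_words):
            out.append(tok + delimiter + nxt)
    return out

print(common_grams(["we", "provide", "car", "with", "driver"], {"with"}))
# ['we', 'provide', 'car', 'car_with', 'with', 'with_driver', 'driver']
```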
@gojomo - to be clear, I only drew inspiration from the common-words filter; what I have is an adaptation (yielding only car_with_driver).
Despite the difficulties for later order-sensitive windows, the elasticsearch approach seems potentially more valuable to me, in that all possibilities are generated, and then only some might survive a later frequency- or salience-check. Always combining the common word with both sides seems likely to create overlong phrases. For example, what would (or should) happen in longer common-uncommon-common-uncommon-etc. patterns? For example, assuming each of the
...become just the 3 tokens...
?
What about bigrams where one of the words is a stopword but the pair is actually a phrase? This happens in legal documents, e.g. "Side A" or "Exhibit A". Would the CommonGrams Phraser help with that case?
I have a proof of concept of the Phrases class managing stop words, but before opening a pull request, I would be glad to know whether there is interest and how best to integrate it.
That is, currently, if you are trying to reveal ngrams like "car with driver" and "car without driver", you have two options: either remove stop words before processing, in which case you will only find "car driver", or keep them, in which case you will find neither form (partly because they are three words long, but also because the high frequency of "with" prevents them from being scored correctly).
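As a minimal sketch of the first workaround (toy corpus; the min_count/threshold values are purely illustrative), both expressions collapse to the same bigram once stop words are stripped:

```python
from gensim.models.phrases import Phrases

# Toy corpus; parameter values below are illustrative, not recommendations.
sentences = [["we", "provide", "car", "with", "driver"],
             ["you", "can", "rent", "a", "car", "without", "driver"],
             ["car", "with", "driver", "available"]]
stopwords = {"with", "without", "a", "can"}

# Workaround 1: strip stop words before training Phrases.
# "car with driver" and "car without driver" become indistinguishable:
# both can only surface as the bigram "car_driver".
cleaned = [[w for w in s if w not in stopwords] for s in sentences]
phrases = Phrases(cleaned, min_count=1, threshold=0.1)
print(phrases[cleaned[0]])  # something like ['we', 'provide', 'car_driver']
```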
Taking inspiration from elasticsearch and its common grams filter, I have an implementation which can handle stop words when finding those expressions. It does this by registering "car_with_driver" in the vocab instead of "car_with", and taking that entry into account when tokenizing phrases. I've made a gist of a draft implementation (not all functions are implemented yet).
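The gist is not reproduced here; as a rough standalone sketch of the idea only (the function name, the phrase_vocab lookup, and the greedy single-pass strategy are assumptions for illustration, not the gist's actual API), the tokenizing pass could bridge a run of common words between two content words and emit the whole span as one token:

```python
def bridge_common_words(tokens, common_words, phrase_vocab, delimiter="_"):
    """Greedy one-pass sketch: when content-word -> common-word(s) -> content-word
    forms a span registered in phrase_vocab (e.g. 'car_with_driver'),
    emit it as a single token; otherwise emit tokens unchanged."""
    out = []
    i = 0
    while i < len(tokens):
        if tokens[i] not in common_words:
            # Look ahead over any run of common words to the next content word.
            j = i + 1
            while j < len(tokens) and tokens[j] in common_words:
                j += 1
            if i + 1 < j < len(tokens):
                candidate = delimiter.join(tokens[i:j + 1])
                if candidate in phrase_vocab:
                    out.append(candidate)
                    i = j + 1
                    continue
        out.append(tokens[i])
        i += 1
    return out

# phrase_vocab would be learned from corpus counts; hard-coded here for the demo.
vocab = {"car_with_driver", "car_without_driver"}
print(bridge_common_words(["we", "provide", "car", "with", "driver"],
                          {"with", "without"}, vocab))
# ['we', 'provide', 'car_with_driver']
```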
If there is interest, I will open a PR.