Common contracted forms are missing from the English stop word list #22

DavidNemeskey · 2015-02-10T13:39:17Z

While the list contains s and t (most likely because they can occur after an apostrophe as part of a contraction in e.g. dog's and can't), other common forms, i.e.

d as in she'd,
ll as in we'll,
m as in I'm,
o as in o'clock,
re as in you're,
ve as in they've,
y as in y'all
are missing.

Also missing are the parts of these contractions that fall to the left of the apostrophe, e.g. ain (but don is there).

Of course, the lack of these forms could be justified by pointing out that if the tokenizer does not split at apostrophes, then these forms will not occur in the tokenized text. However, that is a strong assumption, especially taking into account that nltk's own Punkt tokenizer, for instance, does split at the apostrophes. Also, some of the contractions seem to be handled already (don't, can't, the possessive s), so it does not make sense to leave out the rest.
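To make the problem concrete, here is a minimal sketch. It assumes a tokenizer that simply splits on every non-letter character; the `split_at_apostrophes` helper and the tiny `stop_words` set are hypothetical stand-ins for illustration, not NLTK's actual tokenizer or its full stop word list.

```python
import re

def split_at_apostrophes(text):
    # Hypothetical tokenizer: lowercases the text and keeps only runs of
    # letters, so "she'd" becomes ["she", "d"]. Not NLTK's implementation.
    return re.findall(r"[a-z]+", text.lower())

# Illustrative subset of a stop word list. Like the list discussed here, it
# contains "s", "t" and "don", but none of "d", "ll", "m", "re", "ve".
stop_words = {"i", "she", "we", "the", "s", "t", "don", "can"}

tokens = split_at_apostrophes("She'd say we'll go, I don't care")
leftover = [t for t in tokens if t not in stop_words]
print(leftover)  # ['d', 'say', 'll', 'go', 'care']
```

The fragments `d` and `ll` survive the filtering even though they carry no meaning on their own, while `don` and `t` are removed as expected.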


DavidNemeskey · 2015-02-10T13:44:56Z

This issue can be solved by appending the following list of words to the English stop word list:

d
ll
m
o
re
ve
y
ain
aren
couldn
didn
doesn
hadn
hasn
haven
isn
ma
mightn
mustn
needn
shan
shouldn
wasn
weren
won
wouldn
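Until the data file itself is updated, the list above could be applied on the user side. The following is only a sketch: `CONTRACTION_FRAGMENTS` and `extend_stop_words` are hypothetical names, and the base list is a small placeholder rather than the real NLTK data.

```python
# The 26 fragments proposed above, in the same order.
CONTRACTION_FRAGMENTS = (
    "d ll m o re ve y "
    "ain aren couldn didn doesn hadn hasn haven isn ma "
    "mightn mustn needn shan shouldn wasn weren won wouldn"
).split()

def extend_stop_words(base, additions=CONTRACTION_FRAGMENTS):
    # Return a new set; the base list is left untouched.
    return set(base) | set(additions)

# Placeholder base list. With NLTK installed and the stopwords corpus
# downloaded, this would typically be stopwords.words("english").
base = ["the", "a", "s", "t", "don"]
extended = extend_stop_words(base)

print(sorted(extended - set(base))[:5])
```

Returning a new set rather than mutating the input keeps the original list available for comparison.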

Unfortunately, I don't know how to contribute data changes to this project; opening a PR for a zip file feels a bit strange.

stevenbird · 2016-03-02T06:13:56Z

Thanks @DavidNemeskey, and sorry for the long delay.

aellenhicks · 2016-10-27T17:17:53Z

Why is 'ma' on the list? I tried searching for contractions with 'ma' and only came up with 'ma'am'.

tsolakghukasyan · 2016-10-27T17:48:55Z

@aellenhicks also in "gran'ma", "Im'ma", "I'ma", I think.

aellenhicks · 2016-10-27T17:59:55Z

Thanks!


tenstriker · 2016-11-09T19:14:29Z

Why would "won" be part of the English stop word list? Splitting "won't" into "won" and "t" seems like the wrong way to handle it.

DavidNemeskey · 2016-11-09T19:24:45Z

@tenstriker I completely agree, "won" is a meaningful word; I should not have added it to the list.

Maybe instead of a stop word list, n-gram-based detection would be better, but I don't know if NLTK has that.
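One possible reading of that idea, sketched under assumptions: keep the apostrophe as its own token and drop a fragment only when it appears in a (left, apostrophe, right) trigram. All names below (`LEFT_FRAGMENTS`, `RIGHT_FRAGMENTS`, `tokenize`, `drop_contraction_parts`) are hypothetical and the fragment sets are illustrative subsets; this is not an NLTK feature.

```python
import re

# Illustrative subsets; a real table would cover the full list of fragments.
LEFT_FRAGMENTS = {"won", "don", "ain", "isn", "couldn"}
RIGHT_FRAGMENTS = {"s", "t", "d", "ll", "m", "re", "ve"}

def tokenize(text):
    # Words and straight apostrophes become separate tokens.
    return re.findall(r"[a-z]+|'", text.lower())

def drop_contraction_parts(tokens):
    kept = []
    i = 0
    while i < len(tokens):
        # Trigram window: word, apostrophe, known right-hand fragment.
        if (i + 2 < len(tokens)
                and tokens[i + 1] == "'"
                and tokens[i + 2] in RIGHT_FRAGMENTS):
            # Drop the left part only if it is a contraction stem.
            if tokens[i] not in LEFT_FRAGMENTS:
                kept.append(tokens[i])
            i += 3  # skip the apostrophe and the right fragment
        else:
            if tokens[i] != "'":
                kept.append(tokens[i])
            i += 1
    return kept

result = drop_contraction_parts(tokenize("We won the game but they won't play"))
print(result)  # ['we', 'won', 'the', 'game', 'but', 'they', 'play']
```

Here the standalone "won" is kept while the "won" from "won't" is removed. The sketch only handles the straight apostrophe and leaves ordinary stop word removal to a separate step.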

stevenbird self-assigned this Feb 28, 2016

stevenbird added this to the 3.2 milestone Feb 28, 2016

stevenbird closed this as completed in d3abad8 Mar 2, 2016

burakkose mentioned this issue Mar 22, 2016

[SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover apache/spark#11871

Closed
