Common contracted forms are missing from the English stop word list #22
Comments
This issue can be solved by appending the following list of words to the English stop word list:
Unfortunately, I don't know how to contribute data changes to this project; opening a PR for a zip file feels a bit strange.
Thanks @DavidNemeskey, and sorry for the long delay.
Why is 'ma' on the list? I tried searching for contractions with 'ma' and only came up with 'ma'am'.
@aellenhicks also in "gran'ma", "Im'ma", "I'ma", I think.
Thanks!
Why would "won" be part of the English stop word list? Splitting "won't" into "won" and "t" seems like the wrong way to handle it.
@tenstriker I completely agree, "won" is a meaningful word; I should not have added it to the list. Maybe instead of a stop word list, an n-gram-based detection would be better, but I don't know if NLTK has that.
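One way to read the n-gram idea is to rejoin the pieces of a contraction before filtering, so that "won" is only dropped when it is actually part of "won't". The sketch below is a minimal illustration under that assumption; `rejoin_contractions` and `CONTRACTION_STOP_WORDS` are hypothetical names, not part of NLTK, and the input is assumed to be a token list in which the apostrophe has been split off as its own token.

```python
def rejoin_contractions(tokens):
    """Merge (left, "'", right) token triples back into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        is_triple = (
            i + 2 < len(tokens)
            and tokens[i + 1] == "'"
            and tokens[i].isalpha()
            and tokens[i + 2].isalpha()
        )
        if is_triple:
            merged.append(tokens[i] + "'" + tokens[i + 2])
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical stop list that holds whole contractions instead of fragments.
CONTRACTION_STOP_WORDS = {"won't", "don't", "can't"}

tokens = ["i", "won", "'", "t", "say", "who", "won"]
kept = [t for t in rejoin_contractions(tokens) if t not in CONTRACTION_STOP_WORDS]
print(kept)  # ['i', 'say', 'who', 'won']
```

With this approach the standalone "won" at the end survives, while the "won" that belongs to "won't" is removed together with its "t".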
While the list contains "s" and "t" (most likely because they can occur after an apostrophe as part of a contraction in e.g. dog's and can't), other common forms are missing, i.e. "d" as in she'd, "ll" as in we'll, "m" as in I'm, "o" as in o'clock, "re" as in you're, "ve" as in they've, and "y" as in y'all.
Also missing are the parts of these contractions that fall to the left of the apostrophe, e.g. "ain" (but "don" is there).
Of course, the lack of these forms could be justified by pointing out that if the tokenizer does not split at apostrophes, then these forms will not occur in the tokenized text. However, that is a strong assumption, especially taking into account that nltk's own Punkt tokenizer, for instance, does split at the apostrophes. Also, some of the contractions seem to be handled (don't, can't, the possessive "s"), so it does not make sense to not include the rest.
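The effect described above can be reproduced with a small sketch. It assumes a tokenizer that splits at apostrophes, emulated here with the standard library using the pattern `\w+|[^\w\s]+` (the one NLTK documents for its `WordPunctTokenizer`), so that it runs without NLTK installed. `STOP_WORDS` is a hypothetical excerpt limited to a few pronouns plus the fragments the issue says are present; it is not the real list.

```python
import re

# Splits text into runs of word characters and runs of punctuation,
# which separates contractions at the apostrophe.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical excerpt of the stop word list (not the full NLTK list).
STOP_WORDS = {"s", "t", "don", "i", "we", "you", "they", "she"}


def leftover_fragments(text):
    """Return alphabetic tokens that survive stop word filtering."""
    tokens = WORD_PUNCT.findall(text.lower())
    return [tok for tok in tokens if tok.isalpha() and tok not in STOP_WORDS]


sample = "we'll they've she'd I'm you're ain't don't"
print(leftover_fragments(sample))  # ['ll', 've', 'd', 'm', 're', 'ain']
```

Both halves of don't are filtered out, while ain't leaves "ain" behind and the right-hand fragments of the other contractions pass through untouched.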