Norwegian stopwords contain errors #239

Closed
omega opened this issue Jan 15, 2021 · 1 comment

Comments

@omega

omega commented Jan 15, 2021

I was looking at Sonic and noticed the recent merge of Norwegian stopwords (which is great). Looking at #236, though, I couldn't help noticing that some of the words seem to have wrong or garbled encoding, resulting in words that are not Norwegian stopwords. The list seems to originate from the stopwords-iso project, so perhaps this should be raised there as well.

    "forsûke",
    "fûr",
    "fûrst",
    "gjûre", 
    "vöre",
    "vört",
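
One way to find entries like these across the whole list is to flag any word that uses a letter outside the Norwegian alphabet (a-z plus æ, ø, å). The sketch below is only a review aid under that assumption: the function name `suspicious_words` and the sample list are made up for illustration, and a few legitimate Norwegian words use accented letters such as é, so flagged words still need a manual look.

```python
import unicodedata

# Letters expected in Norwegian words: a-z plus æ, ø, å (assumption for this check).
NORWEGIAN_LETTERS = set("abcdefghijklmnopqrstuvwxyzæøå")


def suspicious_words(words):
    """Return (word, unexpected_chars) pairs for words that use
    characters outside the expected Norwegian alphabet."""
    flagged = []
    for word in words:
        # Normalize so that "o" + combining diaeresis compares equal to "ö".
        normalized = unicodedata.normalize("NFC", word).lower()
        unexpected = sorted(set(normalized) - NORWEGIAN_LETTERS)
        if unexpected:
            flagged.append((word, unexpected))
    return flagged


# The six words quoted above, plus two ordinary stopwords for contrast.
sample = ["forsûke", "fûr", "fûrst", "gjûre", "vöre", "vört", "og", "være"]

for word, chars in suspicious_words(sample):
    print(word, chars)
```

This only reports which characters look out of place; it does not guess what the intended word was.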

The list also contains some words that I don't think should be stopwords (tilstand is one such word).

I am no licensing expert, but at a quick glance, the Norwegian stopwords from nltk (a Python project) seem to be a slightly smaller, but better, stopword list for Norwegian. I am not sure if it's OK to just copy theirs, though.

So to sum up, would it be best to open a PR fixing the words that are wrong in the list here, or better to open a PR using the stopwords listed in nltk?

@valeriansaliou
Owner

Hello there! Thanks for that report. You may open a PR there 👍
