data frame columns #1

Open
mczyzj opened this issue Sep 14, 2018 · 1 comment

Comments

@mczyzj

mczyzj commented Sep 14, 2018

Do we also include words that have several meanings, like bitc*? Also, I think that for the Polish language it is quite important to include the different forms of particular words.

@pdrhlik
Owner

pdrhlik commented Sep 15, 2018

I would include these for now, and I have already done so. It's a more complicated contextual problem that I wouldn't try to solve right now. One option for the future would be to add a second column that indicates whether the word also has a non-offensive meaning. Users could then filter the data frame down to just the unambiguously offensive words.
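To make the second-column idea concrete, here is a minimal sketch in Python rather than R, using only the standard library. The column name `has_benign_meaning`, the helper `strictly_offensive`, and the placeholder words are all assumptions made up for illustration. None of them come from the actual dataset.

```python
# Hypothetical rows: the words are placeholders, not entries from the real dataset.
# "has_benign_meaning" is an assumed name for the proposed second column.
profanity = [
    {"word": "word_a", "has_benign_meaning": True},
    {"word": "word_b", "has_benign_meaning": False},
    {"word": "word_c", "has_benign_meaning": False},
]


def strictly_offensive(rows):
    """Keep only the rows whose word has no non-offensive meaning."""
    return [row for row in rows if not row["has_benign_meaning"]]


strict_words = [row["word"] for row in strictly_offensive(profanity)]
print(strict_words)  # ['word_b', 'word_c']
```

In a data frame the same filter would be a single logical subset on that column, so the extra column costs little and leaves the default behaviour (everything included) unchanged.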

As for the second part of your question: providing all the word forms is one option, but I don't think it's the right one, because our datasets would grow really large. The Czech language has the same problem: each of our nouns can have up to 14 different forms (7 singular, 7 plural). The same is probably true of many other languages.

I think a better way would be to provide convenient wrappers for stemmers and/or lemmatisers. There are already some nice working R packages that do this, such as SnowballC or koRpus.
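A rough sketch of what such a wrapper could look like, again in Python with only the standard library. The function names `toy_stem` and `find_listed` are hypothetical, and `toy_stem` is a deliberately naive stand-in for a real stemmer or lemmatiser. It is not how SnowballC or koRpus work. The point is only that the word list stores base forms and both sides are normalised before comparison.

```python
from typing import Callable, Iterable, List


def toy_stem(token: str) -> str:
    """Toy stand-in for a real stemmer: lowercase, then strip one common English suffix."""
    token = token.lower()
    for suffix in ("ing", "es", "s"):
        # Only strip if at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def find_listed(
    tokens: Iterable[str],
    base_forms: Iterable[str],
    stem: Callable[[str], str] = toy_stem,
) -> List[str]:
    """Return the tokens whose stem matches the stem of any listed base form."""
    listed_stems = {stem(word) for word in base_forms}
    return [token for token in tokens if stem(token) in listed_stems]


# Mild placeholder word; the list holds one base form instead of every inflection.
base_forms = ["darn"]
tokens = ["Darns", "darning", "garden", "darn"]
print(find_listed(tokens, base_forms))  # ['Darns', 'darning', 'darn']
```

Because the stemmer is passed in as a parameter, a language-specific stemmer or lemmatiser could be swapped in without changing the stored word list.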
