data frame columns #1

mczyzj · 2018-09-14T22:09:35Z

Do we also include words that have several meanings, like bitc*? Also, I think that for the Polish language it is quite important to include the different forms of particular words.

pdrhlik · 2018-09-15T06:34:45Z

I would include these for now. I have already done that. It's a more complicated contextual problem that I wouldn't try to solve right now. One option for the future would be to add a second column that just says whether the word also has a non-offensive meaning. Users could then filter the data frame down to the really offensive ones.
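A minimal sketch of how that filtering could look, assuming a logical column marks the words with a non-offensive meaning. The data frame name `profanities`, the column names `word` and `ambiguous`, and the placeholder words are all hypothetical, not the package's actual schema.

```r
# Hypothetical data frame: `word` and `ambiguous` are assumed column names
# used for illustration only, and the words are neutral placeholders.
profanities <- data.frame(
  word = c("alpha", "bravo", "charlie"),
  ambiguous = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the rows flagged as having no non-offensive meaning.
strictly_offensive <- profanities[!profanities$ambiguous, , drop = FALSE]
strictly_offensive$word
```

Keeping the flag as a plain logical column means users who want the full list can simply ignore it.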

As for the second part of your question: providing all the word forms is one option, but I don't think it's the right one. Our datasets would grow really large. The Czech language has the same problem: each of our nouns can have up to 14 different forms (7 singular, 7 plural). And this is probably the same for many other languages.

I think a better way would be to provide convenient wrappers for stemmers and/or lemmatisers. There are already some nice working R packages that do this, such as SnowballC or koRpus.
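Such a wrapper might look roughly like the sketch below, which compares stems instead of storing every inflected form. The function name `match_by_stem` and its arguments are hypothetical. The default stemmer calls `SnowballC::wordStem()`, so SnowballC must be installed to use that default, and whether a given language is covered would need checking with `SnowballC::getStemLanguages()`.

```r
# Hypothetical helper: `match_by_stem` and its argument names are illustrative.
# `stem_fun` defaults to SnowballC::wordStem (requires the SnowballC package);
# any function that maps a character vector to stems can be passed instead.
match_by_stem <- function(tokens, lexicon, language = "english",
                          stem_fun = function(x) {
                            SnowballC::wordStem(x, language = language)
                          }) {
  token_stems <- stem_fun(tolower(tokens))
  lexicon_stems <- unique(stem_fun(tolower(lexicon)))
  # TRUE for every token whose stem also appears among the lexicon stems.
  token_stems %in% lexicon_stems
}

# Toy stemmer for demonstration only: strips a single trailing "s".
strip_s <- function(x) sub("s$", "", x)
match_by_stem(c("Cats", "dog", "birds"), lexicon = "cat", stem_fun = strip_s)
```

Passing the stemmer as an argument keeps the datasets small and lets each language plug in whatever stemmer or lemmatiser suits it best.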

pdrhlik mentioned this issue Oct 2, 2018

Added french swear words used in Québec, Canada. #29

Merged
