profanity filtering doesn't work for combined words like "f*ckme" or "suckmyd*ck" #18
Unfortunately, this is a well-known problem that extends beyond just this library. Now, there are some improvements we could make to minimize the issue. Currently, if you include the phrase "suck my d*ck" in a word list, the phrase "suckmyd*ck" won't be censored. It should be fairly straightforward for us to include variations of censored phrases without whitespace in our censor.
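That simpler variant could be sketched like this (placeholder phrases stand in for real word-list entries, and `build_censor_set` is a hypothetical helper, not the library's actual API):

```python
def build_censor_set(phrases):
    """For every multi-word phrase, also register the variant with all
    whitespace removed, so "foobarbaz" is caught alongside "foo bar baz"."""
    censor = set()
    for phrase in phrases:
        censor.add(phrase)
        if " " in phrase:
            censor.add(phrase.replace(" ", ""))
    return censor

# Placeholder phrases stand in for real word-list entries.
print(build_censor_set(["foo bar baz", "qux"]))
```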
Running through some math, I see a potential memory problem. Suppose we want to include every whitespace variation of the phrase "suck my d*ck", such that "suckmy d*ck" as well as "suck myd*ck" are also censored. This would essentially require adding a new word for each variation. With only two whitespaces, that amounts to a total of 4 words. But the more whitespace a phrase has, the more words we need to add: each whitespace can independently be kept or dropped, so the number of variations grows exponentially, 2^n for a phrase with n whitespaces.

Now realistically, we could just disallow phrases with more than, say, 8 whitespaces (2^8 = 256 word variations). Personally, I haven't come across an applicable phrase with more than 5 whitespaces yet. This would keep memory consumption per word bounded.

If we instead include only two variations, that with and that without whitespace (i.e. "suck my d*ck" and "suckmyd*ck"), we'd have no memory concerns to worry about and users could use phrases as long as they'd like. However, again, I see no reason to believe this wouldn't have the potential to manifest the Scunthorpe problem. So in my mind, there are only two solutions that allow us to avoid false positives:
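To make the growth concrete: each whitespace can independently be kept or dropped, so a phrase with n whitespaces has 2^n variations (4 for two spaces, 256 for eight). A sketch of generating them (`whitespace_variants` is an illustrative helper, not part of the library):

```python
from itertools import product

def whitespace_variants(phrase):
    """Generate every keep/drop combination of the spaces in `phrase`.
    With n spaces this yields 2**n variants."""
    parts = phrase.split(" ")
    variants = []
    # Each separator slot is either a space or nothing.
    for seps in product([" ", ""], repeat=len(parts) - 1):
        variant = parts[0]
        for sep, part in zip(seps, parts[1:]):
            variant += sep + part
        variants.append(variant)
    return variants

vs = whitespace_variants("foo bar baz")
print(len(vs), vs)  # 2 spaces -> 4 variants
```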
I see. Can't we use regex to replace all spaces in the text, put each space-separated word as an element in a list, and check whether any element in that list matches a word in profanity_wordlist.txt?
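One way to sketch that idea without regex (my interpretation of the proposal; the function name and placeholder wordlist are illustrative, not the library's API):

```python
def contains_profanity(text, wordlist):
    """Check the text word-by-word, then check the text with all
    whitespace collapsed against the space-stripped multi-word phrases."""
    words = set(wordlist)
    tokens = text.lower().split()
    # Direct per-word match.
    if any(tok in words for tok in tokens):
        return True
    # Collapse whitespace and look for stripped phrases as substrings.
    # Note: substring matching like this risks Scunthorpe-style false positives.
    collapsed = "".join(tokens)
    return any(w.replace(" ", "") in collapsed for w in words if " " in w)

print(contains_profanity("he said foobarbaz loudly", ["foo bar baz"]))  # True
```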
As mentioned in #14, regex is extremely slow here, and its runtime grows rapidly with the length of the text.
@MissJuliaRobot, it's possible we could do what you're thinking without regex. Could you elaborate on the method you're proposing? I'm not sure if I understand completely. |