Replies: 1 comment 1 reply
-
You'd probably be able to benefit from the JSON log output features if you weren't aware of them already; that's exactly what they're there for.
-
Hi all 👋
First of all, I just wanted to say thanks to all who have contributed and are contributing to this tool! It's been super useful!
I wanted to bring up the idea of accumulating a large dataset of moderated comments (each marked as either VALID, or as SCAM, SPAM, EXPLICIT, and so on). This dataset could then be used to train ML models for improved comment moderation.
Although the rule-based approach taken by this tool has worked really well, I feel there's room for improvement in catching newer types of spam comments and avoiding false positives (like those associated with link spam). That said, the rule-based approach could of course be used in conjunction with an ML model (or... used to train one).
Since I have a bit of experience with similar projects, I decided to start doing this myself, first using a variety of unsupervised techniques (mainly clustering using https://www.sbert.net/) and then applying a rule-based approach (similar to this repo). I then started moderating the comments, which was much easier because the clustering grouped a lot of similar spam comments together, meaning I could moderate between 100 and 1000 comments at a time.
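For anyone curious, the clustering step looked roughly like the sketch below. The model name, file/column names, and thresholds here are illustrative, not the exact values I used:

```python
# Rough sketch of the embedding + clustering step (illustrative parameters).
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Assumes a CSV with one comment per row in a "text" column
# ("raw_comments.csv" and the column name are just placeholders).
comments = pd.read_csv("raw_comments.csv")["text"].astype(str).tolist()

# Embed every comment with a small general-purpose SBERT model.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(comments, convert_to_tensor=True, show_progress_bar=True)

# Group highly similar comments so near-duplicate spam can be moderated in bulk.
clusters = util.community_detection(embeddings, threshold=0.75, min_community_size=10)

# Peek at the largest clusters to decide a label for the whole group at once.
for i, cluster in enumerate(clusters[:5]):
    print(f"Cluster {i}: {len(cluster)} comments")
    for idx in cluster[:3]:
        print("   ", comments[idx][:100])
```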
So far I've accumulated a total of ~4 million comments (from around 5,000 random videos), of which ~90,000 have been moderated (mostly spam comments).
Here is a sample of the dataset (~25 per category): comments.csv
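To show how a dataset like this could feed a model, here's a minimal baseline sketch. It assumes a CSV in the same shape as the sample above, with "text" and "label" columns (the column names are my assumption, and the ~25-per-category sample only illustrates the format, it's far too small to actually train on):

```python
# Minimal baseline: TF-IDF features + logistic regression over the moderation labels.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumed columns: "text" (comment body) and "label" (VALID / SCAM / SPAM / EXPLICIT / ...).
df = pd.read_csv("comments.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

# Per-category precision/recall gives a quick read on where rules and ML disagree.
print(classification_report(y_test, clf.predict(X_test)))
```

A stronger approach would be fine-tuning a transformer on the full moderated set, but a cheap baseline like this makes it easy to compare against the existing rule-based filters first.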
Not sure if this is something you'd want to consider, but given how many people use the tool, I imagine we could build up a large, moderated dataset very quickly.
Looking forward to hearing your responses :)