Replies: 1 comment 1 reply
-
You'd probably be able to benefit from the JSON log output features if you weren't aware of them already; that's exactly what they're there for.
-
Hi all 👋
First of all, I just wanted to say thanks to all who have contributed and are contributing to this tool! It's been super useful!
I wanted to bring up the idea of accumulating a large dataset of moderated comments (each marked as either VALID, or as SCAM, SPAM, EXPLICIT, and so on). This dataset could then be used to train ML models for improved comment moderation.
Although the rule-based approach taken by this tool has worked really well, I feel there's room for improvement in catching newer types of spam comments and avoiding false positives (like those associated with link spam). That said, the rule-based approach could of course be used in conjunction with an ML model (or... used to train one).
Since I have a bit of experience with similar projects, I decided to start doing this myself, first using a variety of unsupervised techniques (mainly clustering using https://www.sbert.net/) and then applying a rule-based approach (similar to this repo). I then started moderating the comments, which was much easier because the clustering grouped a lot of similar spam comments together, meaning I could moderate between 100 and 1000 comments at a time.
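For anyone curious, the clustering step looked roughly like the sketch below. The model name, file/column names, and thresholds here are illustrative, not the exact values I used:

```python
# Rough sketch of the embedding + clustering step (illustrative parameters).
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Assumes a CSV with one comment per row in a "text" column
# ("raw_comments.csv" and the column name are just placeholders).
comments = pd.read_csv("raw_comments.csv")["text"].astype(str).tolist()

# Embed every comment with a small general-purpose SBERT model.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(comments, convert_to_tensor=True, show_progress_bar=True)

# Group highly similar comments so near-duplicate spam can be moderated in bulk.
clusters = util.community_detection(embeddings, threshold=0.75, min_community_size=10)

# Peek at the largest clusters to decide a label for the whole group at once.
for i, cluster in enumerate(clusters[:5]):
    print(f"Cluster {i}: {len(cluster)} comments")
    for idx in cluster[:3]:
        print("   ", comments[idx][:100])
```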
So far I've accumulated a total of ~4 million comments (from around 5,000 random videos), of which ~90,000 have been moderated (mostly spam comments).
Here is a sample of the dataset (~25 per category): comments.csv
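To show how a dataset like this could feed a model, here's a minimal baseline sketch. It assumes a CSV in the same shape as the sample above, with "text" and "label" columns (the column names are my assumption, and the ~25-per-category sample only illustrates the format, it's far too small to actually train on):

```python
# Minimal baseline: TF-IDF features + logistic regression over the moderation labels.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Assumed columns: "text" (comment body) and "label" (VALID / SCAM / SPAM / EXPLICIT / ...).
df = pd.read_csv("comments.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

# Per-category precision/recall gives a quick read on where rules and ML disagree.
print(classification_report(y_test, clf.predict(X_test)))
```

A stronger approach would be fine-tuning a transformer on the full moderated set, but a cheap baseline like this makes it easy to compare against the existing rule-based filters first.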
Not sure if this is something you'd want to consider, but given how many people use the tool, I imagine we could build up a large, moderated dataset very quickly.
Looking forward to hearing your responses :)