newusers: Spam prediction #64

Open
KockaAdmiralac opened this issue Sep 11, 2023 · 0 comments

@KockaAdmiralac
Owner

Description

Now that the profile classification results are stored in a database, we can use them as a dataset for a machine learning model that predicts whether a profile is spam. We can use the prediction results to mark profiles with a high probability of being spam and, once the model reaches high enough accuracy (or whatever other metric we decide to look at), use it to auto-report spam profiles to SOAP.
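
Since the choice of metric is left open here, below is a minimal sketch (not part of KockaLogger) of how accuracy could be compared with precision and recall. The function names are made up for illustration. The point is that, with the roughly 1:6 spam to non-spam ratio reported under "Proposed solution", a model that never predicts spam still scores high on accuracy, so precision and recall may be more useful before enabling auto-reporting.

```python
from typing import Sequence, Tuple


def accuracy(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Fraction of profiles whose predicted class matches the stored class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def precision_recall(
    labels: Sequence[bool], predictions: Sequence[bool]
) -> Tuple[float, float]:
    """Precision and recall for the spam class (True means spam)."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y and p)
    false_pos = sum(1 for y, p in zip(labels, predictions) if not y and p)
    false_neg = sum(1 for y, p in zip(labels, predictions) if y and not p)
    # Guard against division by zero when nothing is predicted or labeled spam.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


if __name__ == "__main__":
    # One spam profile for every six non-spam profiles, as in the current dataset.
    labels = [True] + [False] * 6
    never_spam = [False] * 7  # a useless model that never predicts spam
    print(accuracy(labels, never_spam))          # 6/7, about 0.857
    print(precision_recall(labels, never_spam))  # (0.0, 0.0)
```

For auto-reporting to SOAP, precision is probably the number to watch, since a false positive there means reporting a legitimate user.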

Proposed solution

This task tracks the initial implementation of a machine learning model that can be trained on the existing database and achieve good enough results. The procedure is as follows:

  1. Data Collection: Wait for the dataset to grow large enough. As of writing this, there are about 1000 classified spam profiles and about 6000 classified non-spam profiles, and the system has been running since August 17 (25 days), which is about 40 spam profiles and 240 non-spam profiles per day. At this rate of about 280 profiles per day, there should be about 100000 profiles (roughly 14000 of them spam) in about a year. (I'm not sure if we really need to wait that long.)
  2. Feature Extraction: Decide which features from the dataset to use in the model. Regardless of whether we use a neural network for the model, most of the profile data we have is in string form, which needs to be transformed somehow before being fed into the model.
  3. Training: Create a model and train it on the dataset. Try several different approaches and parameters and see which work best.
  4. Integration: Load the trained model into KockaLogger and show prediction results in the reports channel, putting a mark on those predicted as likely spam (above a certain threshold of certainty).
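
To make steps 2 and 4 a bit more concrete, here is a rough sketch of what turning string fields into numbers and applying a threshold could look like. Everything in it is an assumption: the field names (`username`, `bio`, `website`) are hypothetical and may not match the actual database columns, the chosen features are only examples, and the 0.9 threshold is a placeholder. The model that produces the probability (step 3) is not shown.

```python
import re
from typing import Dict

# Matches the start of an http:// or https:// link.
URL_PATTERN = re.compile(r"https?://", re.IGNORECASE)


def extract_features(profile: Dict[str, str]) -> Dict[str, float]:
    """Turn string profile fields into numeric features.

    The field names used here are hypothetical; the real schema may differ.
    """
    username = profile.get("username", "")
    bio = profile.get("bio", "")
    website = profile.get("website", "")
    digits = sum(ch.isdigit() for ch in username)
    return {
        "username_length": float(len(username)),
        "username_digit_ratio": digits / len(username) if username else 0.0,
        "bio_length": float(len(bio)),
        "bio_url_count": float(len(URL_PATTERN.findall(bio))),
        "has_website": 1.0 if website else 0.0,
    }


def is_likely_spam(probability: float, threshold: float = 0.9) -> bool:
    """Mark a profile when the predicted spam probability is at or above the threshold."""
    return probability >= threshold


if __name__ == "__main__":
    example = {
        "username": "abc123",
        "bio": "Visit http://a.example and https://b.example",
        "website": "",
    }
    print(extract_features(example))
    print(is_likely_spam(0.95), is_likely_spam(0.5))
```

Hand-picked features like these are easy to inspect; a bag-of-words or character n-gram representation of the text fields would be another option to try during training.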

Notes

I'm not that skilled in machine learning at the time of writing this issue.

@KockaAdmiralac KockaAdmiralac added the feature New feature or request label Sep 11, 2023
@KockaAdmiralac KockaAdmiralac self-assigned this Sep 11, 2023