Description
Now that profile classification results are stored in a database, we can use them as a dataset for a machine learning model that predicts whether a profile is spam. We can use the predictions to mark profiles with a high probability of being spam and, once the model reaches high enough accuracy (or whatever other metric we decide to look at), use it to auto-report spam profiles to SOAP.
Proposed solution
This task tracks the initial implementation of a machine learning model that can be trained on the existing database and achieve good enough results. The procedure is as follows:
Data Collection: Wait for the dataset to grow large enough. As of writing this, there are about 1000 classified spam profiles and about 6000 classified non-spam profiles, and the system has been running since August 17 (25 days), which is about 40 spam profiles and 240 non-spam profiles per day. At this rate, there should be about 10000 spam profiles in roughly 250 days, or a little over eight months. (I'm not sure if we really need to wait that long.)
Feature Extraction: Decide which features from the dataset to use in the model. Regardless of whether we use a neural network for the model or not, most of the profile data we have is in string form, which needs to be converted into numeric features before being fed into the model.
Training: Create a model and train it on the dataset. Try several different approaches and parameters and see which work best.
Integration: Load the trained model into KockaLogger and show prediction results in the reports channel, putting a mark on those predicted as likely spam (above a certain threshold of certainty).
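To make the feature extraction, training and thresholding steps above more concrete, here is a minimal sketch in plain Python. It is not a decision about the final approach: it assumes character trigrams as the string-to-feature transformation and a multinomial Naive Bayes classifier as the model, and the names `char_ngrams`, `NaiveBayesSpamModel`, `flag_profile` and `SPAM_THRESHOLD`, as well as the four training strings, are all made up for illustration. A real version would read from the classification database and would probably use an existing machine learning library instead.

```python
import math
from collections import Counter

SPAM_THRESHOLD = 0.9  # hypothetical cut-off, to be tuned on real data


def char_ngrams(text, n=3):
    """Turn a string field into overlapping lowercase character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]


class NaiveBayesSpamModel:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self):
        self.ngram_counts = {True: Counter(), False: Counter()}
        self.profile_counts = {True: 0, False: 0}
        self.vocabulary = set()

    def train(self, profiles):
        """profiles: iterable of (text, is_spam) pairs; needs both classes."""
        for text, is_spam in profiles:
            grams = char_ngrams(text)
            self.ngram_counts[is_spam].update(grams)
            self.profile_counts[is_spam] += 1
            self.vocabulary.update(grams)

    def spam_probability(self, text):
        """Return the estimated probability that the text is spam."""
        total = sum(self.profile_counts.values())
        grams = char_ngrams(text)
        log_scores = {}
        for label in (True, False):
            counts = self.ngram_counts[label]
            # Add-one (Laplace) smoothing so unseen n-grams are not fatal.
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(self.profile_counts[label] / total)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denominator)
            log_scores[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_scores[False] - log_scores[True], 700), -700)
        return 1 / (1 + math.exp(diff))


# Invented toy data standing in for rows from the classification database.
training_profiles = [
    ("buy cheap watches online best price", True),
    ("cheap loans online apply now best rates", True),
    ("i edit articles about video games", False),
    ("fan of anime and wiki editing", False),
]

model = NaiveBayesSpamModel()
model.train(training_profiles)


def flag_profile(text):
    """Return (probability, marked) where marked means 'likely spam'."""
    probability = model.spam_probability(text)
    return probability, probability >= SPAM_THRESHOLD
```

Calling `flag_profile(profile_text)` gives both the probability to show in the reports channel and whether the profile should get the "likely spam" mark. The threshold is kept separate from the model so it can be raised or lowered without retraining.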
Notes
I'm not that skilled in machine learning at the time of writing this issue.