This is the repository of the project "Methods of detecting anomalies for improving the quality of machine learning algorithms."
Since anomaly detection algorithms cannot work directly with categorical values or NaNs, encoding and imputation models are applied first. These methods are a necessary prerequisite for running the anomaly detection algorithms, and studying their effect is the purpose of this research.
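A minimal sketch of this preprocessing-then-detection pipeline, assuming scikit-learn and category_encoders are installed. The concrete encoder, imputer, and detector below are illustrative choices, not the repository's fixed configuration:

```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest

def detect_anomalies(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    # 1. Encode categorical features (detectors require numeric input).
    X_num = TargetEncoder().fit_transform(X, y)
    # 2. Impute missing values (detectors cannot handle NaNs).
    X_full = SimpleImputer(strategy="median").fit_transform(X_num)
    # 3. Run the anomaly detector on the fully numeric, complete matrix.
    flags = IsolationForest(random_state=0).fit_predict(X_full)  # -1 = anomaly
    return pd.Series(flags, index=X.index)
```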
Total number of combinations of encoding, imputation and anomaly detection models: 111.
The performance of the methods is measured by their effect on the final RMSE when solving the regression problem with LightGBM. The mean and standard deviation of RMSE over 10-fold cross-validation are reported.
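A sketch of this evaluation protocol, assuming lightgbm and scikit-learn are available; the default LGBMRegressor parameters are an assumption, and X, y stand for the already preprocessed feature matrix and target:

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score

def cv_rmse(X, y, n_folds: int = 10):
    """Mean and std of RMSE over n_folds-fold cross-validation."""
    scores = cross_val_score(
        LGBMRegressor(), X, y,
        cv=n_folds, scoring="neg_root_mean_squared_error",
    )
    rmse = -scores  # scores are negated RMSE values
    return rmse.mean(), rmse.std()
```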
Evaluation metrics were calculated on the "house-prices-advanced-regression-techniques" dataset. Missing values: 6965 / 115340 (6.039%). The following ratios of missing ("NaN") data are used (a sketch of injecting extra missingness follows the list):
- 6% of missing data (initial dataset)
- 30% of missing data
- 50% of missing data
(The "house-prices-advanced-regression-techniques" dataset contains missing values in every row, so Complete-Case Analysis is not applicable.)
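The repository's exact masking procedure is not described here; a plausible sketch randomly masks observed cells so that the overall NaN share reaches the target ratio in expectation:

```python
import numpy as np
import pandas as pd

def add_missing(X: pd.DataFrame, target_ratio: float, seed: int = 0) -> pd.DataFrame:
    """Randomly mask observed cells so the overall NaN share reaches target_ratio."""
    rng = np.random.default_rng(seed)
    current = X.isna().to_numpy().mean()
    # Mask each cell with probability p so that, in expectation,
    # current + (1 - current) * p == target_ratio.
    # (Masking an already-NaN cell is a no-op.)
    p = max(0.0, (target_ratio - current) / (1.0 - current))
    return X.mask(rng.random(X.shape) < p)
```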
The results of the experiments are available in the file: house-prices-scheme-experiments.xlsx
Best combinations for the anomaly detection algorithms are listed below (a sketch of how the anomaly indicator and its importance might be computed follows the table):
| Combination | RMSE, mean (std) | Anomaly indicator importance, mean (std) |
|---|---|---|
| **6% of missing data (initial)** | | |
| 1. Default model | 27582.549 (5511.916) | - |
| 2. Target Encoding + Median imputation | 27535.075 (5445.082) | - |
| 3. Elliptic Envelope (Target Encoding + Median imputation) | 27517.042 (5544.918) | 0.0000518 (0.0000541) |
| 4. One-Class SVM (Target Encoding + Median imputation) | 27467.854 (5394.370) | -0.0000723 (0.0001001) |
| 5. Isolation Forest (Target Encoding + Mode imputation) | 27451.540 (5365.327) | -0.0016849 (0.0059135) |
| 6. Local Outlier Factor (Target Encoding + K-NN imputation) | 27537.623 (5437.554) | 0.0000096 (0.0000337) |
| 7. (Best) MICE | 27391.502 (5490.600) | - |
| **30% of missing data** | | |
| 1. Default model | 33165.202 (4617.468) | - |
| 2. Target Encoding + MICE | 30188.023 (5813.186) | - |
| 3. Elliptic Envelope (Target Encoding + MICE) | 30094.619 (5741.747) | -0.0003155 (0.0000924) |
| 4. One-Class SVM (Target Encoding + MICE) | 30142.952 (5766.397) | -0.0004070 (0.0003780) |
| 5. (Best) Isolation Forest (Target Encoding + MICE) | 30037.433 (5642.734) | -0.0095051 (0.0035107) |
| 6. Local Outlier Factor (Target Encoding + MICE) | 30108.693 (5764.039) | -0.0000030 (0.0000933) |
| **50% of missing data** | | |
| 1. Default model | 40571.481 (5936.180) | - |
| 2. Target Encoding + MICE | 35892.779 (5715.790) | - |
| 3. Elliptic Envelope (Target Encoding + MICE) | 35725.265 (5618.712) | 0.0001082 (0.0001855) |
| 4. (Best) One-Class SVM (Target Encoding + MICE) | 35629.558 (5657.682) | -0.0003512 (0.0002492) |
| 5. Isolation Forest (Target Encoding + MICE) | 35720.589 (5642.171) | -0.0181765 (0.0067545) |
| 6. Local Outlier Factor (Target Encoding + MICE) | 35775.763 (5683.544) | 0.0001991 (0.0001781) |
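The exact definition of the anomaly indicator importance is not spelled out in this README. The sketch below makes one plausible reading explicit: the detector's prediction is appended as an extra binary feature, and its importance is scored with scikit-learn's permutation_importance (a real function, but treating it as the repository's method is an assumption). Small negative values, as in the table, are consistent with this reading, since permuting an uninformative feature can slightly improve the score:

```python
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.ensemble import IsolationForest
from sklearn.inspection import permutation_importance

def anomaly_indicator_importance(X: pd.DataFrame, y, seed: int = 0):
    """Append a binary anomaly flag to X and return its permutation importance.

    X is assumed to be already encoded and imputed (fully numeric, no NaNs);
    Isolation Forest stands in for any of the detectors in the table.
    """
    X = X.copy()
    flags = IsolationForest(random_state=seed).fit_predict(X)  # -1 = anomaly
    X["is_anomaly"] = (flags == -1).astype(int)
    model = LGBMRegressor(random_state=seed).fit(X, y)
    result = permutation_importance(model, X, y, n_repeats=10, random_state=seed)
    i = X.columns.get_loc("is_anomaly")
    return result.importances_mean[i], result.importances_std[i]
```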
- Using the pipeline described above is beneficial: the RMSE values improve in all cases.
- Best preprocessing combination: Target Encoding + MICE (see the sketch at the end of this section).
- Anomaly detection methods reduce RMSE compared to the default data and to encoding & imputation alone.
- There is a difference between the previous and the current way of processing categorical features.
- (Unclear) When data are dropped, the standard deviation of RMSE does not increase. Does this mean that the dataset contains enough data and has well-described distributions?
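A minimal sketch of the best-performing preprocessing combination, assuming category_encoders for Target Encoding and approximating MICE with scikit-learn's IterativeImputer (a MICE-style iterative imputer that is still marked experimental, hence the extra import):

```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def target_encode_mice(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    # Categorical features -> numeric via target statistics.
    X_enc = TargetEncoder().fit_transform(X, y)
    # NaNs -> iteratively modeled estimates (MICE-style round-robin regression).
    X_imp = IterativeImputer(random_state=0).fit_transform(X_enc)
    return pd.DataFrame(X_imp, index=X_enc.index, columns=X_enc.columns)
```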