My goal is to find the pre-processing method and model that together produce the highest F1 score; I use F1 as the evaluation metric throughout. The candidate methods for handling missing values are:
- use mode to fill the missing values
- use mean to fill the missing values
- use median to fill the missing values
- use KNN to fill the missing values
- use MICE to fill the missing values
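These five strategies map directly onto scikit-learn imputers; MICE corresponds to `IterativeImputer`. A minimal sketch (the tiny DataFrame is just for illustration, not the project data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy frame with one missing value per column, purely for illustration.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

imputers = {
    "mode": SimpleImputer(strategy="most_frequent"),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=2),
    "mice": IterativeImputer(random_state=0),  # MICE-style iterative imputation
}

# Each imputer returns a NumPy array with the NaNs filled in.
filled = {name: imp.fit_transform(df) for name, imp in imputers.items()}
```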
I first tested 13 models (decision_tree, logistic_reg, knn, naive_bayes, svm_rbf, mlp, xgb, ada_boost, qda, random_forest, gradient_boosting, bagging_tree, ridge_classifier) with mode imputation only, planning to pick the best few and test the other pre-processing methods on them. The best models turned out to be Gradient Boosting, Random Forest, and AdaBoost:
{'Decision Trees': 0.811615573745736,
'Logistic Regression': 0.6102598274535913,
'k-NN': 0.6292970132676996,
'Naive Bayes': 0.6473989565718904,
'SVM-RBF': 0.0,
'Neural Networks': 0.43195726295515147,
'XGBoost': nan,
'AdaBoost': 0.862096399063032,
'QDA': 0.6455337976232667,
'Random Forest': 0.8703740067835521,
'Gradient Boosting': 0.8895496960256934,
'Bagging': 0.8591120317627526,
'Ridge Classifier': 0.598582012704433}
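A sweep like the one behind this dictionary might be scripted as follows (a sketch with synthetic data and an abbreviated model list; the full run is in model_selection.py):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project data, just to show the loop shape.
X, y = make_classification(n_samples=300, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Mean cross-validated F1 per model, mirroring the dictionary above.
f1_scores = {name: cross_val_score(m, X, y, cv=5, scoring="f1").mean()
             for name, m in models.items()}
```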
I then cross-checked these results with the negated MSE score (values are negative, so closer to zero is better) and found that the best-performing model is XGBoost:
{'Decision Trees': -0.06122657100859856,
'Logistic Regression': -0.10640665341467315,
'k-NN': -0.10727138407171308,
'Naive Bayes': -0.10862819018895513,
'SVM-RBF': -0.1607208627504741,
'Neural Networks': -0.15997768486150143,
'XGBoost': -0.029255832019558115,
'AdaBoost': -0.042957631701205624,
'QDA': -0.14800862141187043,
'Random Forest': -0.03925476576720664,
'Gradient Boosting': -0.033700124142238067,
'Bagging': -0.04345183966611069,
'Ridge Classifier': -0.09986420514695242}
Code shown in model_selection.py.
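The uniformly negative values above are consistent with scikit-learn's `neg_mean_squared_error` scorer, where scores closer to zero are better. A minimal sketch of that check, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project data.
X, y = make_classification(n_samples=300, random_state=0)

# neg_mean_squared_error returns MSE with the sign flipped so that
# "higher is better" holds; a value nearer zero means fewer label errors.
neg_mse = cross_val_score(GradientBoostingClassifier(random_state=0),
                          X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()
```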
However, the F1 score came back as NaN for XGBoost in the full comparison, so I computed it separately for XGBoost.
The result is shown in just_xgboost.py:
Average F1 Score for XGBoost: 0.9062059272216227
This is the highest F1 score among all the models.
Based on the MSE and F1 scores, I chose XGBoost, Gradient Boosting, and Random Forest as my final candidates. I tested mean imputation, mode imputation, KNN imputation, and simply dropping rows with missing values on each of them, and also included AdaBoost to see whether it would outperform the others.
However, all of these models performed the same regardless of the pre-processing method.
Here is a table of the F1 results:

| | Mean imputation | Mode imputation | KNN imputation | Drop missing values |
|---|---|---|---|---|
| XGBoost | 0.9062059272216227 | 0.9062059272216227 | 0.9062059272216227 | 0.9062059272216227 |
| Gradient Boosting | 0.9801164258901587 | 0.9801164258901587 | 0.9801164258901587 | 0.9801164258901587 |
| AdaBoost | 0.974557779568498 | 0.974557779568498 | 0.974557779568498 | 0.974557779568498 |
| Random Forest | 0.9768700784256525 | 0.9768700784256525 | 0.9768700784256525 | 0.9768700784256525 |
Code shown in preprocess_selection.py.
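One way to structure such a sweep is a pipeline per imputer, so imputation is fit only on the training folds of each split (a sketch with synthetic data and injected missing values; the full version is in preprocess_selection.py):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with ~5% of entries set to NaN.
X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "mode": SimpleImputer(strategy="most_frequent"),
    "knn": KNNImputer(n_neighbors=5),
}

# Mean cross-validated F1 per imputation method for one model.
results = {name: cross_val_score(
               make_pipeline(imp, GradientBoostingClassifier(random_state=0)),
               X, y, cv=5, scoring="f1").mean()
           for name, imp in imputers.items()}
```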
Since every model performed the same regardless of pre-processing method, I went with the simplest option, mode imputation. For the final model I chose Gradient Boosting, which has the highest F1 score.
The result is shown in P1_test_output.csv, generated by running result_generation.py.
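A minimal sketch of what that final step might look like (mode imputation plus Gradient Boosting, with predictions written to CSV; the data, split, and column name here are placeholders, not the actual result_generation.py):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the train/test data.
X, y = make_classification(n_samples=200, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# Final pipeline: mode imputation followed by Gradient Boosting.
model = make_pipeline(SimpleImputer(strategy="most_frequent"),
                      GradientBoostingClassifier(random_state=0))
model.fit(X_train, y_train)

# Write predictions out, one row per test sample.
pd.DataFrame({"prediction": model.predict(X_test)}).to_csv(
    "P1_test_output.csv", index=False)
```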