
MARCOpo1o/Classifier-GradientBoosting


Produce the highest F1 Score

Intuition

I want to find the pre-processing method and model combination that produces the highest F1 score, which is the metric I use to evaluate model performance.

Pre-processing Methods Considered

  1. Use the mode to fill in missing values
  2. Use the mean to fill in missing values
  3. Use the median to fill in missing values
  4. Use KNN (k-nearest neighbors) imputation to fill in missing values
  5. Use MICE (Multiple Imputation by Chained Equations) to fill in missing values
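Each of these imputation strategies is available in scikit-learn. Below is a minimal sketch on a tiny hypothetical array (the real data and parameters live in the project's scripts); note that scikit-learn's IterativeImputer is a MICE-style imputer and is still marked experimental:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# IterativeImputer (a MICE-style imputer) must be explicitly enabled.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Tiny stand-in array with missing values; the project uses its own dataset.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

imputers = {
    "mode":   SimpleImputer(strategy="most_frequent"),
    "mean":   SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn":    KNNImputer(n_neighbors=2),
    "mice":   IterativeImputer(random_state=0),
}

for name, imp in imputers.items():
    X_filled = imp.fit_transform(X)
    assert not np.isnan(X_filled).any()  # every strategy fills all gaps
    print(name, X_filled)
```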

Model Selection

Initial Filtering

I first tested 13 models (decision_tree, logistic_reg, knn, naive_bayes, svm_rbf, mlp, xgb, ada_boost, qda, random_forest, gradient_boosting, bagging_tree, ridge_classifier) with just the mode pre-processing method, planning to pick the best few and try the other pre-processing methods on them.

The best-performing models turned out to be Gradient Boosting, Random Forest, and AdaBoost:

{'Decision Trees': 0.811615573745736, 
'Logistic Regression': 0.6102598274535913, 
'k-NN': 0.6292970132676996, 
'Naive Bayes': 0.6473989565718904, 
'SVM-RBF': 0.0, 
'Neural Networks': 0.43195726295515147, 
'XGBoost': nan, 
'AdaBoost': 0.862096399063032, 
'QDA': 0.6455337976232667, 
'Random Forest': 0.8703740067835521, 
'Gradient Boosting': 0.8895496960256934, 
'Bagging': 0.8591120317627526, 
'Ridge Classifier': 0.598582012704433}
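The comparison loop behind these numbers can be sketched roughly as below. This is a minimal version with three of the thirteen models and synthetic stand-in data from make_classification; the actual models, data, and hyperparameters are in model_selection.py:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the project uses its own dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Mean cross-validated F1 score per model.
scores = {name: cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
          for name, clf in models.items()}
print(scores)
```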

MSE Verification

I then cross-checked the ranking with the (negated) mean squared error, where values closer to zero are better, and found that the best-performing model by this metric is XGBoost:

{'Decision Trees': -0.06122657100859856, 
'Logistic Regression': -0.10640665341467315, 
'k-NN': -0.10727138407171308, 
'Naive Bayes': -0.10862819018895513, 
'SVM-RBF': -0.1607208627504741, 
'Neural Networks': -0.15997768486150143, 
'XGBoost': -0.029255832019558115, 
'AdaBoost': -0.042957631701205624, 
'QDA': -0.14800862141187043, 
'Random Forest': -0.03925476576720664, 
'Gradient Boosting': -0.033700124142238067, 
'Bagging': -0.04345183966611069, 
'Ridge Classifier': -0.09986420514695242}
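The negative values above are consistent with scikit-learn's "neg_mean_squared_error" scorer, which negates the MSE so that "greater is better" holds for every metric. A minimal sketch with one model and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the project uses its own dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# scikit-learn negates MSE so every scorer follows "greater is better";
# a score closer to zero therefore means a lower error.
neg_mse = cross_val_score(GradientBoostingClassifier(random_state=0),
                          X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()
print(neg_mse)
```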

The code is in model_selection.py.

Special Case: XGBoost

However, the F1 score came back as NaN for XGBoost in the comparison above, so I computed the F1 score for XGBoost separately.

I was able to get a result this way, as shown in just_xgboost.py:

Average F1 Score for XGBoost: 0.9062059272216227

That is the highest score among all the models so far.
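When cross_val_score's scorer raises inside a fold, the default error_score turns that fold's result into NaN, which is one plausible cause of the missing XGBoost score. A workaround is to compute the F1 score fold by fold. The sketch below uses GradientBoostingClassifier as a stand-in (the same loop works with xgboost's XGBClassifier, which just_xgboost.py presumably uses) and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data; the project uses its own dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Compute F1 per fold explicitly instead of relying on scoring="f1",
# so any scorer failure raises loudly rather than becoming NaN.
scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

print(sum(scores) / len(scores))
```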

Final Model Selection

Based on the MSE and F1 scores, I decided to use XGBoost, Gradient Boosting, and Random Forest as my final candidate models.

Pre-processing Method Selection

I tested mean imputation, mode imputation, KNN imputation, and simply dropping the missing values on XGBoost, Gradient Boosting, and Random Forest. I also threw in AdaBoost to see if it would outperform the others.

However, every model produced an identical F1 score regardless of the pre-processing method.

Here is a table of the results:

| Model | Mean imputation | Mode imputation | KNN imputation | Drop missing values |
| --- | --- | --- | --- | --- |
| XGBoost | 0.9062059272216227 | 0.9062059272216227 | 0.9062059272216227 | 0.9062059272216227 |
| Gradient Boosting | 0.9801164258901587 | 0.9801164258901587 | 0.9801164258901587 | 0.9801164258901587 |
| AdaBoost | 0.974557779568498 | 0.974557779568498 | 0.974557779568498 | 0.974557779568498 |
| Random Forest | 0.9768700784256525 | 0.9768700784256525 | 0.9768700784256525 | 0.9768700784256525 |
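The imputer-vs-model grid can be evaluated by putting each imputer in front of the classifier in a pipeline, so imputation is fit only on each training fold. A minimal sketch with one model, three imputers, and synthetic data that has values masked out to simulate missingness (the real grid is in preprocess_selection.py):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with ~5% of entries masked as missing.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "mode": SimpleImputer(strategy="most_frequent"),
    "knn":  KNNImputer(),
}

for name, imp in imputers.items():
    # Pipeline ensures the imputer is fit on training folds only.
    pipe = make_pipeline(imp, GradientBoostingClassifier(random_state=0))
    f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
    print(name, round(f1, 4))
```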

The code is in preprocess_selection.py.

Final Decision

Since all the models performed the same regardless of the pre-processing method, I decided to use the simplest method, mode imputation. I will use Gradient Boosting as my final model since it has the highest F1 score.
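Put together, the final choice amounts to a two-step pipeline: mode imputation followed by Gradient Boosting. A minimal sketch on synthetic stand-in data (hyperparameters are illustrative, not the ones used in the repo):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Final pipeline: mode imputation, then Gradient Boosting.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    GradientBoostingClassifier(random_state=0),
)

# Synthetic stand-in data; the project trains on its own dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model.fit(X, y)
preds = model.predict(X)
```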

Final Result

The final predictions are in P1_test_output.csv, generated by running result_generation.py.
