My goal is to find the pre-processing method and model that together produce the highest F1 score; I use F1 as the evaluation metric throughout. The candidate methods for handling missing values are:
- use mode to fill the missing values
- use mean to fill the missing values
- use median to fill the missing values
- use KNN to fill the missing values
- use MICE to fill the missing values
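These five strategies map directly onto scikit-learn imputers; MICE corresponds to `IterativeImputer`. A minimal sketch (the tiny DataFrame is just for illustration, not the project data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy frame with one missing value per column, purely for illustration.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

imputers = {
    "mode": SimpleImputer(strategy="most_frequent"),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=2),
    "mice": IterativeImputer(random_state=0),  # MICE-style iterative imputation
}

# Each imputer returns a NumPy array with the NaNs filled in.
filled = {name: imp.fit_transform(df) for name, imp in imputers.items()}
```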
I first tested 13 models (decision_tree, logistic_reg, knn, naive_bayes, svm_rbf, mlp, xgb, ada_boost, qda, random_forest, gradient_boosting, bagging_tree, ridge_classifier) with mode imputation only, planning to pick the best few and test the other pre-processing methods on them. The best models turned out to be Gradient Boosting, Random Forest, and AdaBoost:
{'Decision Trees': 0.811615573745736,
'Logistic Regression': 0.6102598274535913,
'k-NN': 0.6292970132676996,
'Naive Bayes': 0.6473989565718904,
'SVM-RBF': 0.0,
'Neural Networks': 0.43195726295515147,
'XGBoost': nan,
'AdaBoost': 0.862096399063032,
'QDA': 0.6455337976232667,
'Random Forest': 0.8703740067835521,
'Gradient Boosting': 0.8895496960256934,
'Bagging': 0.8591120317627526,
'Ridge Classifier': 0.598582012704433}
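A sweep like the one behind this dictionary might be scripted as follows (a sketch with synthetic data and an abbreviated model list; the full run is in model_selection.py):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project data, just to show the loop shape.
X, y = make_classification(n_samples=300, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Mean cross-validated F1 per model, mirroring the dictionary above.
f1_scores = {name: cross_val_score(m, X, y, cv=5, scoring="f1").mean()
             for name, m in models.items()}
```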
I then cross-checked these results with the negated MSE score (values are negative, so closer to zero is better) and found that the best-performing model is XGBoost:
{'Decision Trees': -0.06122657100859856,
'Logistic Regression': -0.10640665341467315,
'k-NN': -0.10727138407171308,
'Naive Bayes': -0.10862819018895513,
'SVM-RBF': -0.1607208627504741,
'Neural Networks': -0.15997768486150143,
'XGBoost': -0.029255832019558115,
'AdaBoost': -0.042957631701205624,
'QDA': -0.14800862141187043,
'Random Forest': -0.03925476576720664,
'Gradient Boosting': -0.033700124142238067,
'Bagging': -0.04345183966611069,
'Ridge Classifier': -0.09986420514695242}
Code shown in model_selection.py.
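The uniformly negative values above are consistent with scikit-learn's `neg_mean_squared_error` scorer, where scores closer to zero are better. A minimal sketch of that check, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the project data.
X, y = make_classification(n_samples=300, random_state=0)

# neg_mean_squared_error returns MSE with the sign flipped so that
# "higher is better" holds; a value nearer zero means fewer label errors.
neg_mse = cross_val_score(GradientBoostingClassifier(random_state=0),
                          X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()
```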
However, the F1 score came back as NaN for XGBoost in the full comparison, so I computed it separately for XGBoost.
The result is shown in just_xgboost.py:
Average F1 Score for XGBoost: 0.9062059272216227
This is the highest F1 score among all the models.
Based on the MSE and F1 scores, I chose XGBoost, Gradient Boosting, and Random Forest as my final candidates. I tested mean imputation, mode imputation, KNN imputation, and simply dropping rows with missing values on each of them, and also included AdaBoost to see whether it would outperform the others.
However, all of these models performed the same regardless of the pre-processing method.
Here is a table of the F1 results:

| | Mean imputation | Mode imputation | KNN imputation | Drop missing values |
|---|---|---|---|---|
| XGBoost | 0.9062059272216227 | 0.9062059272216227 | 0.9062059272216227 | 0.9062059272216227 |
| Gradient Boosting | 0.9801164258901587 | 0.9801164258901587 | 0.9801164258901587 | 0.9801164258901587 |
| AdaBoost | 0.974557779568498 | 0.974557779568498 | 0.974557779568498 | 0.974557779568498 |
| Random Forest | 0.9768700784256525 | 0.9768700784256525 | 0.9768700784256525 | 0.9768700784256525 |
Code shown in preprocess_selection.py.
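One way to structure such a sweep is a pipeline per imputer, so imputation is fit only on the training folds of each split (a sketch with synthetic data and injected missing values; the full version is in preprocess_selection.py):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with ~5% of entries set to NaN.
X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "mode": SimpleImputer(strategy="most_frequent"),
    "knn": KNNImputer(n_neighbors=5),
}

# Mean cross-validated F1 per imputation method for one model.
results = {name: cross_val_score(
               make_pipeline(imp, GradientBoostingClassifier(random_state=0)),
               X, y, cv=5, scoring="f1").mean()
           for name, imp in imputers.items()}
```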
Since every model performed the same regardless of pre-processing method, I went with the simplest option, mode imputation. For the final model I chose Gradient Boosting, which has the highest F1 score.
The result is shown in P1_test_output.csv, generated by running result_generation.py.
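A minimal sketch of what that final step might look like (mode imputation plus Gradient Boosting, with predictions written to CSV; the data, split, and column name here are placeholders, not the actual result_generation.py):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the train/test data.
X, y = make_classification(n_samples=200, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# Final pipeline: mode imputation followed by Gradient Boosting.
model = make_pipeline(SimpleImputer(strategy="most_frequent"),
                      GradientBoostingClassifier(random_state=0))
model.fit(X_train, y_train)

# Write predictions out, one row per test sample.
pd.DataFrame({"prediction": model.predict(X_test)}).to_csv(
    "P1_test_output.csv", index=False)
```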