The goal of this project is to classify Iris species. To achieve this, I first explore the data with exploratory data analysis and draw insights from it. Baseline models let me analyse the errors the models make, and I address these errors with feature engineering. To achieve higher accuracy, I perform hyperparameter tuning. Finally, I check the scores with learning curves and compare the models.
- Python
- Scikit-Learn
- Pandas
- Seaborn
- Matplotlib
- NumPy
- Jupyter
The dataset comes from the Kaggle "Iris Species" dataset. It contains measurements of sepal and petal length and width for three Iris species.
https://www.kaggle.com/datasets/uciml/iris
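A minimal sketch of loading the data with Pandas, assuming the Kaggle CSV is saved locally as `Iris.csv` (the file name used in that Kaggle dataset):

```python
import pandas as pd

# Load the Kaggle Iris CSV; columns: Id, SepalLengthCm, SepalWidthCm,
# PetalLengthCm, PetalWidthCm, Species.
df = pd.read_csv("Iris.csv")

# The Id column is just a running index, so drop it.
df = df.drop(columns=["Id"])

print(df.head())
print(df["Species"].value_counts())
```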
Boxplots show that one of the three species has clearly different sepal values. I confirmed this in the plot of the relationship between petal width and petal length.
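A sketch of the kind of plots described above, using Seaborn; `df` is the DataFrame from the loading sketch, and the column names match the Kaggle CSV:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot of a sepal feature per species; one species stands apart.
sns.boxplot(data=df, x="Species", y="SepalLengthCm")
plt.show()

# Relationship between petal length and width, coloured by species.
sns.scatterplot(data=df, x="PetalLengthCm", y="PetalWidthCm", hue="Species")
plt.show()
```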
In this section I train a few different models with their default settings and then compare their accuracy.
In my runs, SVC has the highest accuracy at 96%, followed by Logistic Regression, KNN, and Decision Tree.
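A minimal sketch of how the baseline models with default settings could be trained and compared. The train/test split parameters here are assumptions rather than the author's exact setup, so the scores may differ slightly:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Features and target from the DataFrame loaded earlier.
X = df.drop(columns=["Species"])
y = df["Species"]

# Assumed split; the original project may use different proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Fit every model with default settings and compare test accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.2%}")
```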
I use confusion matrices to analyse the errors of the models and improve their performance.
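A sketch of how the matrices below could be produced, reusing `models`, `X_test`, and `y_test` from the baseline sketch; the counts shown below come from the author's own split:

```python
from sklearn.metrics import confusion_matrix

# Print one confusion matrix per fitted model.
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"Confusion Matrix: {name}")
    print(confusion_matrix(y_test, y_pred))
```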
Confusion Matrix: Logistic Regression
[[38 0 0]
[ 0 39 3]
[ 0 3 37]]
Confusion Matrix: KNN
[[38 0 0]
[ 0 40 2]
[ 0 4 36]]
Confusion Matrix: SVM
[[38 0 0]
[ 0 40 2]
[ 0 2 38]]
Confusion Matrix: Decision Tree
[[38 0 0]
[ 0 38 4]
[ 0 4 36]]
Confusion Matrix: Random Forest
[[38 0 0]
[ 0 38 4]
[ 0 4 36]]
“Iris-setosa” has no errors, but “Iris-versicolor” and “Iris-virginica” are confused with each other a few times. This matches the plots I made in the EDA section. To better distinguish these two classes, I use feature engineering.
I create new features by raising the petal features to the third power and dividing them by the other features.
This results in higher performance of the models.
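A possible implementation of such features; the exact column combinations and feature names are assumptions based on the description above, not the author's exact formulas:

```python
# Hypothetical engineered features: petal measurements raised to the third
# power and divided by other measurements.
df["PetalLength3_div_SepalLength"] = df["PetalLengthCm"] ** 3 / df["SepalLengthCm"]
df["PetalWidth3_div_SepalWidth"] = df["PetalWidthCm"] ** 3 / df["SepalWidthCm"]
```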
For a few models, the learning curves indicate a good fit. However, I can still improve them during hyperparameter tuning by decreasing the bias. The next two models show overfitting; to improve their performance, the high variance should be reduced.
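A sketch of plotting a learning curve for one of the models with scikit-learn, reusing `X` and `y` from the baseline sketch; the cross-validation setting and training sizes are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Compute training and validation scores for increasing training set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    SVC(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10), scoring="accuracy"
)

plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```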
Almost every model achieves a better validation score after tuning.
On the left: the model with hyperparameter tuning.
On the right: the default model.
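A sketch of hyperparameter tuning with `GridSearchCV` for the SVC model; the parameter grid is an assumed example, not the author's exact search space:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumed search space for SVC.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.1, 0.01],
    "kernel": ["rbf", "linear"],
}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)

# Compare the tuned model with the default one on the held-out test set.
print("Test accuracy (tuned):", grid.score(X_test, y_test))
print("Test accuracy (default):", SVC().fit(X_train, y_train).score(X_test, y_test))
```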