This project combines data preprocessing, visualization, and machine learning (Decision Tree and Logistic Regression) to classify Iris species from sepal and petal features. It emphasizes model interpretability and accuracy through feature selection and performance evaluation.

DA-Atharv/Iris-Species-Classification-and-Model-Evaluation

Iris-Species-Classification-Building-ML-Models:

Introduction:

The Iris flower data set, also known as Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper "The Use of Multiple Measurements in Taxonomic Problems" as an example of linear discriminant analysis.

Data Collection:

The dataset was downloaded from Kaggle (https://www.kaggle.com/uciml/iris). It contains three species: setosa, versicolor, and virginica, with 50 samples each. The features of each flower are sepal length, sepal width, petal length, and petal width, all measured in cm.

Data Processing:

  • Loaded the data.
import pandas as pd

# Load the Iris CSV dataset
iris_data = pd.read_csv('../data/iris.csv')

Exploratory Data Analysis (EDA) & Statistics:

  • Let’s group the data by species and do some descriptive statistics:
# Groupby Species for descriptive statistics
iris_data.groupby('species').describe().T

image

  • The count row shows that there are 50 samples for each species.

  • Setosa

    • Average sepal length is about 5 cm
    • Average sepal width is about 3 cm
    • Average petal length is about 1.5 cm
    • Average petal width is about 0.25 cm
  • Versicolor

    • Average sepal length is about 6 cm
    • Average sepal width is about 2.8 cm
    • Average petal length is about 4.26 cm
    • Average petal width is about 1.32 cm
  • Virginica

    • Average sepal length is about 6.6 cm
    • Average sepal width is about 3 cm
    • Average petal length is about 6 cm
    • Average petal width is about 2 cm
  • From the above information,

    • Petal length separates the species well: Setosa (1.5 cm), Versicolor (4.26 cm), and Virginica (6 cm).
    • Petal width clearly separates Setosa (0.25 cm) from Versicolor (1.32 cm) and Virginica (2 cm).
    • Sepal width looks similar for all three species: Setosa (3 cm), Versicolor (2.8 cm), and Virginica (3 cm).
    • Sepal length differs only slightly between the three species (5 cm, 6 cm, and 6.6 cm).
    • Since sepal width looks similar for all the species, that feature is a candidate for dropping.
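The observation that petal length alone separates the species can be illustrated with a simple threshold rule. This is only a sketch: the cut points are assumed to sit halfway between the average petal lengths quoted above, they are not fitted to the data, and the function names are made up for this illustration.

```python
# Average petal lengths (cm) per species, taken from the summary above
MEAN_PETAL_LENGTH = {"setosa": 1.5, "versicolor": 4.26, "virginica": 6.0}


def midpoints(means):
    """Return species sorted by mean, plus cut points halfway between neighbours."""
    ordered = sorted(means.items(), key=lambda item: item[1])
    labels = [name for name, _ in ordered]
    cuts = [(a[1] + b[1]) / 2 for a, b in zip(ordered, ordered[1:])]
    return labels, cuts


def classify_by_petal_length(length_cm, means=MEAN_PETAL_LENGTH):
    """Assign the first species whose upper cut point exceeds the measurement."""
    labels, cuts = midpoints(means)
    for label, cut in zip(labels, cuts):
        if length_cm < cut:
            return label
    return labels[-1]  # longer than every cut point


for length in (1.4, 4.5, 5.9):
    print(length, "->", classify_by_petal_length(length))
```

A rule like this will misclassify flowers near the versicolor/virginica boundary, which is one reason to fit a proper model.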

Key Visualizations:

  • Boxplot: visually compares the distributions of sepal length, sepal width, petal length, and petal width through their quartiles.
  • Pairplot: shows relationships between variables across multiple dimensions.
  • Swarm plot
  • Violin plot

Feature Observations:

image

Splitting the data into training and testing dataset:

from sklearn.model_selection import train_test_split

# Split the dataset into 70% training and 30% testing
train, test = train_test_split(iris_data, test_size=0.3)
print(train.shape)
print(test.shape)
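The split above has no fixed seed, so the rows that land in each part change from run to run. The sketch below shows one way to make a 70/30 split repeatable and keep the species balanced. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for `iris.csv`; the `random_state` and `stratify` arguments are additions for illustration and are not part of the original split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for ../data/iris.csv
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # same 70/30 ratio as above
    random_state=0,   # fixed seed so the split is repeatable
    stratify=y,       # keep the three species balanced in both parts
)

print(X_train.shape, X_test.shape)  # 105 training rows, 45 test rows
print(np.bincount(y_test))          # test samples per species
```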

Feature Selection: Use petals and sepals as features:

Training and testing data for petals and sepals:

petal = iris_data[['petal_length','petal_width','species']]
sepal = iris_data[['sepal_length','sepal_width','species']]

# Iris petals: 70/30 split with a fixed seed
train_p, test_p = train_test_split(petal, test_size=0.3, random_state=0)
train_x_p = train_p[['petal_length','petal_width']]
train_y_p = train_p.species

test_x_p = test_p[['petal_length','petal_width']]
test_y_p = test_p.species

# Iris sepals: 70/30 split with the same seed
train_s, test_s = train_test_split(sepal, test_size=0.3, random_state=0)
train_x_s = train_s[['sepal_length','sepal_width']]
train_y_s = train_s.species

test_x_s = test_s[['sepal_length','sepal_width']]
test_y_s = test_s.species

Logistic Regression:

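from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit on the petal features and score on the held-out petal test set
model = LogisticRegression()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Logistic Regression using Petals is:', metrics.accuracy_score(test_y_p, prediction))

# Refit the same estimator on the sepal features
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Logistic Regression using Sepals is:', metrics.accuracy_score(test_y_s, prediction))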
  • The accuracy of the Logistic Regression using Petals is: 0.9777777777777777
  • The accuracy of the Logistic Regression using Sepals is: 0.8222222222222222
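To make these scores concrete: accuracy is the fraction of test samples predicted correctly. With a 30% test split of 150 rows (45 samples), 0.9777... corresponds to 44 of 45 correct and 0.8222... to 37 of 45. The sketch below recomputes the metric by hand; the labels are hypothetical and exist only to reproduce the "one mistake in 45" case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical test set: 45 samples with exactly one wrong prediction
y_true = ["setosa"] * 15 + ["versicolor"] * 15 + ["virginica"] * 15
y_pred = list(y_true)
y_pred[20] = "virginica"  # one versicolor sample predicted as virginica

score = accuracy(y_true, y_pred)
errors = round((1 - score) * len(y_true))
print(score, errors)
```

With only 45 test samples, each misclassified flower moves the score by about 0.022, so small differences between models should be read with caution.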

Decision Tree:

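from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Fit on the petal features and score on the held-out petal test set
model = DecisionTreeClassifier()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Decision Tree using Petals is:', metrics.accuracy_score(test_y_p, prediction))

# Refit the same estimator on the sepal features
model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Decision Tree using Sepals is:', metrics.accuracy_score(test_y_s, prediction))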
  • The accuracy of the Decision Tree using Petals is: 0.9555555555555556
  • The accuracy of the Decision Tree using Sepals is: 0.6444444444444445
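A single 45-sample test set gives a noisy estimate, and a `DecisionTreeClassifier` without a fixed `random_state` can produce slightly different trees between runs. As a rough cross-check, the sketch below compares petal and sepal features with 5-fold cross-validation. It assumes scikit-learn's bundled Iris data in place of `iris.csv`, and the seed and fold count are arbitrary choices for illustration, not part of the original experiment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Iris data stands in for ../data/iris.csv.
# Column order: sepal length, sepal width, petal length, petal width.
X, y = load_iris(return_X_y=True)
X_sepal, X_petal = X[:, :2], X[:, 2:]

tree = DecisionTreeClassifier(random_state=0)  # fixed seed for repeatable trees

# cross_val_score clones the estimator for each fold, so `tree` is reused safely
petal_scores = cross_val_score(tree, X_petal, y, cv=5)
sepal_scores = cross_val_score(tree, X_sepal, y, cv=5)

print("petal mean accuracy:", petal_scores.mean())
print("sepal mean accuracy:", sepal_scores.mean())
```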

Conclusion:

  • Both models reached higher test accuracy with petal features than with sepal features (Logistic Regression: 0.978 vs 0.822; Decision Tree: 0.956 vs 0.644).
  • This is consistent with the heatmap, which shows a higher correlation between petal length and petal width than between sepal length and sepal width.
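The correlation contrast can be sketched with a hand-written Pearson coefficient. The nine rows below are the first three samples of each species from the commonly distributed version of the Iris data; they are an illustrative subset, so the values will differ from the full-dataset heatmap.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (sepal_length, sepal_width, petal_length, petal_width) in cm
rows = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2), (4.7, 3.2, 1.3, 0.2),  # setosa
    (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5), (6.9, 3.1, 4.9, 1.5),  # versicolor
    (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9), (7.1, 3.0, 5.9, 2.1),  # virginica
]
sepal_length, sepal_width, petal_length, petal_width = zip(*rows)

r_petal = pearson(petal_length, petal_width)
r_sepal = pearson(sepal_length, sepal_width)
print("petal length vs width:", round(r_petal, 3))
print("sepal length vs width:", round(r_sepal, 3))
```

On this subset the petal measurements move together almost linearly, while the sepal measurements show only a weak relationship.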
