Analyze and Build a machine learning (ML) model on the Iris Flower dataset

murtadapy/iris-flower-classification


Iris Flower Classification 🌸


1. Project Overview 💡

In this project, we analyze the Iris flower dataset, which covers three species: Setosa, Versicolor and Virginica. Each species has 50 records in the dataset. The main goal is to build a classification model that uses the length and width of the sepal and petal to categorize new flowers.

image

2. Problem Statement 📌

Identifying Iris flowers by eye is difficult, especially for non-experts, but machine learning algorithms can classify a flower with high accuracy. This is a classification problem in which the model determines whether a flower is Setosa, Versicolor, or Virginica. In this project, we use Logistic Regression from the scikit-learn library.

3. Metrics 🧮

For evaluation, we use the accuracy score to get an overview of model performance: the number of correctly classified instances divided by the total number of instances. We chose accuracy over other metrics because we want to know how the model performs overall and are less concerned with per-class sensitivity or specificity in this setting.

image
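As a small illustration of the definition above, accuracy can be computed by hand in a few lines. This is only a sketch: the label lists are made up for the example, and the project itself relies on scikit-learn's accuracy score, which returns the same fraction.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels, for illustration only
y_true = ["Setosa", "Versicolor", "Virginica", "Virginica", "Setosa"]
y_pred = ["Setosa", "Versicolor", "Versicolor", "Virginica", "Setosa"]

print(accuracy(y_true, y_pred))  # 4 of the 5 predictions are correct
```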

4. The Iris Flower Dataset 🌸

A. Dataset Source 📋

The Iris flower dataset was downloaded from Kaggle as a comma-separated values (CSV) file. It contains 150 records with 5 attributes: Petal Length, Petal Width, Sepal Length, Sepal Width and Class (Species).

B. Data Exploration and Data Visualization 🔎

The data exploration and visualization were done in /data/process_data.ipynb. Here are some of the findings:

image

As seen above, each flower class has 50 records in the dataset.

image

As shown above, sepals range from 4.3cm to 7.9cm in length and from 2.0cm to 4.4cm in width, while petals range from 1.0cm to 6.9cm in length and from 0.1cm to 2.5cm in width.

The chart also shows that Virginica has the longest sepals, reaching 7.9cm, whereas Setosa sepal length ranges from 4.3cm to 5.8cm. On the other hand, Setosa has the widest sepals at 4.4cm, and Virginica has the largest petal length and width.
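These figures can be reproduced with a few lines of pandas. The sketch below assumes the copy of the Iris dataset bundled with scikit-learn rather than the Kaggle CSV used in the notebook, so its column names and lower-case species names may differ from the project's file.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle CSV
iris = load_iris(as_frame=True)
df = iris.frame.copy()  # four measurement columns plus an integer "target"
names = [str(name) for name in iris.target_names]
df["species"] = df["target"].map(dict(enumerate(names)))

# Number of records per class
print(df["species"].value_counts())

# Overall minimum and maximum of each measurement, in cm
features = df[iris.feature_names]
print(features.agg(["min", "max"]))

# Range of sepal length within each species
sepal_length_by_species = df.groupby("species")["sepal length (cm)"].agg(["min", "max"])
print(sepal_length_by_species)
```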

5. Methodology 📜

The machine learning model was trained on the Iris flower dataset using the scikit-learn Python library. We chose Logistic Regression, which suits this problem because scikit-learn's implementation handles more than two classes, either with a multinomial formulation or with one-vs-rest, depending on the solver. Model accuracy was computed with the accuracy score metric.

A. Data Preprocessing 🗃️

The data preprocessing was done in /data/process_data.ipynb using the pandas library. It consisted of a single step: encoding the flower classes (Setosa, Versicolor and Virginica) as integers with LabelEncoder from scikit-learn, which assigns the values 0, 1 and 2. Encoding matters because numeric labels are simpler to store and process than strings.
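The encoding step can be sketched as follows. The label list is made up for the example. Note that scikit-learn's LabelEncoder numbers the classes from 0 in sorted order, so the three species map to 0, 1 and 2.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of species labels, for illustration only
species = ["Setosa", "Versicolor", "Virginica", "Setosa", "Virginica"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

print(list(encoded))           # one integer code per input label
print(list(encoder.classes_))  # sorted class names; position equals code

# The mapping is reversible, which helps when reporting predictions
decoded = encoder.inverse_transform([2, 0])
print(list(decoded))
```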

B. Implementation 📋

The algorithms and techniques were implemented with the scikit-learn library. The procedure consists of five phases:

  • Loading the data from the database as a pandas DataFrame
  • Splitting the dataset into train and test sets using the train_test_split function
  • Building and training the logistic regression model
  • Evaluating the model using the accuracy score
  • Saving the model as a pickle file
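The five phases above can be sketched roughly as follows. This is not the project's train_classifier.py. It assumes scikit-learn's bundled Iris data in place of the SQLite database, and the split ratio, random seed and max_iter value are illustrative choices.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load the data as a DataFrame
#    (assumption: bundled dataset instead of the project's SQLite database)
df = load_iris(as_frame=True).frame
X = df.drop(columns="target")
y = df["target"]

# 2. Split into train and test sets (ratio and seed are hypothetical)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 3. Build and train the logistic regression model
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

# 4. Evaluate with the accuracy score
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")

# 5. Serialize with pickle; kept in memory here, whereas the project
#    writes the bytes to a .pkl file
model_bytes = pickle.dumps(model)
restored = pickle.loads(model_bytes)
```

Stratifying the split keeps the three classes balanced in both subsets, which matters with only 150 records.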

C. Refinement 📡

In this project, we used GridSearchCV, which performs an exhaustive search over specified parameter values for an estimator. The following hyperparameters were given to the grid search:

 parameters = {
     'C': [0.1, 1, 10, 100],
     'penalty': ['l1', 'l2', 'elasticnet'],
     'solver': ['lbfgs', 'liblinear'],
     'max_iter': [100, 500]
 }
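A grid search along these lines might look like the sketch below. It assumes scikit-learn's bundled Iris data and deliberately uses a reduced grid: in scikit-learn's LogisticRegression the lbfgs solver does not support the 'l1' penalty, and 'elasticnet' is only available with the saga solver, so only the C and max_iter values are searched here with the default 'l2' penalty.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumption: bundled dataset instead of the project's database
X, y = load_iris(return_X_y=True)

# Reduced grid (illustrative): values that are valid for the lbfgs solver
parameters = {
    "C": [0.1, 1, 10, 100],
    "max_iter": [100, 500],
}

search = GridSearchCV(
    LogisticRegression(solver="lbfgs"),
    param_grid=parameters,
    cv=5,                # five-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

To search several solvers with their own compatible penalties, param_grid also accepts a list of dictionaries, one per solver.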

6. Results

A. Model Evaluation and Validation 🪄

The model was evaluated with the accuracy score. GridSearchCV used five-fold cross-validation to search for the best model over the given parameters, and it identified the following optimal hyperparameters, with which the model achieved a 96% accuracy score:

Best parameters: {'C': 10, 'max_iter': 100, 'penalty': 'l2', 'solver': 'lbfgs'}

B. Justification 🖊️

In this project, grid search was the only tuning strategy used, and the model reached high accuracy with the best parameters.

7. Flask Web App

The Flask web app lets the user make predictions on new flowers with the trained model and find their species easily.

image
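A prediction endpoint of this kind could be sketched as below. This is not the project's app.py: the /predict route, the JSON field names and the inline-trained stand-in model are all assumptions made for illustration, whereas the real app serves an HTML form and would load the pickled model instead.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in model (assumption): the project would unpickle its saved model here
iris = load_iris()
model = LogisticRegression(max_iter=500).fit(iris.data, iris.target)
SPECIES = [str(name) for name in iris.target_names]

# Hypothetical field names for the four measurements, in training order
FIELDS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical route
def predict():
    payload = request.get_json(silent=True) or {}
    try:
        features = [[float(payload[field]) for field in FIELDS]]
    except (KeyError, TypeError, ValueError):
        return jsonify({"error": f"expected numeric fields: {FIELDS}"}), 400
    label = int(model.predict(features)[0])
    return jsonify({"species": SPECIES[label]})


# To serve it locally on the port mentioned in this README:
# app.run(host="0.0.0.0", port=3001)
```

Validating the input before calling the model keeps malformed requests from surfacing as server errors.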

8. Files Structure

├── app #Website folder
│   ├── app.py #Responsible for running the website
│   ├── templates
│   │   └── index.html #Allows the user to input and predict new flower properties
│   └── Static
│       └── index.css #The Cascading Style Sheets for index.html
│
├── data
│   ├── dataset.csv #The Iris flower dataset
│   ├── dataset.db #The prepared dataset as SQLite database
│   └── process_data.py #Responsible for dataset preparation
│
├── models
│   ├── model.pkl #The Logistic Regression Model
│   └── train_classifier.py #Responsible for creating the machine learning model
│
├── images #This folder contains all images for the readme file
│   └── flower.jpg
│
└── README.md #Readme file

9. Requirements 📑

To run this project, you need Python 3 installed on your machine, along with all the libraries listed in requirments.txt. Run the following command to install them:

pip3 install -r requirments.txt

10. Running Process ⏯️

This section explains how to run each part of the project from the command prompt or terminal.

A. Process Data 🔨

To look at the data exploration and data visualization, please open /data/process_data.ipynb with Jupyter Notebook.

B. Training the classifier ⚙️

To re-train the classifier, you must go inside the models directory using the terminal or the command prompt and run the following:

python3 train_classifier.py ../data/<database_name>.db <model_name>.pkl
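For readers curious how a script can accept these two positional arguments, here is a minimal sketch using argparse. The argument names and the example paths are hypothetical, and the actual train_classifier.py may read its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the database path and the output model path (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description="Train the Iris classifier and save it as a pickle file."
    )
    parser.add_argument("database_path", help="path to the SQLite database (.db)")
    parser.add_argument("model_path", help="where to write the pickled model (.pkl)")
    return parser.parse_args(argv)


# Example invocation with made-up file names
args = parse_args(["../data/example.db", "example.pkl"])
print(args.database_path, args.model_path)
```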

C. Run the Flask Web App

To run the web app, you must go inside the app directory using the terminal or the command prompt and run the following:

python3 app.py

The web app will then be available at 0.0.0.0:3001

11. Conclusion 👋

In conclusion, classifying iris flower species can be challenging, especially for non-experts, but machine learning algorithms make it much easier to determine the flower class. This project built a simple but effective machine learning model based on the logistic regression algorithm from the scikit-learn Python library. We also used grid search to select the best model among the hyperparameter combinations we tried.

12. Improvements 🆙

We are proud of our solution because it achieved such high accuracy, but there is always room for improvement. In the future, we can attempt to create a deep learning model using neural networks, which may yield even better and more accurate results. You are also welcome to fork this repository and try to enhance the solution on your own.

13. Acknowledgements ❤️

I would like to express my appreciation to Misk Academy and Udacity for their amazing work on the data science course and for the support they gave us in building this project.