This project classifies grape plant diseases using various machine learning classification algorithms. Grape plants are susceptible to a variety of diseases. The classes covered in this project are:
- Black rot
- Black Measles (esca)
- Powdery mildew
- Leaf blight
- Healthy
The machine learning classification models in this project include:
- Random forest classification
- Support vector machine classification
- CNN - VGG16
- CNN - Custom
- Ensemble model - Majority voting
- Ensemble model - Stacked prediction
- Clone the project.
- Install the required packages.
- Download the data set.
- Run the project.
Clone the project from GitHub.
Change into the Grape-Disease-Classification directory.
Install packages
To reproduce the code, install the required packages.
Either install the packages listed in the requirements.txt file manually, or use the command:
pip install -r requirements.txt
Alternatively, install the packages using the setup.py file:
python setup.py install
Adding the --user option directs setup.py to install the package in the user site-packages directory for the running Python. Alternatively, you can use the --home or --prefix option to install the package in a different location (where you have the necessary permissions).
Download the required data set.
The data set used in this project is available here. It includes images from the Kaggle grape disease data set together with images collected online and labelled using the LabelMe tool.
Either download the zip file and extract the files into the data/raw folder, or run the command below:
./wgetgdrive.sh <drive_id> <zip_name>.zip
where drive_id is 1gsUyWEkxz9H1-yn2ONx4scHg88kWU-38 and zip_name can be any name for the zip file.
Run the project.
See the Documentation for the code section below for further details.
Preprocessing
This folder contains:
Code to load the images and the JSON files (which contain the labelling information). The code is in preprocessing/001_load_data.py. To execute it, run the command below from within the preprocessing folder:
python 001_load_data.py
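As a rough sketch of what this loading step does, the function below pairs each raw image with the class label stored in its LabelMe JSON annotation. The directory layout, the function name, and the .jpg extension are assumptions for illustration; only the "shapes"/"label" keys come from the LabelMe annotation format.

```python
import json
from pathlib import Path

def load_labels(raw_dir="data/raw"):
    """Pair each image with the label stored in its LabelMe JSON file.

    Assumes each image (e.g. img1.jpg) sits next to an annotation file
    with the same stem (img1.json), as LabelMe produces by default.
    """
    samples = []
    for json_path in Path(raw_dir).glob("*.json"):
        with open(json_path) as f:
            meta = json.load(f)
        label = meta["shapes"][0]["label"]        # class of the first annotated shape
        image_path = json_path.with_suffix(".jpg")  # sibling image file
        samples.append((image_path, label))
    return samples
```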
Code to augment the data. The code is in preprocessing/002_data_augmentation.py. To execute it, run the command below:
python 002_data_augmentation.py
The data augmentation techniques used are:
- Horizontal flip
- Vertical flip
- Random rotation
- Intensity scaling
- Gamma correction
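The five techniques above can be sketched with plain NumPy as follows. The exact parameters (rotation choice, scaling and gamma ranges) are assumptions, not values taken from the project code; in particular, random rotation is simplified here to random 90-degree multiples.

```python
import numpy as np

def augment(image, rng=None):
    """Return augmented copies of an image (H x W x C float array in [0, 1]).

    Illustrative parameters only; the project's actual augmentation
    settings may differ.
    """
    rng = rng or np.random.default_rng()
    return {
        "hflip": image[:, ::-1],                          # horizontal flip
        "vflip": image[::-1, :],                          # vertical flip
        "rot90": np.rot90(image, k=rng.integers(1, 4)),   # random rotation (90-degree steps)
        "scaled": np.clip(image * rng.uniform(0.7, 1.3), 0.0, 1.0),  # intensity scaling
        "gamma": image ** rng.uniform(0.8, 1.2),          # gamma correction
    }
```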
Code to extract Histogram of Oriented Gradients (HOG) feature descriptors. The feature descriptors are used to train only the random forest and SVM models. The code is in preprocessing/003_hog.py:
python 003_hog.py
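To give a feel for what a HOG-style descriptor captures, the function below computes a single global histogram of gradient orientations, weighted by gradient magnitude. This is a deliberately simplified stand-in for the project's 003_hog.py, which is not shown here; real HOG additionally pools over cells and normalises over blocks.

```python
import numpy as np

def orientation_histogram(gray, n_bins=9):
    """Global histogram of unsigned gradient orientations of a grayscale image.

    A simplified illustration of the HOG idea, not the project's
    actual extraction code.
    """
    gy, gx = np.gradient(gray.astype(float))      # per-pixel gradients
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation in [0, 180)
    hist, _ = np.histogram(angle, bins=n_bins, range=(0, 180), weights=magnitude)
    return hist / (hist.sum() + 1e-9)             # normalise to sum to 1
```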
Models
This folder contains the various models used in this project, namely:
- Random forest
- Support vector machine
- CNN - VGG16
- CNN - Custom
- Ensemble model - Majority voting
- Ensemble model - Stacked prediction
The ensemble models aggregate the random forest, SVM, CNN-Custom, and CNN-VGG16 models. In the majority voting technique, the output prediction is the class that receives more than half of the votes, or the maximum number of votes. If no class receives more than half of the votes, or if there is a tie, the ensemble cannot make a stable prediction for that instance; in that case, the prediction of the model with the highest accuracy is taken as the final output. In the stacked prediction technique, a network is trained on the array of probabilities from all four models.
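The majority voting rule described above, including the fall-back to the most accurate model on a tie, can be sketched as follows. The function signature and argument names are illustrative, not the project's actual interface.

```python
from collections import Counter

def majority_vote(predictions, accuracies):
    """Combine one class prediction per model by majority vote.

    predictions: list of class labels, one per base model.
    accuracies:  matching list of each model's validation accuracy.
    On a tie for the most votes, fall back to the prediction of
    the most accurate model, as described above.
    """
    top = Counter(predictions).most_common()
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]  # unique winner (majority or plurality)
    # Tie: use the prediction of the model with the highest accuracy.
    best = max(range(len(predictions)), key=lambda i: accuracies[i])
    return predictions[best]
```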
The models can be trained by running the command below from within the models folder:
python <model_name>.py
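For the stacked prediction ensemble, each base model's class-probability outputs are concatenated into one feature vector per sample and a meta-model is fitted on them. The source describes training "a network" on these probabilities; the sketch below substitutes a simple least-squares linear meta-model purely for illustration, and all names here are assumptions.

```python
import numpy as np

def train_stacker(prob_arrays, labels, n_classes):
    """Fit a linear meta-model on concatenated base-model probabilities.

    prob_arrays: list of (n_samples, n_classes) probability arrays,
                 one per base model. Returns a predict function.
    A linear least-squares fit on one-hot targets stands in for the
    network mentioned in the text.
    """
    X = np.concatenate(prob_arrays, axis=1)        # (n_samples, n_models * n_classes)
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # bias column
    Y = np.eye(n_classes)[labels]                  # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)

    def predict(prob_arrays):
        Z = np.concatenate(prob_arrays, axis=1)
        Z = np.hstack([Z, np.ones((Z.shape[0], 1))])
        return (Z @ W).argmax(axis=1)

    return predict
```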
visualization.py
This file contains all the visualization techniques used in this project:
- Confusion matrix, using a seaborn heat map modified to display details within each box.
- Loss and accuracy curves for the neural networks.
- Tree representation for the random forest.
- ROC-AUC curves using Yellowbrick.
Usage is as follows:
python visualization.py -m <model_name> -t <one_visualization_technique>
For help on the available models and visualization techniques:
python visualization.py --help
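A command-line interface of this shape could be built with argparse as sketched below. The -m and -t flags come from the usage line above; the long option names and help strings are assumptions.

```python
import argparse

def parse_args(argv=None):
    """Parse the -m/-t flags shown in the usage line above."""
    parser = argparse.ArgumentParser(description="Visualize a trained model")
    parser.add_argument("-m", "--model", required=True,
                        help="model name (see --help in the project for the real list)")
    parser.add_argument("-t", "--technique", required=True,
                        help="one visualization technique, e.g. a confusion matrix")
    return parser.parse_args(argv)
```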
app.py
This file predicts the disease of an input image. Usage is as follows:
python app.py -m <model_name> -i <test_image_index>
For help on usage:
python app.py --help
Below are the results obtained on the test set for the various models trained in this project.
NOTE
The results obtained are system specific. Because of differing combinations of cuDNN library versions and NVIDIA driver versions, the results can vary slightly. To the best of my knowledge, upon reproducing the environment, the numbers will be close to the results reported here.
| Models | Accuracy (%) |
|---|---|
| Random forest | 75.35 |
| SVM | 82.89 |
| CNN - VGG16 | 93.62 |
| Ensemble - Majority voting | 98.05 |
| Ensemble - Stacked prediction | 98.23 |
| CNN - Custom | 98.76 |